RE: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT 
Thanks for the detailed reply.
I see now that the spin_lock in the code I referred to would, as you mentioned, introduce a deadlock.
In fact, during the 48-hour long run one VM hung, and the xm list command showed
its CPU time was quite high, in the tens of thousands, while the other VMs worked fine. I don't know whether
this is related to the potential deadlock, since Xen itself still worked.
 
So a quick question: if we replace the spin_lock with spin_lock_recursive, could we avoid this deadlock?
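
To make the question concrete, here is a minimal sketch of the change I mean, based on the snippet quoted below. It assumes spin_lock_recursive()/spin_unlock_recursive() may be used on page_alloc_lock here; treat it as a proposal, not a tested patch:

    /* We can race domain destruction (domain_relinquish_resources). */
    if ( unlikely(pg_owner != d) )
    {
        int drop_ref;

        /*
         * Proposed change: take the lock recursively, so the nested
         * acquisition of pg_owner->page_alloc_lock inside
         * put_page_and_type() -> put_page() -> free_domheap_pages()
         * no longer deadlocks against this outer acquisition.
         */
        spin_lock_recursive(&pg_owner->page_alloc_lock);
        drop_ref = (pg_owner->is_dying &&
                    test_and_clear_bit(_PGT_pinned,
                                       &page->u.inuse.type_info));
        if ( drop_ref )
            put_page_and_type(page);  /* now inside the locked region */
        spin_unlock_recursive(&pg_owner->page_alloc_lock);
    }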
 
The if statement was executed during the test: I happened to put a log statement there and saw it in the output log.
As a matter of fact, the HVMs under my test (all Windows 2003) all have PV drivers installed. I think that's
why the patch takes effect.
 
Besides, I have been working on this issue for some time; a build mistake is unlikely,
since I have been careful all along.
 
Anyway, I plan to kick off two reproduction runs on two physical servers: one with this patch applied (using spin_lock_recursive
instead of spin_lock) and the other with no change, on completely clean code. It would be useful if you have some
tracing to be added to the test (a sketch of the kind of trace I mean follows below). I will keep you informed.
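
For instance, the earlier log I mentioned was simply a printk-style message in the if() branch; something along these lines, where the exact message and placement are my own and hypothetical:

    if ( unlikely(pg_owner != d) )
    {
        /* Hypothetical trace: record who touches another domain's page. */
        gdprintk(XENLOG_INFO,
                 "pin/unpin on foreign page: d%d -> d%d, mfn=%lx, is_dying=%d\n",
                 d->domain_id, pg_owner->domain_id, mfn,
                 (int)pg_owner->is_dying);
        /* ... existing drop_ref/locking logic unchanged ... */
    }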
 
In addition, my kernel is
 2.6.31.13-pvops-patch #1 SMP Tue Aug 24 11:23:51 CST 2010 x86_64 x86_64 x86_64 GNU/Linux
 Xen is
 4.0.0
 
 Thanks.
 
 
 
 
 > Date: Thu, 26 Aug 2010 08:39:03 +0100
 > Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
 > From: keir.fraser@xxxxxxxxxxxxx
 > To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
 >
 > On 26/08/2010 05:49, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
 >
 > > Hi:
 > >
> > This issue can be easily reproduced by continuously and almost concurrently
> > rebooting 12 Xen HVM VMs on a single physical server. The reproduction hits the back
> > trace about 6 to 14 hours after it starts. I have several similar Xen back
> > traces; please refer to the end of the mail. The first three back traces are
> > almost the same, they happened in domain_kill, while the last backtrace
> > happened in do_multicall.
 > >
> > Going through the Xen code in /xen-4.0.0/xen/arch/x86/mm.c, it appears
> > the author was aware of the race between
> > domain_relinquish_resources and the code presented below. It occurred to me to simply move
> > lines 2765 and 2766 before 2764, that is, to move put_page_and_type(page) inside the
> > spin_lock to avoid the race.
 >
 > Well, thanks for the detailed bug report: it is good to have a report that
 > includes an attempt at a fix!
 >
 > In the below code, the put_page_and_type() is outside the locked region for
 > good reason. Put_page_and_type() -> put_page() -> free_domheap_pages() which
 > acquires d->page_alloc_lock. Because we do not use spin_lock_recursive() in
 > the below code, this recursive acquisition of the lock in
 > free_domheap_pages() would deadlock!
 >
 > Now, I do not think this fix really affected your testing anyway, because
 > the below code is part of the MMUEXT_PIN_... hypercalls, and further is only
> triggered when a domain executes one of those hypercalls on *another*
 > domain's memory. The *only* time that should happen is when dom0 builds a
 > *PV* VM. So since all your testing is on HVM guests I wouldn't expect the
 > code in the if() statement below to be executed ever. Well, maybe unless you
 > are using qemu stub domains, or pvgrub.
 >
 > But even if the below code is being executed, I don't think your change is a
 > fix, or anything that should greatly affect the system apart from
 > introducing a deadlock. Is it instead possible that you somehow were testing
 > a broken build of Xen before, and simply re-building Xen with your change is
 > what fixed things? I wonder if the bug stays gone away if you revert your
 > change and re-build?
 >
 > If it still appears that your fix is good, I would add tracing to the below
 > code and find out a bit more about when/why it is being executed.
 >
 > -- Keir
 >
> > 2753         /* A page is dirtied when its pin status is set. */
> > 2754         paging_mark_dirty(pg_owner, mfn);
> > 2755
> > 2756         /* We can race domain destruction (domain_relinquish_resources). */
> > 2757         if ( unlikely(pg_owner != d) )
> > 2758         {
> > 2759             int drop_ref;
> > 2760             spin_lock(&pg_owner->page_alloc_lock);
> > 2761             drop_ref = (pg_owner->is_dying &&
> > 2762                         test_and_clear_bit(_PGT_pinned,
> > 2763                                            &page->u.inuse.type_info));
> > 2764             spin_unlock(&pg_owner->page_alloc_lock);
> > 2765             if ( drop_ref )
> > 2766                 put_page_and_type(page);
> > 2767         }
> > 2768
> > 2769         break;
> > 2770     }
 > >
> > From the result of the reproduction on the patched code, the patch appears to
> > work well, since the test survived a 48-hour long run. But I am
> > not sure of the side effects it brings.
> > I would appreciate it if someone could give more clues, thanks.
 > >
 > > =============Trace 1: =============
 > >
 > > (XEN) ----[ Xen-4.0.0 x86_64 debug=y Not tainted ]----
 > > (XEN) CPU: 0
 > > (XEN) RIP: e008:[<ffff82c48011617c>] free_heap_pages+0x55a/0x575
 > > (XEN) RFLAGS: 0000000000010286 CONTEXT: hypervisor
 > > (XEN) rax: 0000001fffffffe0 rbx: ffff82f60b8bbfc0 rcx: ffff83063fe01a20
 > > (XEN) rdx: ffff8315ffffffe0 rsi: ffff8315ffffffe0 rdi: 00000000ffffffff
 > > (XEN) rbp: ffff82c48037fc98 rsp: ffff82c48037fc58 r8: 0000000000000000
 > > (XEN) r9: ffffffffffffffff r10: ffff82c48020e770 r11: 0000000000000282
 > > (XEN) r12: 00007d0a00000000 r13: 0000000000000000 r14: ffff82f60b8bbfe0
 > > (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
> > (XEN) cr3: 0000000232914000 cr2: ffff8315ffffffe4
 > > (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
 > > (XEN) Xen stack trace from rsp=ffff82c48037fc58:
 > > (XEN) 0000000000000016 0000000000000000 00000000000001a2 ffff8304afc40000
 > > (XEN) 0000000000000000 ffff82f60b8bbfe0 00000000000330fe ffff82f60b8bc000
 > > (XEN) ffff82c48037fcd8 ffff82c48011647e 0000000100000000 ffff82f60b8bbfe0
 > > (XEN) ffff8304afc40020 0000000000000000 ffff8304afc40000 0000000000000000
 > > (XEN) ffff82c48037fcf8 ffff82c480160caf ffff8304afc40000 ffff82f60b8bbfe0
 > > (XEN) ffff82c48037fd68 ffff82c48014deaf 0000000000000ca3 ffff8304afc40fd8
 > > (XEN) ffff8304afc40fd8 ffff8304afc40fd8 4000000000000000 ffff82c48037ff28
 > > (XEN) 0000000000000000 ffff8304afc40000 ffff8304afc40000 000000000099e000
 > > (XEN) 00000000ffffffda 0000000000000001 ffff82c48037fd98 ffff82c4801504de
> > (XEN) ffff8304afc40000 0000000000000000 000000000099e000 00000000ffffffda
 > > (XEN) ffff82c48037fdb8 ffff82c4801062ee 000000000099e000 fffffffffffffff3
 > > (XEN) ffff82c48037ff08 ffff82c480104cd7 ffff82c40000f800 0000000000000286
 > > (XEN) 0000000000000286 ffff8300bf76c000 000000ea864b1814 ffff8300bf76c030
 > > (XEN) ffff83023ff1ded8 ffff83023ff1ded0 ffff82c48037fe38 ffff82c48011c9f5
 > > (XEN) ffff82c48037ff08 ffff82c480272100 ffff8300bf76c000 ffff82c48037fe48
 > > (XEN) ffff82c48011f557 ffff82c480272100 0000000600000002 000000004700000a
 > > (XEN) 000000004700bf2c 0000000000000000 000000004700c158 0000000000000000
 > > (XEN) 00002b3b59e7d050 0000000000000000 0000007f00b14140 00002b3b5f257a80
 > > (XEN) 0000000000996380 00002aaaaaad0830 00002b3b5f257a80 00000000009bb690
 > > (XEN) 00002aaaaaad0830 000000398905abf3 000000000078de60 00002b3b5f257aa4
 > > (XEN) Xen call trace:
> > (XEN) [<ffff82c48011617c>] free_heap_pages+0x55a/0x575
 > > (XEN) [<ffff82c48011647e>] free_domheap_pages+0x2e7/0x3ab
 > > (XEN) [<ffff82c480160caf>] put_page+0x69/0x70
 > > (XEN) [<ffff82c48014deaf>] relinquish_memory+0x36e/0x499
 > > (XEN) [<ffff82c4801504de>] domain_relinquish_resources+0x1ac/0x24c
 > > (XEN) [<ffff82c4801062ee>] domain_kill+0x93/0xe4
 > > (XEN) [<ffff82c480104cd7>] do_domctl+0xa1c/0x1205
 > > (XEN) [<ffff82c4801f71bf>] syscall_enter+0xef/0x149
 > > (XEN)
 > > (XEN) Pagetable walk from ffff8315ffffffe4:
 > > (XEN) L4[0x106] = 00000000bf589027 5555555555555555
 > > (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
 > > (XEN)
 > > (XEN) ****************************************
 > > (XEN) Panic on CPU 0:
 > > (XEN) FATAL PAGE FAULT
 > > (XEN) [error_code=0002]
 > > (XEN) Faulting linear address: ffff8315ffffffe4
> > (XEN) ****************************************
 > > (XEN)
 > > (XEN) Manual reset required ('noreboot' specified)
 > >
 > > =============Trace 2: =============
 > >
 > > (XEN) Xen call trace:
 > > (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
 > > (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
 > > (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
 > > (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
 > > (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
 > > (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
 > > (XEN) [<ffff82c48010739b>] evtchn_set_pending+0xab/0x1b0
 > > (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
 > > (XEN)
 > > (XEN) Pagetable walk from ffff8315ffffffe4:
 > > (XEN) L4[0x106] = 00000000bf569027 5555555555555555
> > (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
 > > (XEN) stdvga.c:147:d60 entering stdvga and caching modes
 > > (XEN)
 > > (XEN) ****************************************
 > > (XEN) HVM60: VGABios $Id: vgabios.c,v 1.67 2008/01/27 09:44:12 vruppert Exp $
 > > (XEN) Panic on CPU 0:
 > > (XEN) FATAL PAGE FAULT
 > > (XEN) [error_code=0002]
 > > (XEN) Faulting linear address: ffff8315ffffffe4
 > > (XEN) ****************************************
 > > (XEN)
 > > (XEN) Manual reset required ('noreboot' specified)
 > >
 > > =============Trace 3: =============
 > >
 > >
 > > (XEN) Xen call trace:
 > > (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
 > > (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
 > > (XEN) [<ffff82c48014aa89>] relinquish_memory+0x169/0x500
> > (XEN) [<ffff82c48014b2cd>] domain_relinquish_resources+0x1ad/0x280
 > > (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
 > > (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
 > > (XEN) [<ffff82c480117804>] csched_acct+0x384/0x430
 > > (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
 > >
 >
 >
 
 _______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel