WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
 
   
 

xen-devel

Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT

To: MaoXiaoyun <tinnycloud@xxxxxxxxxxx>, xen devel <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
From: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Date: Tue, 31 Aug 2010 15:49:29 +0100
Cc: Jan Beulich <JBeulich@xxxxxxxxxx>
Delivery-date: Tue, 31 Aug 2010 07:50:16 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <BAY121-W50277AB5A3C8DF48290B8DDA8A0@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: ActJE2sb2mrVWFa4QKS3QBTx+9N9SgACEuxC
Thread-topic: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
User-agent: Microsoft-Entourage/12.26.0.100708
Do you have a line in Xen boot output that starts "PFN compression on bits"?
If so what does it say?

My suspicion is that Jan Beulich's patches to implement a consolidated page
array for sparse memory maps have broken the assumption in some Xen code
that:
 page_to_mfn(mfn_to_page(x)+y) == x+y, for all valid mfns x, and all y up to
some pretty big limit.
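
To make the concern concrete, here is a minimal sketch of how an index with
squeezed-out address bits stops being linear in the MFN. The hole position
(HOLE_SHIFT, HOLE_BITS) and the helper mfn_after_pointer_add() are made up
for illustration; only the idea of indexing the page array by a compressed
frame number comes from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of PFN compression (all constants are hypothetical).
 * MFN bits [HOLE_SHIFT, HOLE_SHIFT + HOLE_BITS) are assumed to be zero for
 * every valid frame, so they are squeezed out to form a dense index (pdx).
 */
#define HOLE_SHIFT  4u
#define HOLE_BITS   4u
#define BOTTOM_MASK ((UINT64_C(1) << HOLE_SHIFT) - 1)
#define HOLE_MASK   (((UINT64_C(1) << HOLE_BITS) - 1) << HOLE_SHIFT)

/* Remove the hole bits: keep the low bits, shift the high bits down. */
static uint64_t mfn_to_pdx(uint64_t mfn)
{
    assert((mfn & HOLE_MASK) == 0); /* only valid MFNs can be compressed */
    return (mfn & BOTTOM_MASK) | ((mfn >> HOLE_BITS) & ~BOTTOM_MASK);
}

/* Re-insert the hole bits (as zeroes) to recover the MFN. */
static uint64_t pdx_to_mfn(uint64_t pdx)
{
    return (pdx & BOTTOM_MASK) | ((pdx & ~BOTTOM_MASK) << HOLE_BITS);
}

/*
 * The page array is indexed by pdx, so "mfn_to_page(x) + y" is really
 * pdx(x) + y. This returns the MFN that such a pointer maps back to.
 */
static uint64_t mfn_after_pointer_add(uint64_t x, uint64_t y)
{
    return pdx_to_mfn(mfn_to_pdx(x) + y);
}
```

In this toy layout, mfn_after_pointer_add(0x03, 2) is 0x05 as expected, but
mfn_after_pointer_add(0x0F, 1) is 0x100 rather than 0x10: the pointer
arithmetic silently stepped across the hole.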

Looking in free_heap_pages() I see we do a whole bunch of chunk merging in
our buddy allocator, doing arithmetic on variable 'pg' to find neighbouring
chunks. It's a bit dodgy I suspect.
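
The merging arithmetic in question has roughly the following shape. This is
a simplified sketch in MFN space with hypothetical helper names (buddy_mfn,
merged_mfn), not the actual free_heap_pages() code, which applies the same
kind of offset to the 'pg' pointer.

```c
#include <assert.h>
#include <stdint.h>

/* MFN of the block that a free block of 2^order frames would merge with. */
static uint64_t buddy_mfn(uint64_t mfn, unsigned int order)
{
    uint64_t mask = UINT64_C(1) << order;

    assert((mfn & (mask - 1)) == 0); /* blocks are aligned to their size */

    /*
     * Bit 'order' set:   we are the upper half, the buddy is the predecessor.
     * Bit 'order' clear: we are the lower half, the buddy is the successor.
     */
    return (mfn & mask) ? mfn - mask : mfn + mask;
}

/* First MFN of the merged block of order + 1. */
static uint64_t merged_mfn(uint64_t mfn, unsigned int order)
{
    uint64_t buddy = buddy_mfn(mfn, order);

    return mfn < buddy ? mfn : buddy;
}
```

Stepping a page_info pointer by the same 1 << order only lands on the
buddy's page_info if the page array is linear in MFN across the whole
2^(order+1) range being merged.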

I'm cc'ing Jan to see what we can get away with in doing arithmetic on
page_info pointers. What's the guaranteed smallest aligned contiguous ranges
of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
page_info structs relate to adjacent MFNs)

If this is the problem I'm pretty sure we can come up with a patch quite
easily, but depending on the answer to my above question to Jan, we may need
to do some code auditing.

 -- Keir

On 31/08/2010 14:49, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:

> Hi Keir:
>  
>          Thank you for correcting my mistakes.
>          Here is the latest panic and its objdump.
>          I am not familiar with assembly language or how those registers are used.
>          I will try to spend some more time on understanding it.
>          What's your opinion?
>          btw, the memtest is still running, so far so good, thanks.
>  
> ------------------objdump-----------------------------------------------------
> -------------------
> ffff82c480115396:  48 c1 e1 04           shl    $0x4,%rcx
> ffff82c48011539a:  4a 03 0c f8           add    (%rax,%r15,8),%rcx
> }
> static inline void
> page_list_del(struct page_info *page, struct page_list_head *head)
> {
>     struct page_info *next = pdx_to_page(page->list.next);
> ffff82c48011539e:  8b 03                 mov    (%rbx),%eax
> ffff82c4801153a0:  48 c1 e0 05           shl    $0x5,%rax
> ffff82c4801153a4:  48 29 e8              sub    %rbp,%rax
> ffff82c4801153a7:  48 3b 19              cmp    (%rcx),%rbx
> ffff82c4801153aa:  0f 84 95 01 00 00     je     ffff82c480115545 <free_heap_pages+0x405>
>     struct page_info *prev = pdx_to_page(page->list.prev);
> ffff82c4801153b0:  89 f2                 mov    %esi,%edx
> ffff82c4801153b2:  48 c1 e2 05           shl    $0x5,%rdx
> ffff82c4801153b6:  48 29 ea              sub    %rbp,%rdx
> ffff82c4801153b9:  48 3b 59 08           cmp    0x8(%rcx),%rbx
> ffff82c4801153bd:  0f 84 bd 01 00 00     je     ffff82c480115580 <free_heap_pages+0x440>
>
>     if ( !__page_list_del_head(page, head, next, prev) )
>     {
>         next->list.prev = page->list.prev;
> ffff82c4801153c3:  89 70 04              mov    %esi,0x4(%rax)
>         prev->list.next = page->list.next;
> ffff82c4801153c6:  8b 03                 mov    (%rbx),%eax
> ffff82c4801153c8:  89 02                 mov    %eax,(%rdx)
> ffff82c4801153ca:  49 89 dd              mov    %rbx,%r13
> ffff82c4801153cd:  41 83 c4 01           add    $0x1,%r12d
> ffff82c4801153d1:  41 83 fc 12           cmp    $0x12,%r12d
> ffff82c4801153d5:  0f 84 e3 00 00 00     je     ffff82c4801154be <free_heap_pages+0x37e>
> ffff82c4801153db:  48 bd 00 00 00 00 0a  mov    $0x7d0a00000000,%rbp
> ffff82c4801153e2:  7d 00 00
> ffff82c4801153e5:  44 89 e1              mov    %r12d,%ecx
> ffff82c4801153e8:  be 01 00 00 00        mov    $0x1,%esi
>  
>  
> ------------------------------------------------------------------------------
> ---------------------
> blktap_sysfs_create: adding attributes for dev ffff880239496c00
> (XEN) ----[ Xen-4.0.0  x86_64  debug=n  Not tainted ]----
> (XEN) CPU:    2
> (XEN) RIP:    e008:[<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
> (XEN) rax: ffff8315ffffffe0   rbx: ffff82f6093b0040   rcx: ffff83063fc01a20
> (XEN) rdx: ffff8315ffffffe0   rsi: 00000000ffffffff   rdi: 000000000049d802
> (XEN) rbp: 00007d0a00000000   rsp: ffff83023ff37cb8   r8:  0000000000000000
> (XEN) r9:  ffffffffffffffff   r10: ffff83060a3c0018   r11: 0000000000000282
> (XEN) r12: 0000000000000000   r13: ffff82f6093b0060   r14: 00000000000001a2
> (XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000026f0
> (XEN) cr3: 000000008da54000   cr2: ffff8315ffffffe4
> (XEN) ds: 0000   es: 0000   fs: 0063   gs: 0000   ss: e010   cs: e008
> (XEN) Xen stack trace from rsp=ffff83023ff37cb8:
> (XEN)    ffff82f6093b7f80 00000000ffffffe0 00000000000001a2 ffff83060a3c0000
> (XEN)    0000000000000000 0000000000000001 ffff82f6093b0060 0000000000000000
> (XEN)    ffff82f6093b0080 ffff82c480115732 00000001093b7cc0 ffff82f6093b0060
> (XEN)    ffff83060a3c0018 0000000000000000 ffff83060a3c0000 ffff83060a3c0fa8
> (XEN)    0000000000000000 ffff82c48014aaa6 ffff83060a3c0fa8 ffff83060a3c0fa8
> (XEN)    ffff83060a3c0014 4000000000000000 ffff83023ff37f28 ffff83060a3c0018
> (XEN)    0000000000000000 ffff83060a3c0000 0000000000305000 0000000000000009
> (XEN)    0000000000000009 ffff82c48014b2fd 00ffffffffffffff ffff83060a3c0000
> (XEN)    0000000000000000 ffff83023ff37e28 0000000000305000 ffff82c480105fe0
> (XEN)    ffff82c480255240 fffffffffffffff3 0000000002599000 ffff82c4801043ce
> (XEN)    ffff82c4801447da 0000000000000080 ffff83023ff37f28 0000000000000096
> (XEN)    ffff83023ff37f28 00000000000000fc 0000000600000002 00000000023c0031
> (XEN)    0000000000000001 00000039890a8e2a 0000003000000018 000000004523af30
> (XEN)    000000004523ae70 0000000000000000 00007fc608ea8a70 000000398903c8a4
> (XEN)    000000004523af44 0000000000000000 000000004523b158 0000000000000000
> (XEN)    0000007f024f6d20 00007fc60a094750 000000000255ff40 00007fc607be5ea8
> (XEN)    fffffffffffffff5 0000000000000246 00000039880cc557 0000000000000100
> (XEN)    00000039880cc557 0000000000000033 0000000000000246 ffff8300bf562000
> (XEN)    ffff8801db8d3e78 000000004523aec0 0000000000305000 0000000000000009
> (XEN)    0000000000000009 ffff82c4801e3169 0000000000000009 0000000000000009
> (XEN) Xen call trace:
> (XEN)    [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN)    [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> (XEN)    [<ffff82c48014aaa6>] relinquish_memory+0x186/0x530
> (XEN)    [<ffff82c48014b2fd>] domain_relinquish_resources+0x1ad/0x280
> (XEN)    [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> (XEN)    [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> (XEN)    [<ffff82c4801447da>] __find_next_bit+0x6a/0x70
> (XEN)    [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> (XEN)    
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN)  L4[0x106] = 00000000bf569027 5555555555555555
> (XEN)  L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN) 
> (XEN) Manual reset required ('noreboot' specified)
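
The faulting address can be reproduced from the disassembly and register
dump above. A small sketch follows; the function names are made up, and
reading rbp as the negated frame table base is an inference from the
"sub %rbp" instructions and the ffff82f6... page addresses, not something
the log states.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the register dump and disassembly above. */
#define NEG_FRAME_TABLE UINT64_C(0x00007d0a00000000) /* rbp, subtracted after the shift */
#define PAGE_INFO_SHIFT 5                            /* shl $0x5: 32 bytes per entry */

/* Mirrors "mov (%rbx),%eax; shl $0x5,%rax; sub %rbp,%rax". */
static uint64_t pdx_to_page_addr(uint32_t pdx)
{
    /* Unsigned wrap-around makes this an addition of the table base. */
    return ((uint64_t)pdx << PAGE_INFO_SHIFT) - NEG_FRAME_TABLE;
}

/* Address written by "mov %esi,0x4(%rax)", i.e. &next->list.prev. */
static uint64_t next_list_prev_addr(uint32_t next_pdx)
{
    return pdx_to_page_addr(next_pdx) + 4;
}

/* Cross-check against the values Xen printed in the panic. */
static void check_against_dump(void)
{
    assert(pdx_to_page_addr(0xffffffffu) == UINT64_C(0xffff8315ffffffe0));    /* rax, rdx */
    assert(next_list_prev_addr(0xffffffffu) == UINT64_C(0xffff8315ffffffe4)); /* cr2 */
}
```

Feeding in 0xffffffff (the value in rsi, and the value that must have been
loaded from page->list.next to produce rax) gives ffff8315ffffffe0, and the
write at offset 4 lands on ffff8315ffffffe4, matching cr2. That suggests the
page being unlinked had all-ones next and prev links, rather than the page
pointer itself being wild.
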
>  
> ------------------------------------------------------------------------------
> ---------------------
>> Date: Mon, 30 Aug 2010 14:16:09 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@xxxxxxxxxxxxx
>> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>> 
>> On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>> 
>>> I appreciate the quick response.
>>> 
>>> Actually I did some decoding of the backtrace last Friday.
>>> According to the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
>>> (please see below). It looks like the bug happened on the domain page list
>> 
>> ffff82c4801153c3 isn't the start of an instruction in your below
>> disassembly. Hence you didn't disassemble exactly the build of Xen which
>> crashed. It needs to be exactly the same image.
>> 
>> -- keir
>> 
>>> traversal, which is beyond my understanding. As I understand it,
>>> those domain pages come from the kernel memory zone, so they always
>>> reside in physical memory, and the address shouldn't have a chance
>>> to change, right?
>>> If so, what is the relationship between all those panics and free_heap_pages?
>>> 
>>> Several servers (at least 3) experienced the same panic on the same test.
>>> Those servers have the identical hardware, kernel and xen configuration.
>>> Right now, on one server, memtest is running, shall be finished in a few
>>> hours.
>>> (24G memory)
>>> 
>>> ----------------------------------------------------------------------------
>>> --
>>> ------
>>> static inline void
>>> page_list_del(struct page_info *page, struct page_list_head *head)
>>> {
>>>     struct page_info *next = pdx_to_page(page->list.next);
>>>     struct page_info *prev = pdx_to_page(page->list.prev);
>>> ffff82c4801153b8:  8b 73 04              mov    0x4(%rbx),%esi
>>> ffff82c4801153bb:  49 8d 0c 06           lea    (%r14,%rax,1),%rcx
>>> ffff82c4801153bf:  48 8d 05 fa 10 26 00  lea    2494714(%rip),%rax        # ffff82c4803764c0 <_heap>
>>> ffff82c4801153c6:  48 c1 e1 04           shl    $0x4,%rcx
>>> ffff82c4801153ca:  4a 03 0c f8           add    (%rax,%r15,8),%rcx
>>> }
>>> static inline void
>>> page_list_del(struct page_info *page, struct page_list_head *head)
>>> {
>>>     struct page_info *next = pdx_to_page(page->list.next);
>>> ffff82c4801153ce:  8b 03                 mov    (%rbx),%eax
>>> ffff82c4801153d0:  48 c1 e0 05           shl    $0x5,%rax
>>> ffff82c4801153d4:  48 29 e8              sub    %rbp,%rax
>>> ffff82c4801153d7:  48 3b 19              cmp    (%rcx),%rbx
>>> ffff82c4801153da:  0f 84 95 01 00 00     je     ffff82c480115575 <free_heap_pages+0x405>
>>>     struct page_info *prev = pdx_to_page(page->list.prev);
>>> ffff82c4801153e0:  89 f2                 mov    %esi,%edx
>>> ffff82c4801153e2:  48 c1 e2 05           shl    $0x5,%rdx
>>> ffff82c4801153e6:  48 29 ea              sub    %rbp,%rdx
>>> ffff82c4801153e9:  48 3b 59 08           cmp    0x8(%rcx),%rbx
>>> ffff82c4801153ed:  0f 84 bd 01 00 00     je     ffff82c4801155b0 <free_heap_pages+0x440>
>>>
>>>     if ( !__page_list_del_head(page, head, next, prev) )
>>>     {
>>> ----------------------------------------------------------------------------
>>> --
>>> ------
>>> 
>>>> Date: Mon, 30 Aug 2010 10:02:05 +0100
>>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>>> From: keir.fraser@xxxxxxxxxxxxx
>>>> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>>> 
>>>> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>>>> 
>>>>> 3) Every panic points to the same address: ffff8315ffffffe4, which is
>>>>> not a valid page address.
>>>>> I printed the pages of the domain in assign_pages, and they all look like
>>>>> ffff82f60bd64000; at least the
>>>>> ffff82f60 prefix is the same.
>>>> 
>>>> Yes, well you may not be crashing on a supposed page address. Certainly the
>>>> page pointer that relinquish_memory() is working on, and passed to
>>>> put_page->free_domheap_pages is valid enough to not cause any of those
>>>> functions to crash when dereferencing it. At the moment you really have no
>>>> idea what is causing free_heap_pages() to crash.
>>>> 
>>>>> I am a bit lost about where to go next. Thanks.
>>>> 
>>>> You need to find out which line of code in free_heap_pages() is crashing,
>>>> and what variable it is trying to dereference when it crashes. You have a
>>>> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
>>>> search for the EIP in the disassembly. If you have a debug build of Xen you
>>>> can even do 'objdump -S xen-syms' and have the disassembly annotated with
>>>> corresponding source lines.
>>>> 
>>>> Have you seen this on more than one physical machine? If not, have you run
>>>> memtest on the offending machine?
>>>> 
>>>> -- Keir
>>>> 
>>>> 
>>> 
>> 
>> 
>       


