Do you have a line in Xen boot output that starts "PFN compression on bits"?
If so what does it say?
My suspicion is that Jan Beulich's patches to implement a consolidated page
array for sparse memory maps have broken the assumption in some Xen code
that:
 page_to_mfn(mfn_to_page(x)+y) == x+y, for all valid mfns x, and all y up to
some pretty big limit.
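
To make the worry concrete, here is a toy model of PFN compression in C. It is only a sketch: the bit widths are made up, and the helpers are simplified stand-ins that borrow the mfn/pdx naming, not Xen's actual definitions. The idea is that a run of always-zero MFN bits (the "hole") is squeezed out to form the frame_table index, so adjacent indexes are adjacent MFNs only inside an aligned block of 2^BOTTOM_BITS pages.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the low BOTTOM_BITS of an MFN are kept as-is and the
 * next HOLE_SHIFT bits are assumed to be zero and are compressed away. */
#define BOTTOM_BITS 4
#define HOLE_SHIFT  2
#define BOTTOM_MASK ((UINT64_C(1) << BOTTOM_BITS) - 1)
#define TOP_MASK    (~((UINT64_C(1) << (BOTTOM_BITS + HOLE_SHIFT)) - 1))

/* Compress an MFN into a frame_table index by removing the hole bits. */
static uint64_t mfn_to_pdx(uint64_t mfn)
{
    return (mfn & BOTTOM_MASK) | ((mfn & TOP_MASK) >> HOLE_SHIFT);
}

/* Expand a frame_table index back into an MFN by re-inserting the hole. */
static uint64_t pdx_to_mfn(uint64_t pdx)
{
    return (pdx & BOTTOM_MASK) | ((pdx << HOLE_SHIFT) & TOP_MASK);
}

/* The assumption under suspicion: stepping y entries through the page array
 * from x lands on MFN x + y. */
static bool invariant_holds(uint64_t x, uint64_t y)
{
    return pdx_to_mfn(mfn_to_pdx(x) + y) == x + y;
}
```

With these toy parameters, invariant_holds(0xe, 1) is true because 0xe and 0xf share an aligned 16-page block, while invariant_holds(0xe, 2) is false: the index after 0xf maps back to MFN 0x40, not 0x10.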
Looking in free_heap_pages() I see we do a whole bunch of chunk merging in
our buddy allocator, doing arithmetic on variable 'pg' to find neighbouring
chunks. It's a bit dodgy I suspect.
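
The merging arithmetic boils down to something like the following. This is a simplified sketch of generic buddy arithmetic, not the code in free_heap_pages(); the helper names are invented for illustration and each chunk is assumed to start on a 2^order aligned MFN.

```c
#include <stdint.h>

/* MFN of the buddy of the 2^order chunk that starts at mfn.
 * Assumes mfn is aligned to 2^order. */
static uint64_t buddy_mfn(uint64_t mfn, unsigned int order)
{
    return mfn ^ (UINT64_C(1) << order);
}

/* First MFN of the 2^(order+1) chunk formed by merging with the buddy. */
static uint64_t merged_head_mfn(uint64_t mfn, unsigned int order)
{
    return mfn & ~(UINT64_C(1) << order);
}

/* Signed distance, in page_info entries, that pointer arithmetic on 'pg'
 * would apply to reach the buddy. This is only the right entry if the
 * page array is contiguous in MFN terms across the whole merged range. */
static int64_t buddy_pg_offset(uint64_t mfn, unsigned int order)
{
    return (int64_t)buddy_mfn(mfn, order) - (int64_t)mfn;
}
```

For an order-3 chunk at MFN 0x18 the buddy is at 0x10, so the allocator would look at pg - 8. If the page array is not MFN-contiguous over that distance, pg - 8 describes some other page entirely.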
I'm cc'ing Jan to see what we can get away with in doing arithmetic on
page_info pointers. What are the guaranteed smallest aligned contiguous
ranges of mfn in the frame_table now, Jan? (i.e., ranges in which adjacent
page_info structs relate to adjacent MFNs)
If this is the problem I'm pretty sure we can come up with a patch quite
easily, but depending on the answer to my above question to Jan, we may need
to do some code auditing.
-- Keir
On 31/08/2010 14:49, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
> Hi Keir:
>
> Thank you for correcting my mistakes.
> Here is the latest panic and its objdump.
> I am not familiar with assembly language or how those registers are used.
> I will try to spend some more time to understand it better.
> What's your opinion?
> btw, the memtest is still running, so far so good, thanks.
>
> ------------------objdump-----------------------------------------------------
> ffff82c480115396:  48 c1 e1 04           shl    $0x4,%rcx
> ffff82c48011539a:  4a 03 0c f8           add    (%rax,%r15,8),%rcx
> }
> static inline void
> page_list_del(struct page_info *page, struct page_list_head *head)
> {
>     struct page_info *next = pdx_to_page(page->list.next);
> ffff82c48011539e:  8b 03                 mov    (%rbx),%eax
> ffff82c4801153a0:  48 c1 e0 05           shl    $0x5,%rax
> ffff82c4801153a4:  48 29 e8              sub    %rbp,%rax
> ffff82c4801153a7:  48 3b 19              cmp    (%rcx),%rbx
> ffff82c4801153aa:  0f 84 95 01 00 00     je     ffff82c480115545 <free_heap_pages+0x405>
>     struct page_info *prev = pdx_to_page(page->list.prev);
> ffff82c4801153b0:  89 f2                 mov    %esi,%edx
> ffff82c4801153b2:  48 c1 e2 05           shl    $0x5,%rdx
> ffff82c4801153b6:  48 29 ea              sub    %rbp,%rdx
> ffff82c4801153b9:  48 3b 59 08           cmp    0x8(%rcx),%rbx
> ffff82c4801153bd:  0f 84 bd 01 00 00     je     ffff82c480115580 <free_heap_pages+0x440>
>
>     if ( !__page_list_del_head(page, head, next, prev) )
>     {
>         next->list.prev = page->list.prev;
> ffff82c4801153c3:  89 70 04              mov    %esi,0x4(%rax)
>         prev->list.next = page->list.next;
> ffff82c4801153c6:  8b 03                 mov    (%rbx),%eax
> ffff82c4801153c8:  89 02                 mov    %eax,(%rdx)
> ffff82c4801153ca:  49 89 dd              mov    %rbx,%r13
> ffff82c4801153cd:  41 83 c4 01           add    $0x1,%r12d
> ffff82c4801153d1:  41 83 fc 12           cmp    $0x12,%r12d
> ffff82c4801153d5:  0f 84 e3 00 00 00     je     ffff82c4801154be <free_heap_pages+0x37e>
> ffff82c4801153db:  48 bd 00 00 00 00 0a  mov    $0x7d0a00000000,%rbp
> ffff82c4801153e2:  7d 00 00
> ffff82c4801153e5:  44 89 e1              mov    %r12d,%ecx
> ffff82c4801153e8:  be 01 00 00 00        mov    $0x1,%esi
>
>
> ------------------------------------------------------------------------------
> blktap_sysfs_create: adding attributes for dev ffff880239496c00
> (XEN) ----[ Xen-4.0.0 x86_64 debug=n Not tainted ]----
> (XEN) CPU: 2
> (XEN) RIP: e008:[<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN) RFLAGS: 0000000000010282 CONTEXT: hypervisor
> (XEN) rax: ffff8315ffffffe0 rbx: ffff82f6093b0040 rcx: ffff83063fc01a20
> (XEN) rdx: ffff8315ffffffe0 rsi: 00000000ffffffff rdi: 000000000049d802
> (XEN) rbp: 00007d0a00000000 rsp: ffff83023ff37cb8 r8: 0000000000000000
> (XEN) r9: ffffffffffffffff r10: ffff83060a3c0018 r11: 0000000000000282
> (XEN) r12: 0000000000000000 r13: ffff82f6093b0060 r14: 00000000000001a2
> (XEN) r15: 0000000000000001 cr0: 000000008005003b cr4: 00000000000026f0
> (XEN) cr3: 000000008da54000 cr2: ffff8315ffffffe4
> (XEN) ds: 0000 es: 0000 fs: 0063 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff83023ff37cb8:
> (XEN) ffff82f6093b7f80 00000000ffffffe0 00000000000001a2 ffff83060a3c0000
> (XEN) 0000000000000000 0000000000000001 ffff82f6093b0060 0000000000000000
> (XEN) ffff82f6093b0080 ffff82c480115732 00000001093b7cc0 ffff82f6093b0060
> (XEN) ffff83060a3c0018 0000000000000000 ffff83060a3c0000 ffff83060a3c0fa8
> (XEN) 0000000000000000 ffff82c48014aaa6 ffff83060a3c0fa8 ffff83060a3c0fa8
> (XEN) ffff83060a3c0014 4000000000000000 ffff83023ff37f28 ffff83060a3c0018
> (XEN) 0000000000000000 ffff83060a3c0000 0000000000305000 0000000000000009
> (XEN) 0000000000000009 ffff82c48014b2fd 00ffffffffffffff ffff83060a3c0000
> (XEN) 0000000000000000 ffff83023ff37e28 0000000000305000 ffff82c480105fe0
> (XEN) ffff82c480255240 fffffffffffffff3 0000000002599000 ffff82c4801043ce
> (XEN) ffff82c4801447da 0000000000000080 ffff83023ff37f28 0000000000000096
> (XEN) ffff83023ff37f28 00000000000000fc 0000000600000002 00000000023c0031
> (XEN) 0000000000000001 00000039890a8e2a 0000003000000018 000000004523af30
> (XEN) 000000004523ae70 0000000000000000 00007fc608ea8a70 000000398903c8a4
> (XEN) 000000004523af44 0000000000000000 000000004523b158 0000000000000000
> (XEN) 0000007f024f6d20 00007fc60a094750 000000000255ff40 00007fc607be5ea8
> (XEN) fffffffffffffff5 0000000000000246 00000039880cc557 0000000000000100
> (XEN) 00000039880cc557 0000000000000033 0000000000000246 ffff8300bf562000
> (XEN) ffff8801db8d3e78 000000004523aec0 0000000000305000 0000000000000009
> (XEN) 0000000000000009 ffff82c4801e3169 0000000000000009 0000000000000009
> (XEN) Xen call trace:
> (XEN) [<ffff82c4801153c3>] free_heap_pages+0x283/0x4a0
> (XEN) [<ffff82c480115732>] free_domheap_pages+0x152/0x380
> (XEN) [<ffff82c48014aaa6>] relinquish_memory+0x186/0x530
> (XEN) [<ffff82c48014b2fd>] domain_relinquish_resources+0x1ad/0x280
> (XEN) [<ffff82c480105fe0>] domain_kill+0x80/0xf0
> (XEN) [<ffff82c4801043ce>] do_domctl+0x1be/0x1000
> (XEN) [<ffff82c4801447da>] __find_next_bit+0x6a/0x70
> (XEN) [<ffff82c4801e3169>] syscall_enter+0xa9/0xae
> (XEN)
> (XEN) Pagetable walk from ffff8315ffffffe4:
> (XEN) L4[0x106] = 00000000bf569027 5555555555555555
> (XEN) L3[0x057] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0002]
> (XEN) Faulting linear address: ffff8315ffffffe4
> (XEN) ****************************************
> (XEN)
> (XEN) Manual reset required ('noreboot' specified)
>
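
The registers in the dump above can be decoded by hand, and the arithmetic is easy to check in C. This is a rough sketch with two assumptions read off the disassembly rather than taken from the Xen headers: "shl $0x5" suggests sizeof(struct page_info) is 32, and "sub %rbp" with rbp=0x7d0a00000000 suggests the page array starts at 0xffff82f600000000. The helper names are made up.

```c
#include <stdint.h>

/* Assumed from the disassembly: subtracting 0x00007d0a00000000 is the same
 * as adding this base in 64-bit arithmetic. */
#define FRAME_TABLE_BASE UINT64_C(0xffff82f600000000)
/* Assumed from "shl $0x5": each page_info entry is 32 bytes. */
#define PAGE_INFO_SHIFT  5

/* Address of the page_info entry for a given 32-bit list index. */
static uint64_t pdx_to_page_addr(uint32_t pdx)
{
    return ((uint64_t)pdx << PAGE_INFO_SHIFT) + FRAME_TABLE_BASE;
}

/* Address written by "mov %esi,0x4(%rax)", i.e. next->list.prev,
 * where 'next_pdx' is the index loaded from page->list.next. */
static uint64_t next_prev_store_addr(uint32_t next_pdx)
{
    return pdx_to_page_addr(next_pdx) + 4;
}
```

An index of 0xffffffff (the value in rsi) gives ffff8315ffffffe0, which is what rax and rdx hold, and the store at offset 4 lands on ffff8315ffffffe4, matching cr2. Index 0x49d802 (the value in rdi) gives ffff82f6093b0040, matching rbx. So both list.next and list.prev of the page being deleted appear to hold an all-ones index, which looks more like a list terminator than a real page, though that reading is a guess.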
> ------------------------------------------------------------------------------
>> Date: Mon, 30 Aug 2010 14:16:09 +0100
>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>> From: keir.fraser@xxxxxxxxxxxxx
>> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>
>> On 30/08/2010 14:03, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>>
>>> I appreciate the quick response.
>>>
>>> Actually I did some decoding of the backtrace last Friday.
>>> According to the RIP ffff82c4801153c3, I cut the "objdump -dS xen-syms"
>>> output (please see below). It looks like the bug happened on the domain page list
>>
>> ffff82c4801153c3 isn't the start of an instruction in your below
>> disassembly. Hence you didn't disassemble exactly the build of Xen which
>> crashed. It needs to be exactly the same image.
>>
>> -- keir
>>
>>> traversal, which is beyond my understanding. As I understand it,
>>> those domain pages come from the kernel memory zone, so they always
>>> reside in physical memory, and their addresses shouldn't have the chance
>>> to change, right?
>>> If so, what is the relationship between all those panics and free_heap_pages?
>>>
>>> Several servers (at least 3) experienced the same panic on the same test.
>>> Those servers have the identical hardware, kernel and xen configuration.
>>> Right now, on one server, memtest is running, shall be finished in a few
>>> hours.
>>> (24G memory)
>>>
>>> ------------------------------------------------------------------------------
>>> static inline void
>>> page_list_del(struct page_info *page, struct page_list_head *head)
>>> {
>>>     struct page_info *next = pdx_to_page(page->list.next);
>>>     struct page_info *prev = pdx_to_page(page->list.prev);
>>> ffff82c4801153b8:  8b 73 04              mov    0x4(%rbx),%esi
>>> ffff82c4801153bb:  49 8d 0c 06           lea    (%r14,%rax,1),%rcx
>>> ffff82c4801153bf:  48 8d 05 fa 10 26 00  lea    2494714(%rip),%rax        # ffff82c4803764c0 <_heap>
>>> ffff82c4801153c6:  48 c1 e1 04           shl    $0x4,%rcx
>>> ffff82c4801153ca:  4a 03 0c f8           add    (%rax,%r15,8),%rcx
>>> }
>>> static inline void
>>> page_list_del(struct page_info *page, struct page_list_head *head)
>>> {
>>>     struct page_info *next = pdx_to_page(page->list.next);
>>> ffff82c4801153ce:  8b 03                 mov    (%rbx),%eax
>>> ffff82c4801153d0:  48 c1 e0 05           shl    $0x5,%rax
>>> ffff82c4801153d4:  48 29 e8              sub    %rbp,%rax
>>> ffff82c4801153d7:  48 3b 19              cmp    (%rcx),%rbx
>>> ffff82c4801153da:  0f 84 95 01 00 00     je     ffff82c480115575 <free_heap_pages+0x405>
>>>     struct page_info *prev = pdx_to_page(page->list.prev);
>>> ffff82c4801153e0:  89 f2                 mov    %esi,%edx
>>> ffff82c4801153e2:  48 c1 e2 05           shl    $0x5,%rdx
>>> ffff82c4801153e6:  48 29 ea              sub    %rbp,%rdx
>>> ffff82c4801153e9:  48 3b 59 08           cmp    0x8(%rcx),%rbx
>>> ffff82c4801153ed:  0f 84 bd 01 00 00     je     ffff82c4801155b0 <free_heap_pages+0x440>
>>>
>>>     if ( !__page_list_del_head(page, head, next, prev) )
>>>     {
>>>
>>> ------------------------------------------------------------------------------
>>>
>>>> Date: Mon, 30 Aug 2010 10:02:05 +0100
>>>> Subject: Re: [Xen-devel] Xen-unstable panic: FATAL PAGE FAULT
>>>> From: keir.fraser@xxxxxxxxxxxxx
>>>> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>>>
>>>> On 30/08/2010 09:47, "MaoXiaoyun" <tinnycloud@xxxxxxxxxxx> wrote:
>>>>
>>>>> 3) Every panic points to the same address: ffff8315ffffffe4, which is
>>>>> not a valid page address.
>>>>> I printed the pages of the domain in assign_pages, and they all look like
>>>>> ffff82f60bd64000; at least the
>>>>> ffff82f60 prefix is the same.
>>>>
>>>> Yes, well you may not be crashing on a supposed page address. Certainly the
>>>> page pointer that relinquish_memory() is working on, and passed to
>>>> put_page->free_domheap_pages is valid enough to not cause any of those
>>>> functions to crash when dereferencing it. At the moment you really have no
>>>> idea what is causing free_heap_pages() to crash.
>>>>
>>>>> I am a bit lost on which direction to take next. Thanks.
>>>>
>>>> You need to find out which line of code in free_heap_pages() is crashing,
>>>> and what variable it is trying to dereference when it crashes. You have a
>>>> nice backtrace with an EIP value, so you can 'objdump -d xen-syms' and
>>>> search for the EIP in the disassembly. If you have a debug build of Xen you
>>>> can even do 'objdump -S xen-syms' and have the disassembly annotated with
>>>> corresponding source lines.
>>>>
>>>> Have you seen this on more than one physical machine? If not, have you run
>>>> memtest on the offending machine?
>>>>
>>>> -- Keir
>>>>
>>>>
>>>
>>
>>
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel