[Xen-devel] Re: L1[0x1fb] = 0000000000000000 which faults on one

On Wed, 2011-03-16 at 22:19 +0000, Konrad Rzeszutek Wilk wrote:
> I am troubleshooting an issue where the Linux kernel tries
> to dereference a not-present entry. I have a fix for this
> in for-2.6.32/bug-fixes .. but please read on.

I'll give it a shot, I'll try anything at this point ;P

> Specifically it tries to dereference the fixmapped value of
> APIC_BASE. The fixmapped value of APIC_BASE is actually not set
> due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284
> which adds this in arch/x86/kernel/acpi/boot.c:
> 
> static void __init acpi_register_lapic_address(unsigned long address)
> {
>         /* Xen dom0 doesn't have usable lapics */
>         if (xen_initial_domain())
>                 return;
>
>         mp_lapic_addr = address;
>
>         set_fixmap_nocache(FIX_APIC_BASE, address);
> 
> Later on we use 'native_apic_read', which tries to use APIC_BASE as the
> address (it is presumed to be at slot FIX_APIC_BASE of the fixmap
> API), and it fails (on some machines).
> 
> Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', this is what we
> get if we go through the pagetable:
> 
> 
> [    0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
> [    0.000000] mapped APIC to ffffffffff5fb000 (00000000)
> (XEN) d0:v0: unhandled page fault (ec=0000)
> (XEN) Pagetable walk from ffffffffff5fb020:
> (XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
> (XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
> (XEN)  L2[0x1fa] = 0000000221771067 0000000000001771 
> (XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.1-110309  x86_64  debug=y  Tainted:    C ]----
> (XEN) CPU:    0
> (XEN) RIP:    e033:[<ffffffff8102b5d1>]
> (XEN) RFLAGS: 0000000000000292   EM: 1   CONTEXT: pv guest
> (XEN) rax: ffffffff8164cf50   rbx: 000000026ec00000   rcx: 00000000ffffdd85
> (XEN) rdx: 00000000ffffffff   rsi: 0000000000000000   rdi: 0000000000000020
> (XEN) rbp: ffffffff81643ea8   rsp: ffffffff81643e50   r8:  0000000000000002
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: ffff880013671800   r13: 00000000bff66000   r14: ffffffffffffffff
> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
> (XEN) cr3: 0000000221001000   cr2: ffffffffff5fb020
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
> (XEN) Guest stack trace from rsp=ffffffff81643e50:
> 
> Which is to say that the L1 has this:
> 0000000115771fa0:  00000000 00000000 00000000 00000000
> 0000000115771fb0:  00000000 00000000 00000000 00000000
> 0000000115771fc0:  00000000 00000000 15770067 80100001
> 0000000115771fd0:  15770067 80100001 00000000 00000000
> 0000000115771fe0:  00000000 00000000 00000000 00000000
> 0000000115771ff0:  00000000 00000000 00000000 00000000
> 
> L1[0x1fb] is machine address 115771fd8, which has nothing in it.
> 
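For reference, the indices in the walk above can be reproduced by hand. What follows is a minimal sketch, not code from the kernel or Xen trees, assuming standard x86-64 4-level paging with 4 KiB pages and 8-byte entries. The helper names (pt_index, pt_entry_offset, pte_is_present) and the macros are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: x86-64 4-level paging, 4 KiB pages, 9 index bits per level. */
#define PT_INDEX_BITS 9
#define PT_INDEX_MASK ((1u << PT_INDEX_BITS) - 1)
#define PAGE_SHIFT_4K 12
#define PTE_PRESENT   0x1ull   /* bit 0 of an entry: present */
#define PTE_SIZE      8u       /* bytes per 64-bit entry */

/* Hypothetical helper: index into the table at `level` (1 = L1 ... 4 = L4). */
static unsigned int pt_index(uint64_t vaddr, int level)
{
    assert(level >= 1 && level <= 4);
    int shift = PAGE_SHIFT_4K + PT_INDEX_BITS * (level - 1);
    return (unsigned int)((vaddr >> shift) & PT_INDEX_MASK);
}

/* Hypothetical helper: byte offset of entry `index` within its table page. */
static unsigned int pt_entry_offset(unsigned int index)
{
    return index * PTE_SIZE;
}

/* Hypothetical helper: non-zero if the entry has its present bit set. */
static int pte_is_present(uint64_t entry)
{
    return (entry & PTE_PRESENT) != 0;
}
```

With the faulting address ffffffffff5fb020 from cr2, pt_index() yields 0x1ff, 0x1ff, 0x1fa and 0x1fb for L4 down to L1, matching the walk Xen printed. pt_entry_offset(0x1fb) is 0xfd8, which is why the empty slot sits at ...fd8 in the L1 dump, and pte_is_present() is false for the all-zero L1 entry while it is true for the L2 entry 0000000221771067.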
> OK, so I've come up with a fix that is a back-port of how 2.6.38 does it:
> it removes the check I mentioned above, and in xen_set_fixmap we set
> FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
> It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.
> 
> Gianni, you might want to check this out in case it fixes the problem you
> are experiencing.

Not sure, mine happens a lot earlier, sort of just after the very early
memory initialisation. Also we're nowhere near trying to use APIC
anything as an address afaict - just trying to reach the xen info page.

The last thing I see is:
[    0.000000] kernel direct mapping tables up to 2f000000 @ 100000-27a000
[    0.000000] init_memory_mapping: 0000000100000000-00000002a7000000


> But one thing I can't understand is why I get this crash on one machine
> (IBM x3850), while another one with the same pagetable contents
> (L1 has nothing for 0x1fb) works just fine. I added a panic and used
> the Xen hypervisor kdb to manually inspect the pagetable, and it has
> the same contents as the IBM x3850, but it boots fine with this invalid value.
> Any ideas?

A missing TLB flush? heh

> 
> FYI, it seems another user (Sven Sübert) with an IBM x3650 hits the same bug,
> and with this fix he is able to boot.

Very odd, if this isn't the bug I'm seeing it might be tangentially
related.

I'll let you know

Gianni

