xen-devel
[Xen-devel] Re: L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
On Wed, 2011-03-16 at 22:19 +0000, Konrad Rzeszutek Wilk wrote:
> I am troubleshooting an issue where the Linux kernel tries
> to dereference a not present entry. I have a fix for this
> in for-2.6.32/bug-fixes .. but please read on.
I'll give it a shot, I'll try anything at this point ;P
> Specifically it tries to dereference the fixmapped value of
> APIC_BASE. The fixmapped value of APIC_BASE is actually not set
> due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284
> which adds this in arch/x86/kernel/acpi/boot.c:
>
> static void __init acpi_register_lapic_address(unsigned long address)
> {
> 	/* Xen dom0 doesn't have usable lapics */
> 	if (xen_initial_domain())
> 		return;
>
> 	mp_lapic_addr = address;
>
> 	set_fixmap_nocache(FIX_APIC_BASE, address);
>
> Later on we use 'native_apic_read', which tries to use APIC_BASE as an
> address (it is presumed to be at slot FIX_APIC_BASE of the fixmap
> API), and it fails (on some machines).
>
> Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', this is what
> we get if we walk the pagetable:
>
>
> [ 0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
> [ 0.000000] mapped APIC to ffffffffff5fb000 (00000000)
> (XEN) d0:v0: unhandled page fault (ec=0000)
> (XEN) Pagetable walk from ffffffffff5fb020:
> (XEN) L4[0x1ff] = 0000000221003067 0000000000001003
> (XEN) L3[0x1ff] = 0000000221004067 0000000000001004
> (XEN) L2[0x1fa] = 0000000221771067 0000000000001771
> (XEN) L1[0x1fb] = 0000000000000000 ffffffffffffffff
> (XEN) domain_crash_sync called from entry.S
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.1-110309 x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 0
> (XEN) RIP: e033:[<ffffffff8102b5d1>]
> (XEN) RFLAGS: 0000000000000292 EM: 1 CONTEXT: pv guest
> (XEN) rax: ffffffff8164cf50 rbx: 000000026ec00000 rcx: 00000000ffffdd85
> (XEN) rdx: 00000000ffffffff rsi: 0000000000000000 rdi: 0000000000000020
> (XEN) rbp: ffffffff81643ea8 rsp: ffffffff81643e50 r8: 0000000000000002
> (XEN) r9: 0000000000000000 r10: 0000000000000000 r11: 0000000000000000
> (XEN) r12: ffff880013671800 r13: 00000000bff66000 r14: ffffffffffffffff
> (XEN) r15: 0000000000000000 cr0: 000000008005003b cr4: 00000000000006f0
> (XEN) cr3: 0000000221001000 cr2: ffffffffff5fb020
> (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
> (XEN) Guest stack trace from rsp=ffffffff81643e50:
>
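The indices in the pagetable walk above can be reproduced by hand from the faulting address in cr2. The following is a minimal sketch, assuming standard x86-64 4-level paging with 4 KiB pages; `pt_index` and `page_offset` are hypothetical helper names, not kernel or Xen functions. For ffffffffff5fb020 it yields L4 = 0x1ff, L3 = 0x1ff, L2 = 0x1fa, L1 = 0x1fb and a page offset of 0x20, i.e. 0x20 bytes past the "mapped APIC to ffffffffff5fb000" address in the log (0x20 is the APIC_ID register offset in Linux's apicdef.h).

```c
#include <stdint.h>

/* x86-64 4-level paging: 9 index bits per level, 12-bit page offset. */
#define PT_INDEX_BITS 9
#define PT_INDEX_MASK ((1u << PT_INDEX_BITS) - 1)
#define PAGE_SHIFT_4K 12

/*
 * Hypothetical helper: return the table index used at a given level
 * for a virtual address. level 1 = L1 (page table), level 4 = L4 (PML4).
 */
static unsigned pt_index(uint64_t vaddr, int level)
{
    int shift = PAGE_SHIFT_4K + PT_INDEX_BITS * (level - 1);
    return (unsigned)((vaddr >> shift) & PT_INDEX_MASK);
}

/* Byte offset of the address within its 4 KiB page. */
static unsigned page_offset(uint64_t vaddr)
{
    return (unsigned)(vaddr & ((1u << PAGE_SHIFT_4K) - 1));
}
```

Calling `pt_index(0xffffffffff5fb020ULL, 1)` gives the L1 slot that Xen reports as empty.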
> Which is to say that the L1 has this:
> 0000000115771fa0: 00000000 00000000 00000000 00000000
> 0000000115771fb0: 00000000 00000000 00000000 00000000
> 0000000115771fc0: 00000000 00000000 15770067 80100001
> 0000000115771fd0: 15770067 80100001 00000000 00000000
> 0000000115771fe0: 00000000 00000000 00000000 00000000
> 0000000115771ff0: 00000000 00000000 00000000 00000000
>
> L1[0x1fb] is machine address 115771fd8, which has nothing in it.
>
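As a rough illustration of why a zero entry faults and where it sits in the dump: each x86-64 pagetable entry is 8 bytes, so slot 0x1fb lies at byte offset 0xfd8 within its table page, which matches the low bits of the 115771fd8 address quoted above. Bit 0 of an entry is the present bit. The sketch below assumes the standard x86-64 entry layout; the macro and function names are hypothetical and are not taken from the kernel or from Xen.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86-64 pagetable entry (standard layout). */
#define SKETCH_PTE_PRESENT 0x001ULL
#define SKETCH_PTE_RW      0x002ULL
#define SKETCH_PTE_USER    0x004ULL

/* Bits 12..51 hold the frame address of the next level or of the page. */
#define SKETCH_PTE_FRAME_MASK 0x000ffffffffff000ULL

/* An entry with bit 0 clear is "not present": any access through it faults. */
static bool sketch_pte_present(uint64_t pte)
{
    return (pte & SKETCH_PTE_PRESENT) != 0;
}

/* Machine address of the frame the entry points at. */
static uint64_t sketch_pte_frame(uint64_t pte)
{
    return pte & SKETCH_PTE_FRAME_MASK;
}

/* Byte offset of entry number `index` inside its 4 KiB table page. */
static uint64_t sketch_pte_slot_offset(unsigned index)
{
    return (uint64_t)index * sizeof(uint64_t);
}
```

With the values from the walk, `sketch_pte_present(0x0000000221771067ULL)` is true for the L2 entry, while the all-zero L1 entry is not present.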
> OK, so I've come up with a fix that is a back-port of how 2.6.38 does it:
> it removes the check I mentioned above, and in xen_set_fixmap
> we set FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
> It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes
>
> Gianni, you might want to check this out in case it fixes the problem you
> are experiencing.
Not sure, mine happens a lot earlier, sort of just after the very early
memory initialisation. Also we're nowhere near trying to use APIC
anything as an address afaict - just trying to reach the xen info page.
The last thing I see is:
[ 0.000000] kernel direct mapping tables up to 2f000000 @ 100000-27a000
[ 0.000000] init_memory_mapping: 0000000100000000-00000002a7000000
> But one thing I can't understand is why on one machine (IBM x3850)
> I get this crash, while another one with the same pagetable contents
> (L1 has nothing for 0x1fb) works just fine. I added a panic and used
> the Xen hypervisor kdb to manually inspect the pagetable, and it has
> the same contents as the IBM x3850, but it boots fine with this invalid value.
> Any ideas?
A missing TLB flush? heh
>
> FYI, it seems another user (Sven Sübert), with an IBM x3650, hits the same bug.
> With this fix he is able to boot.
Very odd, if this isn't the bug I'm seeing it might be tangentially
related.
I'll let you know
Gianni
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
[Xen-devel] Re: L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?,
Gianni Tedesco <=