[Xen-devel] shadow2 corrupting PV guest state
I've been fighting random crashes in the paravirt tree for a while.
After a fair amount of head-banging, it looks to me like the shadow2
code is trashing the guest stack (and maybe register state) at random
points.
If I boot a kernel with CONFIG_DEBUG_PAGEALLOC enabled (which
dramatically increases the rate of pagetable modifications), it rarely
makes it through early boot without some random crash. The crashes are
often at the same place, but they move around; however they tend to be
near places where the pagetable is touched. It may also interact with
timer events; certainly masking events seems to help a bit.
I tend to see this a lot more when running under qemu, but I've also
seen strange things happen on real hardware.
If I roll Xen back to pre-shadow2 (change fda70200da01), all these
mysterious crashes disappear.
Looking into it a bit more deeply, the kind of crash I'm seeing is
along the lines of:
    mov  (%ebx), %eax      # works; %ebx is a valid pointer
    call xen_enable_irq
    mov  %eax, (%ebx)      # crashes; %ebx will equal 0, 1, or something bad
where xen_enable_irq will have pushed %ebx, set the flag state, polled
for pending events and popped %ebx. My suspicion is that something
about re-enabling interrupts is causing the on-stack version of %ebx to
get trashed, rather than the actual %ebx register state (in general the
corrupted register is the one near or at the top of the stack).
Sometimes the corruption shows up as %eip off in the weeds (either at
NULL-ish addresses, or executing the stack).
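For anyone not familiar with the path in question, here is a rough C sketch of what the enable step described above does logically: unmask event delivery, then poll for anything that became pending while masked. The struct, its field names, and force_event_delivery() are made up for illustration and are not the actual paravirt or Xen code; the real xen_enable_irq also saves and restores %ebx on the stack around this, which is where I suspect the damage happens.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vcpu event state shared with Xen. */
struct vcpu_events {
    volatile uint8_t pending;  /* nonzero when an event is waiting */
    volatile uint8_t masked;   /* nonzero while event delivery is masked */
};

/* Counts simulated deliveries so the sketch can be checked. */
static int upcalls_delivered;

/*
 * Hypothetical stub. In a real guest this is the point where the
 * pending event (e.g. the timer) would actually be delivered.
 */
static void force_event_delivery(struct vcpu_events *v)
{
    assert(!v->masked);        /* delivery only makes sense when unmasked */
    v->pending = 0;
    upcalls_delivered++;
}

/* Sketch of the enable path: unmask first, then poll for pending events. */
static void sketch_enable_irq(struct vcpu_events *v)
{
    v->masked = 0;             /* "set the flag state" */
    if (v->pending)            /* "poll for pending events" */
        force_event_delivery(v);
}
```

The ordering is the point: an event that arrived while masked is only noticed by the poll after the unmask, so the handler runs inside the enable call, while the caller's saved register is still sitting on the stack.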
I'm speculating that the sequence is:
1. change pagetable; this creates a deferred pagetable update
2. enable events
3. handle pending timer interrupt, which also does a deferred
pagetable update
4. resume running with corrupted stack
But I don't really know enough about how shadow2 works to know if that's
really plausible. Maybe a vcpu/guest context switch is a part of the
sequence.
I wonder if the stack corruption is caused by a mismatch of exception
frame formats between exception->iret?
All a bit handwavy, but I haven't really managed to make much headway.
I spent some time assuming it was a bug on my side, but the fact that
all these symptoms go away with pre-shadow2 Xen makes me point the
finger over the wall.
Or perhaps it really is just a qemu bug, but I can't imagine that
shadow2 exercises qemu's emulated CPU exception stuff in a way which
normal Xen doesn't... I think it's more likely that there's some race
which is much more easily triggered by qemu's slow speed.
Any thoughts or ideas?
J
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel