[Xen-devel] shadow2 corrupting PV guest state

I've been fighting random crashes in the paravirt tree for a while. After a fair amount of head-banging, it looks to me like the shadow2 code is trashing the guest stack (and maybe register state) at random points.

If I boot a kernel with CONFIG_DEBUG_PAGEALLOC enabled (which dramatically increases the rate of pagetable modifications), it rarely makes it through early boot without some random crash. The crashes are often in the same place, but they do move around; they tend to be near places where the pagetable is touched. It may also interact with timer events; certainly masking events seems to help a bit.

I tend to see this a lot more when running under qemu, but I've also seen strange things happen on real hardware. If I roll Xen back to pre-shadow2 (change fda70200da01), all these mysterious crashes disappear. Looking into it a bit more deeply, the kinds of crash I'm seeing are along the lines of:

   mov (%ebx), %eax       # works; %ebx is a valid pointer
   call xen_enable_irq
   mov %eax, (%ebx)   # crashes; %ebx will equal 0, 1, or something bad

where xen_enable_irq will have pushed %ebx, set the flag state, polled for pending events and popped %ebx. My suspicion is that something about re-enabling interrupts is causing the on-stack version of %ebx to get trashed, rather than the actual %ebx register state (in general the corrupted register is the one near or at the top of the stack). Sometimes the corruption shows up as %eip off in the weeds (either at NULL-ish addresses, or executing the stack).
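
For readers who don't have the paravirt tree handy, here is a rough C sketch of the steps just described: unmask events, then poll for anything that became pending while they were masked. It is an illustration only, not the actual xen_enable_irq code. The struct, its field names, and force_event_callback() are hypothetical stand-ins, and the push/pop of %ebx is left out because a C compiler handles callee-saved registers itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vcpu event flags shared with the
 * hypervisor; the names are illustrative, not taken from the Xen headers. */
struct vcpu_flags {
    volatile uint8_t upcall_pending;  /* nonzero when an event is waiting */
    volatile uint8_t upcall_mask;     /* nonzero while events are masked  */
};

/* Counts how often the poll found something to deliver (for the sketch). */
static int callbacks_forced;

/* Hypothetical placeholder for asking the hypervisor to deliver pending
 * events; here it just records the call and clears the pending flag. */
static void force_event_callback(struct vcpu_flags *v)
{
    callbacks_forced++;
    v->upcall_pending = 0;
}

/* Sketch of "set the flag state, poll for pending events". */
static void enable_events(struct vcpu_flags *v)
{
    assert(v);  /* sketch only: no real error handling */

    v->upcall_mask = 0;  /* unmask: events may be delivered from here on */

    /* Compiler barrier so the unmask is not reordered after the check. */
    __asm__ __volatile__("" ::: "memory");

    /* Anything that arrived while masked has to be picked up by hand. */
    if (v->upcall_pending)
        force_event_callback(v);
}
```

The window of interest is between the unmask and the return: any event delivered there runs on the same stack, just below the saved %ebx.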

I'm speculating that the sequence is:

  1. change pagetable; this creates a deferred pagetable update
  2. enable events
  3. handle pending timer interrupt, which also does a deferred
     pagetable update
  4. resume running with corrupted stack

But I don't really know enough about how shadow2 works to know if that's really plausible. Maybe a vcpu/guest context switch is a part of the sequence. I wonder if the stack corruption is caused by a mismatch of exception frame formats between exception->iret?

All a bit handwavy, but I haven't really managed to make much headway. I spent some time assuming it was a bug on my side, but the fact that all these symptoms go away with pre-shadow2 Xen makes me point the finger over the wall.

Or perhaps it really is just a qemu bug, but I can't imagine that shadow2 exercises qemu's emulated CPU exception stuff in a way which normal Xen doesn't... I think it's more likely that there's some race which is much more easily triggered by qemu's slow speed.

Any thoughts or ideas?

