Some background:
Now a 32-bit HVM SMP Windows guest with the PV drivers hangs randomly. Sometimes the problem occurs while the drivers are loading, and sometimes when the guest is destroyed. Eventually Xen0 hangs as well. We are debugging this issue.
With the great help of Kevin Tian, we have finally found two deadlock issues with HVM SMP guests. The deadlocks are described below. Suppose we have two vcpus.
1) One vcpu is holding the BIGLOCK and wants to acquire the shadow_lock. At the same time, the other vcpu is holding the shadow_lock and wants to walk the P2M table. The faulting pfn is near the 4G boundary, for example 0xfee00, and of course the va for that P2M table entry has never been mapped. So when this vcpu tries to walk the P2M table, a page fault in the Xen address area occurs. The current do_page_fault() calls spurious_page_fault() to test whether the fault is real or spurious, but spurious_page_fault() first tries to acquire the BIGLOCK. So: deadlock.
2) When the guest is destroyed, Xen calls domain_shutdown_finalise(). That function first acquires the BIGLOCK and then calls vcpu_sleep_sync(), which waits for the other vcpu to change state. But that vcpu is currently inside spurious_page_fault(), and spurious_page_fault() is trying to acquire the BIGLOCK. So this is another deadlock.
Is there anything wrong with this description? If we're right, does spurious_page_fault() really need to hold the BIGLOCK? We have an ugly workaround that reduces how often the spurious page fault occurs: when the P2M table is allocated, we map the entire 4G P2M table area and fill it with INVALID_MFN. With this workaround, the 32-bit HVM SMP Windows guest with PV drivers runs much more smoothly and can be destroyed successfully. But we have no elegant solution yet. :-(
Does anyone have good suggestions? Any comments are welcome.
Thanks
Xioahui