Hi,
As far as I know, the Windows kernel has a KeSwapProcessOrStack thread that swaps thread
kernel stacks in and out, and this can lead to BSOD code 0x77/0x7E, which means a paging I/O request
could not complete successfully because of a disk failure. This is reproducible by periodically attaching
gdb to the tapdisk process in dom0 ("gdb attach"), which simulates a long I/O response time of more than 10s.
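For reference, the periodic hold can be scripted roughly as below. This is only a minimal sketch of the idea, not the exact script used: the helper names, the 12s hold, the 30s run interval and the number of rounds are all assumptions for illustration. It relies on the fact that a process is stopped while gdb is attached to it, so tapdisk cannot answer I/O until gdb exits and detaches.

```python
import subprocess
import time


def build_gdb_hold_command(pid, hold_seconds):
    """Build a gdb command line that attaches to pid, waits, then detaches.

    The target process is stopped for as long as gdb stays attached.
    "shell sleep N" only makes gdb wait N seconds; in batch mode gdb
    then exits, which detaches and lets the process continue.
    """
    return ["gdb", "-batch", "-p", str(pid),
            "-ex", "shell sleep %d" % hold_seconds]


def hold_periodically(pid, hold_seconds=12, run_seconds=30, rounds=10):
    """Stall the process for hold_seconds, let it run, and repeat.

    All default values are illustrative assumptions; the hold only needs
    to be longer than 10s to trigger the behaviour described above.
    """
    for _ in range(rounds):
        subprocess.call(build_gdb_hold_command(pid, hold_seconds))
        time.sleep(run_seconds)
```

Calling hold_periodically(<tapdisk pid>) as root in dom0 would stall the VM's I/O for about 12s in each round.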
In fact, the I/O stream from tapdisk is written to our own storage cluster. The cluster supports
failover, but failover takes time, so during a failover the I/O hangs from the VM's point of view. When this
happens, we hit some bluescreens.
I've also done some experiments, testing two scenarios:
1) use the current XenVbd_HwScsiResetBus, which completes the I/O with SRB_STATUS_BUS_RESET
2) do nothing in XenVbd_HwScsiResetBus
Using only gdb on tapdisk to hold I/O periodically, 1) shows a higher probability of a bluescreen
than 2) (in fact, we haven't hit a bluescreen in 2) at all).
From the log, I see XenVbd_HwScsiResetBus every 14 seconds (10 seconds + 4s hold time)
in scenario 1), but in 2) I saw only a few of them (fewer than 10). It looks like the driver's reset-bus
routine is called only a few times in that case.
So I have the following assumptions and questions:
1) Only some of the I/O failures cause a BSOD.
2) Doing nothing in XenVbd_HwScsiResetBus is relatively good for minimizing the chance of a bluescreen.
3) I am still confused about how XenVbd_HwScsiResetBus gets called, and why it is hardly
called when nothing is done in it.
4) Is it OK to do nothing in XenVbd_HwScsiResetBus?
Could you help clarify? Many thanks.