>
> Hi:
>
> Well, AFAIK, there is a KeSwapProcessOrStack thread in the Windows kernel
> that swaps thread kernel stacks in and out, and it can cause BSOD code
> 0x77/0x7E, which means a paging IO request could not be completed
> successfully because of a disk failure. This is reproducible by
> periodically running "gdb attach tapdisk" against the tapdisk process in
> dom0 to simulate a long IO response time, longer than 10s.
>
> In fact, the IO stream from tapdisk is written to our own storage
> cluster, which supports failover, but failover takes time, so during a
> failover the IO hangs from the VM side. When this happens, we hit some
> bluescreens.
>
> I've also done some experiments, testing two scenarios:
> 1) use the current XenVbd_HwScsiResetBus, which completes IO with
> SRB_STATUS_BUS_RESET
> 2) do nothing in XenVbd_HwScsiResetBus
> Just using gdb on tapdisk to hold IO periodically, 1) shows a higher
> probability of a bluescreen than 2) (in fact, we haven't seen a
> bluescreen in 2)).
>
> From the log, I see XenVbd_HwScsiResetBus every 14 seconds (10 seconds +
> 4 seconds hold time) in scenario 1), but in 2) I saw just a few of them
> (fewer than 10). It looks like the driver calls resetbus only a few
> times.
>
> So, I have the following assumptions and questions:
> 1) Only some of the IO failures cause a BSOD.
> 2) Doing nothing in XenVbd_HwScsiResetBus is relatively good for
> minimizing the possibility of a bluescreen.
> 3) I am still confused about how XenVbd_HwScsiResetBus gets called, and
> why it is not called again when XenVbd_HwScsiResetBus does nothing.
> 4) Is it OK to do nothing in XenVbd_HwScsiResetBus?
>
> Could you help to clarify? Many thanks.
>
When Windows calls a driver's HwScsiResetBus routine, the driver is
supposed to perform the procedure described at
http://msdn.microsoft.com/en-us/library/ff565331%28v=vs.85%29.aspx which
is basically to cancel the IO and return all SRBs with a status of
SRB_STATUS_BUS_RESET.
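As a rough illustration only (this is a standalone model, not the actual
xenvbd code), the emulated reset amounts to something like the sketch
below. The model_* names and MAX_OUTSTANDING are made up for the example;
only the SRB_STATUS_* values are the ones from the DDK's srb.h.

```c
#include <assert.h>
#include <stddef.h>

/* Status values as defined in the Windows DDK header srb.h. */
#define SRB_STATUS_PENDING   0x00
#define SRB_STATUS_BUS_RESET 0x0E

#define MAX_OUTSTANDING 32    /* hypothetical number of request slots */

/* Simplified stand-in for a SCSI request block (hypothetical type). */
typedef struct {
    int in_use;               /* request has been handed to the backend */
    unsigned char srb_status; /* status reported back to Windows */
} model_srb;

typedef struct {
    model_srb slots[MAX_OUTSTANDING];
} model_device;

/* Emulated bus reset: complete every outstanding request with
 * SRB_STATUS_BUS_RESET. Returns the number of requests completed. */
static int model_reset_bus(model_device *dev)
{
    int completed = 0;

    assert(dev != NULL);
    for (size_t i = 0; i < MAX_OUTSTANDING; i++) {
        model_srb *srb = &dev->slots[i];
        if (!srb->in_use)
            continue;
        srb->srb_status = SRB_STATUS_BUS_RESET;
        srb->in_use = 0;      /* Windows is now free to reuse the buffer */
        completed++;
    }
    return completed;
}
```

Note that the model completes requests immediately, without checking
whether the backend has finished with them.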
It occurs to me that completing the requests while Dom0 still owns the
buffers really is the wrong thing to do. Windows might reuse the buffers
for something else while Dom0 might still write to them, which may well
cause the crash you are seeing.
There really is no mechanism for a DomU to indicate a reset to Dom0, so
all we are doing is emulating it. We still have to wait for any
outstanding requests. The only alternative would be to close and re-open
the device, but that can't be done from within scsiport.
I'll see what needs to be done to fix this bug.
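One possible shape for this, purely as a sketch under my own assumptions
and not the eventual fix: on reset, only flag the in-flight requests, and
complete each one with SRB_STATUS_BUS_RESET once Dom0 has responded and
given the buffer back. The ring_* names and RING_SLOTS are hypothetical;
the SRB_STATUS_* values are the ones from srb.h.

```c
#include <assert.h>
#include <stddef.h>

/* Status values as defined in the Windows DDK header srb.h. */
#define SRB_STATUS_PENDING   0x00
#define SRB_STATUS_SUCCESS   0x01
#define SRB_STATUS_BUS_RESET 0x0E

#define RING_SLOTS 32         /* hypothetical number of request slots */

typedef struct {
    int owned_by_backend;     /* Dom0 may still read or write the buffer */
    int reset_pending;        /* a bus reset arrived while in flight */
    unsigned char srb_status; /* status reported back to Windows */
} ring_slot;

typedef struct {
    ring_slot slots[RING_SLOTS];
} ring_model;

/* Bus reset: flag in-flight requests but do NOT complete them yet.
 * Returns the number of requests flagged. */
static int ring_reset_bus(ring_model *ring)
{
    int flagged = 0;

    assert(ring != NULL);
    for (size_t i = 0; i < RING_SLOTS; i++) {
        if (ring->slots[i].owned_by_backend) {
            ring->slots[i].reset_pending = 1;
            flagged++;
        }
    }
    return flagged;
}

/* Backend response for slot `id`: the buffer is ours again, so it is
 * now safe to complete the request. Returns the status reported to
 * Windows, or -1 if the slot was not in flight. */
static int ring_on_response(ring_model *ring, size_t id)
{
    assert(ring != NULL);
    if (id >= RING_SLOTS || !ring->slots[id].owned_by_backend)
        return -1;

    ring_slot *slot = &ring->slots[id];
    slot->owned_by_backend = 0;
    slot->srb_status = slot->reset_pending ? SRB_STATUS_BUS_RESET
                                           : SRB_STATUS_SUCCESS;
    slot->reset_pending = 0;
    return slot->srb_status;
}
```

The point of the model is that the buffer is never handed back to
Windows while Dom0 might still write to it; the cost is that a reset
cannot finish until the backend has answered.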
James
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel