> Subject: RE: Question on XenVbd_HwScsiResetBus in PV driver
> Date: Fri, 22 Jul 2011 20:44:34 +1000
> From: james.harper@xxxxxxxxxxxxxxxx
> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>
> >
> > Hi:
> >
> > Well, AFAIK, there is a KeSwapProcessOrStack thread in the Windows kernel
> > that swaps thread kernel stacks in and out, and it can cause BSOD codes
> > 0x77/0x7E, which mean the paging IO request could not complete
> > successfully because of a disk failure. This is reproducible by
> > periodically running "gdb attach tapdisk" on the process in dom0, to
> > simulate a long IO response time, longer than 10s.
> >
> > In fact, the IO stream from tapdisk is written to our own storage
> > cluster, which supports failover, but failover takes time, so during
> > failover the IO hangs from the VM side. When this happens, we get some
> > bluescreens.
> >
> > Also, I've done some experiments testing two scenarios:
> > 1) use the current XenVbd_HwScsiResetBus, that is, complete IO with
> > SRB_STATUS_BUS_RESET
> > 2) do nothing in XenVbd_HwScsiResetBus
> > Just using gdb on tapdisk to hold IO periodically, it shows that 1) has
> > a higher probability of bluescreen than 2) (in fact, we haven't met a
> > bluescreen in 2)).
> >
> > From the log, I see XenVbd_HwScsiResetBus every 14 seconds (10 seconds +
> > 4s hold time) in scenario 1), but in 2) I just saw a few of them (fewer
> > than 10). It looks like the driver calls resetbus only a few times.
> >
> > So, I have the assumptions and questions below:
> > 1) Only some of the IO failures will cause a BSOD.
> > 2) Doing nothing in XenVbd_HwScsiResetBus is relatively good for
> > minimizing the bluescreen possibility.
> > 3) I am still confused about how XenVbd_HwScsiResetBus is called, and
> > why XenVbd_HwScsiResetBus is not called when nothing is done in
> > XenVbd_HwScsiResetBus.
> > 4) Is it OK to do nothing in XenVbd_HwScsiResetBus?
> >
> > Could you help to clarify? Many thanks.
> >
>
> When Windows calls a driver's HwScsiResetBus, the driver is supposed to
> perform the procedure described at
> http://msdn.microsoft.com/en-us/library/ff565331%28v=vs.85%29.aspx which
> is basically to cancel the IO and return all SRBs with a status of
> SRB_STATUS_BUS_RESET.
>
> It occurs to me that completing the requests while Dom0 still owns the
> buffers really is the wrong thing to do. Windows might reuse the buffers
> for something else while Dom0 might still write to them, which may well
> cause the crash you are seeing.
>
> There really is no mechanism for a DomU to indicate a reset to Dom0, so
> all we are doing is emulating it. We still have to wait for any
> outstanding requests. The only alternative would be to close and re-open
> the device, but that can't be done from within scsiport.
If something really goes wrong in the Dom0 backend, it looks like there
is no chance to recover from the DomU side, even by re-opening the device,
since the new connection would fail too.
So, with regard to a normal long IO response time caused by heavy IO load,
I think waiting indefinitely in the PV driver is enough; the IO will come
back eventually.
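
To make the idea concrete, here is a rough sketch in plain C of what "wait
for the backend, then report the reset" could look like. This is not the
XenVbd code: every name in it (struct device, submit, reset_bus,
backend_response, the ST_* values) is made up for illustration, and it is a
user-space model, not a scsiport driver. The only point is that the reset
handler flags in-flight requests instead of completing them, so no buffer is
handed back to Windows while Dom0 may still write to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for SRB status codes; the real ones live in the
 * Windows DDK headers and are not reproduced here. */
enum srb_status { ST_PENDING, ST_SUCCESS, ST_BUS_RESET };

#define MAX_REQUESTS 8

struct request {
    bool in_flight;          /* backend (Dom0) still owns the buffer */
    bool reset_seen;         /* a bus reset arrived while in flight */
    enum srb_status status;  /* what has been reported to the OS so far */
};

struct device {
    struct request req[MAX_REQUESTS];
};

/* Hand request i to the backend. */
void submit(struct device *d, size_t i)
{
    assert(i < MAX_REQUESTS);
    d->req[i].in_flight = true;
    d->req[i].reset_seen = false;
    d->req[i].status = ST_PENDING;
}

/* Reset handler: completes nothing. It only flags requests that are still
 * in flight and returns how many were flagged. */
size_t reset_bus(struct device *d)
{
    size_t flagged = 0;
    for (size_t i = 0; i < MAX_REQUESTS; i++) {
        if (d->req[i].in_flight) {
            d->req[i].reset_seen = true;
            flagged++;
        }
    }
    return flagged;
}

/* Backend response for request i: only now is the buffer safe to give back.
 * A request that saw a reset is reported as ST_BUS_RESET. */
enum srb_status backend_response(struct device *d, size_t i)
{
    assert(i < MAX_REQUESTS);
    assert(d->req[i].in_flight);
    d->req[i].in_flight = false;
    d->req[i].status = d->req[i].reset_seen ? ST_BUS_RESET : ST_SUCCESS;
    return d->req[i].status;
}
```

With this shape, a slow backend only delays completion; it never leaves
Windows and Dom0 both believing they own the same buffer.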
> I'll see what needs to be done to fix this bug.
>
> James