
[Xen-devel] RE: PV resume failed after self migration failed



> Subject: RE: PV resume failed after self migration failed
> Date: Mon, 20 Jun 2011 09:11:59 +1000
> From: james.harper@xxxxxxxxxxxxxxxx
> To: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>
> > >
> > > Windows will invoke a scsi reset if a request takes too long to
> > > complete (5 seconds I think). It will also issue a reset when a crash
> > > dump starts, just to make sure all previous requests are flushed etc.
> > >
> > Thanks for the help, and sorry for the late response; I was away for a
> > while last weekend.
> >
> > If the VBD is already suspended, any further IO that is issued will find
> > that the vbd state is not SR_STATE_RUNNING, and so the driver calls
> > ScsiPortNotification to notify RequestComplete, right?
> >
> > If so, I have a hypothesis.
> > At time t the VBD is suspended and an IO is issued, but before the driver
> > calls ScsiPortNotification the whole VM is paused (VCPU paused, the last
> > step). If the VM resumes 10 or more seconds later, will the driver find
> > that the IO mentioned above has already timed out and trigger
> > XenVbd_HwScsiResetBus?
> >
>
> The xenvbd driver doesn't do any timeout, windows does the timeout and
> tells xenvbd to reset. I haven't tested the scenario you describe very
> recently, and xenvbd is now two different drivers, one for scsiport (<=
> 2003) and one for storport (>= Vista), so there could be bugs in either.
>
The bug can be reproduced on a 32-bit Windows 2003 system, so we are using the scsiport driver.
I added some logging to XenVbd_HwScsiResetBus to see whether any srbs had not completed (see below),
but the log did not appear when XenVbd_HwScsiResetBus was called, so no IO was in the queue.
 
    /* log every shadow entry that still holds an srb, i.e. a request that has not completed */
    for (i = 0; i < MAX_SHADOW_ENTRIES; i++)
    {
      if (xvdd->shadows[i].srb)
      {
        KdPrint((__DRIVER_NAME "    in-flight srb %p with status SRB_STATUS_BUS_RESET\n", xvdd->shadows[i].srb));
      }
    }
 
 
Right now, I don't think the problem is related to the bus reset. From the log, it looks like an event is not being acked.
The log shows that the PV resume waits for xppdd->device_state.suspend_resume_state_fdo to change, but the change never happens.
 
That is: XenPci_Pdo_Resume -> XenPci_Pdo_ChangeSuspendState(device, SR_STATE_RESUMING)
-> KeWaitForSingleObject(&xpdd->pdo_suspend_event, Executive, KernelMode, FALSE, NULL);
Because the timeout argument is NULL, this wait blocks until the event is signalled.
The state change is expected to happen in XenVbd_HwScsiInterrupt,
but for some reason the if statement in XenVbd_HwScsiInterrupt (xenvbd_scsiport.c:920) takes the branch that returns FALSE.
 
    /* in dump mode I think we get called on a timer, not by an actual IRQ */
    if (!dump_mode && !xvdd->vectors.EvtChn_AckEvent(xvdd->vectors.context, xvdd->event_channel, &last_interrupt))
      return FALSE; /* interrupt was not for us */
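
To make the check above concrete, here is a minimal, self-contained sketch of the test-and-clear pattern I assume EvtChn_AckEvent follows. This is not the xenpci code: unacked_events, mark_event, ack_event and device_isr are hypothetical names, and a single 32-bit word stands in for whatever per-CPU state the real driver keeps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the "unacknowledged events" word.
 * Assumption: one 32-bit word, so only ports 0..31 are modelled. */
static _Atomic uint32_t unacked_events;

/* Simulated top-level interrupt: record that `port` has fired. */
static void mark_event(unsigned port)
{
    assert(port < 32);
    atomic_fetch_or(&unacked_events, UINT32_C(1) << port);
}

/* Atomically clear the bit for `port`; return true only if it was set. */
static bool ack_event(unsigned port)
{
    assert(port < 32);
    uint32_t mask = UINT32_C(1) << port;
    uint32_t old = atomic_fetch_and(&unacked_events, ~mask);
    return (old & mask) != 0;
}

/* Simulated device ISR using the same shape of check as the snippet above:
 * outside dump mode, bail out unless our own port had a pending event. */
static bool device_isr(unsigned my_port, bool dump_mode)
{
    if (!dump_mode && !ack_event(my_port))
        return false; /* interrupt was not for us */
    /* ... the suspend/resume state handling would run here ... */
    return true;
}
```

Under this model, if the bit that is set belongs to a different port than the one the device acks, the ISR keeps returning false, the state handling is never reached, and the bit stays set for someone else to report as unacknowledged.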
 
Because the event is not acked, EvtChn_EvtInterruptIsr prints a log line like "Unacknowledged event word = 0, val = 00000200":
 
12952670574140: XenPCI --> XenPci_BalloonEnableHandler
12952670574140: XenPCI     Unacknowledged event word = 0, val = 00000200 
12952670574140: XenPCI  receive balloon enable = (1308226300.21:0)
12952670574156: XenPCI     Balloon enable change to 0
12952670574156: XenPCI  successfull got BalloonEnableChangedEvent
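
To read the "Unacknowledged event" line: 0x00000200 has only bit 9 set. If "word" indexes 32-bit words of a pending bitmap (an assumption on my part, I have not confirmed it in the xenpci source), that corresponds to event channel port 0 * 32 + 9 = 9. A small sketch of that decoding, where decode_pending_word is a hypothetical helper and not a driver function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: each pending word is 32 bits wide (32-bit guest),
 * so port = word_index * 32 + bit_index. */
#define BITS_PER_EVENT_WORD 32u

/* Store the port number of every set bit in `val` into `ports`
 * (at most `max_ports` entries) and return how many bits were set. */
static size_t decode_pending_word(unsigned word_index, uint32_t val,
                                  unsigned *ports, size_t max_ports)
{
    size_t count = 0;
    assert(ports != NULL || max_ports == 0);
    for (unsigned bit = 0; bit < BITS_PER_EVENT_WORD; bit++) {
        if (val & (UINT32_C(1) << bit)) {
            if (count < max_ports)
                ports[count] = word_index * BITS_PER_EVENT_WORD + bit;
            count++;
        }
    }
    return count;
}
```

Comparing the decoded port with xvdd->event_channel would show whether the unacknowledged event really belongs to the vbd device or to something else.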
 
I will take a closer look at EvtChn_EvtInterruptIsr to understand this better. Thanks.

> James
