On Thu, 2010-02-25 at 18:18 -0500, Jeremy Fitzhardinge wrote:
> On 02/24/2010 07:03 PM, Daniel Stodden wrote:
> > On Wed, 2010-02-24 at 20:47 -0500, Daniel Stodden wrote:
> >
> >> On Wed, 2010-02-24 at 19:37 -0500, Jeremy Fitzhardinge wrote:
> >>
> >>> On 02/24/2010 04:29 PM, Daniel Stodden wrote:
> >>>
> >>>> On Wed, 2010-02-24 at 18:52 -0500, Jeremy Fitzhardinge wrote:
> >>>>
> >>>>
> >>>>> On 02/24/2010 03:49 PM, Daniel Stodden wrote:
> >>>>>
> >>>>>
> >>>>>> On Wed, 2010-02-24 at 17:55 -0500, Jeremy Fitzhardinge wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> When rebooting the machine, I got this crash from blktap. The rip
> >>>>>>> maps to line 262 in
> >>>>>>> 0xffffffff812548a1 is in blktap_request_pool_free
> >>>>>>> (/home/jeremy/git/linux/drivers/xen/blktap/request.c:262).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>> Uhm, where did that RIP come from?
> >>>>>>
> >>>>>> pool_free is on the module exit path. The stack trace below looks
> >>>>>> like a crash from the broadcast SIGTERM before reboot.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>> Ignore it; I generated it from a different kernel from the one that
> >>>>> crashed. But the other oops I posted should be all consistent and
> >>>>> meaningful.
> >>>>>
> >>>>>
> >>>> Ignore only the debuginfo quote, right?
> >>>> Cos this looks like a different issue to me.
> >>>>
> >>>>
> >>> Perhaps. I got all the others on normal domain shutdown, but this one
> >>> was on machine reboot. I'll try to repro (as I boot the test kernel
> >>> with your patch in it).
> >>>
> >> (gdb) list *(blktap_device_restart+0x7a)
> >> 0x2a73 is in blktap_device_restart
> >> (/local/exp/dns/scratch/xenbits/xen-unstable.hg/linux-2.6-pvops.git/drivers/xen/blktap/device.c:920).
> >> 915 /* Re-enable calldowns. */
> >> 916 if (blk_queue_stopped(dev->gd->queue))
> >> 917 blk_start_queue(dev->gd->queue);
> >> 918
> >> 919 /* Kick things off immediately. */
> >> 920 blktap_device_do_request(dev->gd->queue);
> >> 921
> >> 922 spin_unlock_irq(&dev->lock);
> >> 923 }
> >> 924
> >>
> >> Assuming we've been dereferencing a NULL gendisk, i.e. device_destroy
> >> racing against device_restart.
> >>
> >> That would take:
> >>
> >>  * Tapdisk getting killed on the other thread, which then goes
> >>    through device_restart(). That's what your stack trace shows.
> >>
> >> * Device removal pending, blocking until
> >> device->users drops to 0, then doing the device_destroy().
> >> That might have happened during bdev .release.
> >>
> >> Both running at the same time sounds like what happens if you kill them
> >> all at once.
> >>
> >> That clearly takes another patch then.
> >>
> > Jeremy,
> >
> > can you try out the attached patch for me?
> >
> > This should close the above shutdown race as well.
> >
> > Should be nowhere near as frequent as the timer_sync crash fixed earlier.
> >
>
> Hm, the two patches changed things but I'm still seeing problems on
> domain shutdown. Still looks like use-after-free.
All these new-fashioned debug switches. Only causing trouble.
This is yet a different piece. The sysfs code was causing a double unref
on the ring device.
Daniel
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel