Re: [Xen-devel] Prepping a bugfix push

To: Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Subject: Re: [Xen-devel] Prepping a bugfix push
From: Brendan Cully <brendan@xxxxxxxxx>
Date: Thu, 3 Dec 2009 11:35:40 -0800
Cc: Ian Campbell <Ian.Campbell@xxxxxxxxxxxxx>, Paolo Bonzini <pbonzini@xxxxxxxxxx>, Xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Delivery-date: Thu, 03 Dec 2009 11:36:05 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4B1810DF.40309@xxxxxxxx>
Mail-followup-to: jeremy@xxxxxxxx, Ian.Campbell@xxxxxxxxxxxxx, konrad.wilk@xxxxxxxxxx, pbonzini@xxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxx
References: <4B1810DF.40309@xxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.20 (2009-10-28)

Not a patch, but I've just tried out xm save -c again with the latest
xen changes, and while I no longer see the grant table version panic,
the guest's devices (aside from the console) appear to be wedged on
resume. Is anyone else seeing this?

After a while on the console I see messages like this:

INFO: task syslogd:2219 blocked for more than 120 seconds.

which I assume is trouble with the block device.

On Thursday, 03 December 2009 at 11:26, Jeremy Fitzhardinge wrote:
> I'm preparing a general bugfix push for Linus, targeted at both current
> linux-2.6.git and stable.  The list of patches I have lined up (in the
> "bugfix" branch) are below.  Is there anything I've overlooked?  Are
> there any patches I've forgotten to apply altogether?
> 
> (Note, this is all domU stuff; dom0 things will need to mature a bit.)
> 
> Thanks,
>     J
> 
> commit b4606f2165153833247823e8c04c5e88cb3d298b
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:15 2009 +0000
> 
>     xen: explicitly create/destroy stop_machine workqueues outside suspend/resume region.
>     
>     I have observed cases where the implicit stop_machine_destroy() done by
>     stop_machine() hangs while destroying the workqueues, specifically in
>     kthread_stop(). This seems to be because timer ticks are not restarted
>     until after stop_machine() returns.
>     
>     Fortunately stop_machine provides a facility to pre-create/post-destroy
>     the workqueues so use this to ensure that workqueues are only destroyed
>     after everything is really up and running again.
>     
>     I only actually observed this failure with 2.6.30. It seems that newer
>     kernels are somehow more robust against doing kthread_stop() without timer
>     interrupts (I tried some backports of some likely looking candidates but
>     did not track down the commit which added this robustness). However this
>     change seems like a reasonable belt&braces thing to do.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 65f63384b391bf4d384327d8a7c6de9860290b5c
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 11:47:14 2009 +0000
> 
>     xen: improve error handling in do_suspend.
>     
>     The existing error handling has a few issues:
>     - If freeze_processes() fails it exits with shutting_down = SHUTDOWN_SUSPEND.
>     - If dpm_suspend_noirq() fails it exits without resuming xenbus.
>     - If stop_machine() fails it exits without resuming xenbus or calling
>       dpm_resume_end().
>     - xs_suspend()/xs_resume() and dpm_suspend_noirq()/dpm_resume_noirq() were not
>       nested in the obvious way.
>     
>     Fix by ensuring each failure case goto's the correct label. Treat a failure of
>     stop_machine() as a cancelled suspend in order to follow the correct resume
>     path.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
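
The nesting described in the do_suspend commit above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: the step order, the pairing of dpm_suspend_start with dpm_resume_end and the use of thaw_processes as the undo for freeze_processes are assumptions made for illustration. The point is that a failure at any step undoes exactly the steps that already succeeded, in reverse order, which is what jumping to the correct label achieves.

```python
# Each entry pairs a suspend step with the step that undoes it.
# The ordering is an assumption made for this sketch.
STEPS = [
    ("freeze_processes", "thaw_processes"),
    ("dpm_suspend_start", "dpm_resume_end"),
    ("xs_suspend", "xs_resume"),
    ("dpm_suspend_noirq", "dpm_resume_noirq"),
    ("stop_machine", None),  # nothing of its own to undo in this model
]


def do_suspend(fail_at=None):
    """Return the trace of steps run, with `fail_at` failing if given."""
    trace = []
    undo_stack = []
    for step, undo in STEPS:
        if step == fail_at:
            trace.append(step + " (failed)")
            break
        trace.append(step)
        if undo is not None:
            undo_stack.append(undo)
    # Resume path: undo the completed steps in reverse order. The same
    # unwinding runs after a successful suspend and after a failure.
    while undo_stack:
        trace.append(undo_stack.pop())
    return trace


if __name__ == "__main__":
    for failing in (None, "freeze_processes", "dpm_suspend_noirq", "stop_machine"):
        print(failing, "->", do_suspend(failing))
```

In this model a failing dpm_suspend_noirq still resumes xenbus, and a failing stop_machine follows the same resume path as a cancelled suspend.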
> 
> commit fed5ea87e02aaf902ff38c65b4514233db03dc09
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Dec 1 16:15:30 2009 +0000
> 
>     xen: don't leak IRQs over suspend/resume.
>     
>     On resume irq_info[*].evtchn is reset to 0 since event channel mappings
>     are not preserved over suspend/resume. The other contents of irq_info
>     are preserved to allow rebind_evtchn_irq() to function.
>     
>     However when a device resumes it will try to unbind from the
>     previous IRQ (e.g.  blkfront goes blkfront_resume() -> blkif_free() ->
>     unbind_from_irqhandler() -> unbind_from_irq()). This will fail due to the
>     check for VALID_EVTCHN in unbind_from_irq() and the IRQ is leaked. The
>     device will then continue to resume and allocate a new IRQ, eventually
>     leading to find_unbound_irq() panic()ing.
>     
>     Fix this by changing unbind_from_irq() to handle teardown of interrupts
>     which have type!=IRQT_UNBOUND but are not currently bound to a specific
>     event channel.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
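
The IRQ leak described in the commit above can be reproduced in a toy model. This is a sketch under stated assumptions, not the code from the events driver: the IrqTable class, its method names and the table size are hypothetical, and the `fixed` flag simply switches between the old check (event channel must be valid) and the behaviour the commit message describes (tear down any entry whose type is not IRQT_UNBOUND).

```python
IRQT_UNBOUND = "unbound"
IRQT_EVTCHN = "evtchn"


class IrqTable:
    """Hypothetical stand-in for irq_info[]; event channel 0 means 'invalid'."""

    def __init__(self, nr_irqs):
        self.info = [{"type": IRQT_UNBOUND, "evtchn": 0} for _ in range(nr_irqs)]

    def bind(self, evtchn):
        # Models find_unbound_irq(): take the first unbound slot.
        for irq, entry in enumerate(self.info):
            if entry["type"] == IRQT_UNBOUND:
                entry["type"] = IRQT_EVTCHN
                entry["evtchn"] = evtchn
                return irq
        raise RuntimeError("no free IRQ")  # stands in for the panic()

    def resume(self):
        # Event channel mappings are lost; everything else is kept.
        for entry in self.info:
            entry["evtchn"] = 0

    def unbind(self, irq, fixed):
        entry = self.info[irq]
        if entry["evtchn"] != 0:
            # Normal case: the IRQ is bound to a valid event channel.
            entry["evtchn"] = 0
            entry["type"] = IRQT_UNBOUND
        elif fixed and entry["type"] != IRQT_UNBOUND:
            # Fixed case: typed but not currently bound, so still free it.
            entry["type"] = IRQT_UNBOUND


def cycles_until_exhaustion(nr_irqs, cycles, fixed):
    """Return the suspend/resume cycle at which binding fails, or None."""
    table = IrqTable(nr_irqs)
    irq = table.bind(evtchn=1)
    for n in range(cycles):
        table.resume()
        table.unbind(irq, fixed)      # device resume: drop the old IRQ ...
        try:
            irq = table.bind(evtchn=1)  # ... and allocate a new one
        except RuntimeError:
            return n
    return None


if __name__ == "__main__":
    print("old check:", cycles_until_exhaustion(4, 10, fixed=False))
    print("new check:", cycles_until_exhaustion(4, 10, fixed=True))
```

With the old check every resume consumes one more slot until the table is full; with the new check the same slot is reused indefinitely.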
> 
> commit f6eafe3665bcc374c66775d58312d1c06c55303f
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Wed Nov 25 14:12:08 2009 +0000
> 
>     xen: call clock resume notifier on all CPUs
>     
>     tick_resume() is never called on secondary processors. Presumably this
>     is because they are offlined for suspend on native and so this is
>     normally taken care of in the CPU onlining path. Under Xen we keep all
>     CPUs online over a suspend.
>     
>     This patch papers over the issue for me but I will investigate a more
>     generic, less hacky, way of doing the same.
>     
>     tick_suspend is also only called on the boot CPU which I presume should
>     be fixed too.
>     
>     Signed-off-by: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
>     Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> 
> commit 6aaf5d633bb6cead81b396d861d7bae4b9a0ba7e
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Wed Nov 25 13:15:38 2009 -0800
> 
>     xen: use iret for return from 64b kernel to 32b usermode
>     
>     If Xen wants to return to a 32b usermode with sysret it must use the
>     right form.  When using VGCF_in_syscall to trigger this, it looks at
>     the code segment and does a 32b sysret if it is FLAT_USER_CS32.
>     However, this is different from __USER32_CS, so it fails to return
>     properly if we use the normal Linux segment.
>     
>     So avoid the whole mess by dropping VGCF_in_syscall and simply use
>     plain iret to return to usermode.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Acked-by: Jan Beulich <jbeulich@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 922cc38ab71d1360978e65207e4a4f4988987127
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:58:49 2009 -0800
> 
>     xen: don't call dpm_resume_noirq() with interrupts disabled.
>     
>     dpm_resume_noirq() takes a mutex, so it can't be called from a no-interrupt
>     context.  Don't call it from within the stop-machine function, but just
>     afterwards, since we're resuming anyway, regardless of what happened.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 499d19b82b586aef18727b9ae1437f8f37b66e91
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 09:38:25 2009 -0800
> 
>     xen: register runstate info for boot CPU early
>     
>     printk timestamping uses sched_clock, which in turn relies on runstate
>     info under Xen.  So make sure we set it up before any printks can
>     be called.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 028896721ac04f6fa0697f3ecac3f98761746363
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 09:32:48 2009 -0800
> 
>     xen: register runstate on secondary CPUs
>     
>     The commit "xen: re-register runstate area earlier on resume" caused us
>     to never try and setup the runstate area for secondary CPUs. Ensure that
>     we do this...
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit f350c7922faad3397c98c81a9e5658f5a1ef0214
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Tue Nov 24 10:16:23 2009 +0000
> 
>     xen: register timer interrupt with IRQF_TIMER
>     
>     Otherwise the timer is disabled by dpm_suspend_noirq() which in turn prevents
>     correct operation of stop_machine on multi-processor systems and breaks
>     suspend.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit fa24ba62ea2869308ffc9f0b286ac9650b4ca6cb
> Author: Ian Campbell <ian.campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 11:32:49 2009 +0000
> 
>     xen: correctly restore pfn_to_mfn_list_list after resume
>     
>     pvops kernels >= 2.6.30 can currently only be saved and restored once. The
>     second attempt to save results in:
>     
>         ERROR Internal error: Frame# in pfn-to-mfn frame list is not in pseudophys
>         ERROR Internal error: entry 0: p2m_frame_list[0] is 0xf2c2c2c2, max 0x120000
>         ERROR Internal error: Failed to map/save the p2m frame list
>     
>     I finally narrowed it down to:
>     
>         commit cdaead6b4e657f960d6d6f9f380e7dfeedc6a09b
>             Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>             Date:   Fri Feb 27 15:34:59 2009 -0800
>     
>                 xen: split construction of p2m mfn tables from registration
>     
>                 Build the p2m_mfn_list_list early with the rest of the p2m table, but
>                 register it later when the real shared_info structure is in place.
>     
>                 Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     
>     The unforeseen side-effect of this change was to cause the mfn list list to not
>     be rebuilt on resume. Prior to this change it would have been rebuilt via
>     xen_post_suspend() -> xen_setup_shared_info() -> xen_setup_mfn_list_list().
>     
>     Fix by explicitly calling xen_build_mfn_list_list() from xen_post_suspend().
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit 3905bb2aa7bb801b31946b37a4635ebac4009051
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Sat Nov 21 08:46:29 2009 +0800
> 
>     xen: restore runstate_info even if !have_vcpu_info_placement
>     
>     Even if have_vcpu_info_placement is not set, we still need to set up
>     the runstate area on each resumed vcpu.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit be012920ecba161ad20303a3f6d9e96c58cf97c7
> Author: Ian Campbell <Ian.Campbell@xxxxxxxxxx>
> Date:   Sat Nov 21 08:35:55 2009 +0800
> 
>     xen: re-register runstate area earlier on resume.
>     
>     This is necessary to ensure the runstate area is available to
>     xen_sched_clock before any calls to printk which will require it in
>     order to provide a timestamp.
>     
>     I chose to pull the xen_setup_runstate_info out of xen_time_init into
>     the caller in order to maintain parity with calling
>     xen_setup_runstate_info separately from calling xen_time_resume.
>     
>     Signed-off-by: Ian Campbell <ian.campbell@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> commit ae7888012969355a548372e99b066d9e31153b62
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:39 2009 +0200
> 
>     xen: wait up to 5 minutes for device connetion
>     
>     Increases the device timeout from 10s to 5 minutes, giving the user a
>     visual indication during that time in case there are problems.  The patch
>     is a backport of changesets 144 and 150 in the Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
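
The longer wait with a visual indication, as described in the commit above, might look roughly like the following loop. This is a minimal sketch with a simulated clock, not the xenbus code: the function signature, the quiet period before the first message, the reporting interval and the message text are all assumptions. Only the 5 minute limit comes from the commit message.

```python
def wait_for_devices(is_connecting, timeout_s=300, quiet_s=10,
                     report_every_s=5, tick_s=1):
    """Poll until no device is still connecting or the timeout expires.

    `is_connecting(elapsed)` is a hypothetical callback that reports whether
    any device is still connecting after `elapsed` simulated seconds.
    Returns (ok, messages).
    """
    elapsed = 0
    messages = []
    while is_connecting(elapsed):
        if elapsed >= timeout_s:
            messages.append("timed out")
            return False, messages
        # Stay quiet at first, then give periodic feedback to the user.
        if elapsed >= quiet_s and (elapsed - quiet_s) % report_every_s == 0:
            messages.append(f"waiting: {timeout_s - elapsed}s left")
        elapsed += tick_s
    return True, messages


if __name__ == "__main__":
    # A device that finishes connecting after 20 simulated seconds.
    ok, msgs = wait_for_devices(lambda t: t < 20)
    print(ok, msgs)
```

A device that connects within the quiet period produces no output at all, so the common case stays silent.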
> 
> commit f8dc33088febc63286b7a60e6b678de8e064de8e
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:38 2009 +0200
> 
>     xen: improvement to wait_for_devices()
>     
>     When printing a warning about a timed-out device, print the
>     current state of both ends of the device connection (i.e., backend as
>     well as frontend).  This backports half of changeset 146 from the
>     Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> 
> commit c6e1971139be1342902873181f3b80a979bfb33b
> Author: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> Date:   Wed Jul 8 12:27:37 2009 +0200
> 
>     xen: fix is_disconnected_device/exists_disconnected_device
>     
>     The logic of is_disconnected_device/exists_disconnected_device is wrong
>     in that they are used to test whether a device is trying to connect (i.e.
>     connecting).  For this reason the patch fixes them to not consider a
>     Closing or Closed device to be connecting.  At the same time the patch
>     also renames the functions according to what they really do; you could
>     say a closed device is "disconnected" (the old name), but not "connecting"
>     (the new name).
>     
>     This patch is a backport of changeset 909 from the Xenbits tree.
>     
>     Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
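
The difference between "disconnected" and "connecting" in the commit above can be made concrete with a small model. This is a sketch, not the xenbus code: the enum below lists the usual xenbus state names without their numeric values, and both predicates are simplified stand-ins written from the commit message rather than copied from the kernel.

```python
from enum import Enum


class XenbusState(Enum):
    UNKNOWN = "Unknown"
    INITIALISING = "Initialising"
    INIT_WAIT = "InitWait"
    INITIALISED = "Initialised"
    CONNECTED = "Connected"
    CLOSING = "Closing"
    CLOSED = "Closed"


def is_disconnected(state):
    """Old test: anything that is not Connected counts."""
    return state is not XenbusState.CONNECTED


def is_connecting(state):
    """New test: Closing and Closed devices are not trying to connect."""
    return state not in (
        XenbusState.CONNECTED,
        XenbusState.CLOSING,
        XenbusState.CLOSED,
    )


if __name__ == "__main__":
    for s in XenbusState:
        print(f"{s.value:13} disconnected={is_disconnected(s)!s:5} "
              f"connecting={is_connecting(s)}")
```

The two predicates disagree only for Closing and Closed, which are the states the boot-time wait should not block on.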
> 
> commit db05fed0ad72f264e39bcb366795f7367384ec92
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
> Date:   Tue Nov 24 16:41:47 2009 -0800
> 
>     xen/xenbus: make DEVICE_ATTR()s static
>     
>     They don't need to be global, and may cause linker clashes.
>     
>     Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>
>     Cc: Stable Kernel <stable@xxxxxxxxxx>
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
> 