To:	"Graham, Simon" <Simon.Graham@xxxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject:	Re: [Xen-devel] DomU crash during migration when suspending source domain
From:	Keir Fraser <Keir.Fraser@xxxxxxxxxxxx>
Date:	Wed, 14 Feb 2007 10:36:09 +0000
Delivery-date:	Wed, 14 Feb 2007 02:35:30 -0800
Envelope-to:	www-data@xxxxxxxxxxxxxxxxxx
In-reply-to:	<C1F89152.1B9E%Keir.Fraser@xxxxxxxxxxxx>

Your theory that cpu_down() is happening too early sounds plausible,
except that cpu_up() and cpu_down() are both entirely protected by the
hotplug lock. See their definitions in kernel/cpu.c.

The notifier calls of interest are CPU_ONLINE and CPU_DEAD. These are the
events that the cacheinfo code cares about. You can see that both
notifications are broadcast under the cpu_hotplug_lock, so there should be
no race possible in which a CPU starts to be taken down before all
notification work associated with it coming online has completed.
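
To make the serialization argument concrete, here is a minimal userspace
sketch of it. It is not the code from kernel/cpu.c: the pthread mutex only
stands in for the hotplug lock, and model_cpu_up(), model_cpu_down() and
cacheinfo_callback() are hypothetical names invented for illustration. Only
the event names CPU_ONLINE and CPU_DEAD come from the discussion above.

```c
/* Userspace model of the locking argument; not the kernel implementation. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 4

enum cpu_event { CPU_ONLINE, CPU_DEAD };    /* event names from the mail */

/* Assumption: a plain mutex stands in for the hotplug lock. */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cpu_is_online[NR_CPUS];
static bool cache_info_ready[NR_CPUS];

/* Hypothetical subscriber standing in for the cacheinfo notifier. */
static void cacheinfo_callback(enum cpu_event ev, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cache_info_ready[cpu] = (ev == CPU_ONLINE);
}

/* Bring a CPU up; the CPU_ONLINE callback runs before the lock is dropped. */
int model_cpu_up(int cpu)
{
    int ret = -1;

    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;
    pthread_mutex_lock(&hotplug_lock);
    if (!cpu_is_online[cpu]) {
        cpu_is_online[cpu] = true;
        cacheinfo_callback(CPU_ONLINE, cpu);
        ret = 0;
    }
    pthread_mutex_unlock(&hotplug_lock);
    return ret;
}

/* Take a CPU down; this cannot start while model_cpu_up() holds the lock. */
int model_cpu_down(int cpu)
{
    int ret = -1;

    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;
    pthread_mutex_lock(&hotplug_lock);
    if (cpu_is_online[cpu]) {
        cpu_is_online[cpu] = false;
        cacheinfo_callback(CPU_DEAD, cpu);
        ret = 0;
    }
    pthread_mutex_unlock(&hotplug_lock);
    return ret;
}
```

In this model, a model_cpu_down(n) call that races with model_cpu_up(n)
blocks on the mutex until the CPU_ONLINE callback for n has returned, which
is the property being claimed for the real lock.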

 -- Keir

On 14/2/07 10:13, "Keir Fraser" <Keir.Fraser@xxxxxxxxxxxx> wrote:

> Is this with a 2.6.16 guest from 3.0.4? This would most likely be a CPU
> hotplug issue in Linux, but we did do lots of testing of that...
> 
>  -- Keir
> 
> On 14/2/07 03:42, "Graham, Simon" <Simon.Graham@xxxxxxxxxxx> wrote:
> 
>> Just ran into an odd DomU crash doing live migration of a 4-VCPU domain
>> (with 3.0.4, but the code looks the same in 2.6.18/unstable to me). The
>> actual panic is attached at the end of this, but the bottom line is that
>> the code in cache_remove_shared_cpu_map (in
>> arch/i386/kernel/cpu/intel_cacheinfo.c) is attempting to clean up the
>> cache info for a processor that does not yet have this info set up: the
>> code is dereferencing a pointer in the cpuid4_info[] array, and looking at
>> the dump I can see that this is NULL.
>> 
>> My working theory here is that we attempted the migration waaay early and
>> the array of cache info pointers had not yet been set up for all
>> processors. It would be relatively easy to protect against this by
>> checking for NULL, but I'm not sure if this is the correct solution or
>> not -- if anyone is familiar with this code and can comment on an
>> appropriate fix I'd be grateful.
>> 
>> One thing I'm really not sure about is the timing of marking the CPUs up
>> with respect to the trace about initializing CPUs (see console output
>> below) -- I can see that the four VCPUs are set up in the cpu_sys_devices
>> array (which is set up by the code that outputs the 'Initializing CPU#n'
>> trace) but the array of cache info structures only has an entry for
>> VCPU 0:
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
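
For reference, the NULL check floated in the original report could look
roughly like the sketch below. This is a userspace model, not a patch
against intel_cacheinfo.c: struct cache_info_model, cpuid4_info_model[] and
remove_shared_cpu_map_model() are hypothetical stand-ins, and the real
structure layout and sibling handling differ. Whether such a guard is the
right fix, rather than a way of hiding an ordering problem, is exactly the
open question in this thread.

```c
/* Userspace sketch of a NULL guard; names and layout are assumptions. */
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU cache info structure. */
struct cache_info_model {
    unsigned long shared_cpu_map;   /* bit n set: cache shared with CPU n */
};

/* Models the array of per-CPU pointers; entries start out NULL. */
static struct cache_info_model *cpuid4_info_model[NR_CPUS];

/*
 * Remove `cpu` from the shared-CPU maps of its siblings.
 * Returns 0 if cleanup ran, 1 if it was skipped because the cache info
 * for `cpu` was never set up.
 */
int remove_shared_cpu_map_model(int cpu)
{
    struct cache_info_model *this_leaf;
    int sibling;

    assert(cpu >= 0 && cpu < NR_CPUS);
    this_leaf = cpuid4_info_model[cpu];
    if (this_leaf == NULL)
        return 1;                   /* nothing to clean up for this CPU */

    for (sibling = 0; sibling < NR_CPUS; sibling++) {
        struct cache_info_model *sibling_leaf = cpuid4_info_model[sibling];

        if (sibling == cpu || sibling_leaf == NULL)
            continue;
        if (this_leaf->shared_cpu_map & (1UL << sibling))
            sibling_leaf->shared_cpu_map &= ~(1UL << cpu);
    }
    return 0;
}
```

With only the entry for CPU 0 populated, as in the dump described above,
calling remove_shared_cpu_map_model() for CPUs 1 to 3 returns 1 instead of
dereferencing a NULL pointer.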



