Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: "Przywara, Andre" <Andre.Przywara@xxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Stephan Diestelhorst <stephan.diestelhorst@xxxxxxx>
Date: Wed, 2 Feb 2011 09:39:06 -0500
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
In-reply-to: <4D483599.1060807@xxxxxxx>
References: <4D41FD3A.5090506@xxxxxxx> <AANLkTi=ppBtb1nhdfbhGZa0Rt6kVyopdS3iJPr5fVA1x@xxxxxxxxxxxxxx> <4D483599.1060807@xxxxxxx>
Hi folks,
  long time no see. :-)

On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
> I asked Stephan Diestelhorst for help and after I convinced him that 
> removing credit and making SEDF the default again is not an option he 
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We 
> will work further on this tomorrow.

Andre and I have been looking into this further, in particular
sanity-checking the invariant

prv->weight >= sdom->weight * sdom->active_vcpu_count

each time someone tweaks the active vcpu count. This happens only in
__csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
to observe the broken invariant when splitting cpupools.
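
For reference, the check we dropped into the two acct functions looks
roughly like this (a sketch of our local debug instrumentation only,
not upstream code; the helper name is made up):

  /* Sanity check: csched_acct() relies on the per-pool weight sum
   * covering every active vcpu of every active domain in that pool.
   * Call with prv->lock held, right after active_vcpu_count has been
   * updated in __csched_vcpu_acct_start() or
   * __csched_vcpu_acct_stop_locked(). */
  static inline void csched_check_weight(struct csched_private *prv,
                                         struct csched_dom *sdom)
  {
      ASSERT(prv->weight >= sdom->weight * sdom->active_vcpu_count);
  }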

We have the following theory of what happens:
* some vcpus of a particular domain are currently in the process of
  being moved to the new pool

* some are still left on the old pool (vcpus_old) and some are already
  in the new pool (vcpus_new)

* we now have vcpus_old->sdom == vcpus_new->sdom and, following from this,
  * vcpus_old->sdom->weight == vcpus_new->sdom->weight
  * vcpus_old->sdom->active_vcpu_count == vcpus_new->sdom->active_vcpu_count

* active_vcpu_count thus does not reflect the actual split of the
  vcpus between the two pools (it may be the sum, or count only the
  old or only the new ones; it does not matter which)

* however, sched_old != sched_new, and thus 
  * sched_old->prv != sched_new->prv
  * sched_old->prv->weight != sched_new->prv->weight

* the prv->weight field hence sees the incremental move of VCPUs
  (through modifications in *acct_start and *acct_stop_locked)

* if at any point during this half-way migration the scheduler runs
  csched_acct, it erroneously checks its per-pool prv->weight against
  an active_vcpu_count that spans both pools
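
To put numbers on the theory (a made-up example; weights and vcpu
counts are purely illustrative, and we assume this is the only domain
with active vcpus in the old pool):

  sdom->weight            = 256
  sdom->active_vcpu_count = 4          (shared between old and new pool)
  prv_old->weight         = 2 * 256 = 512   (only two vcpus still
                                             accounted in the old pool)

  Accounting on the old pool then effectively tests

      prv->weight >= sdom->weight * sdom->active_vcpu_count
      512         >= 256 * 4 = 1024

  which is exactly the broken invariant we observe.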

Workarounds / fixes (none tried):
* disable scheduler accounting while half-way migrating a domain
  (dom->pool_migrating flag and then checking in csched_acct)
* temporarily split the sdom structures while migrating to account for
  transient split of vcpus
* synchronously disable all vcpus, migrate and then re-enable
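
A very rough sketch of the first idea (pool_migrating would be a new
flag on struct domain, set for the duration of the move; the
surrounding loop is paraphrased from memory, not a literal patch):

  /* in csched_acct(), while walking the active domains of this pool
   * with prv->lock held */
  list_for_each_safe( iter_sdom, next_sdom, &prv->active_sdom )
  {
      sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);

      /* Domain is half-way through a cpupool move: its
       * active_vcpu_count spans two pools, so skip it this round
       * instead of tripping over the weight invariant. */
      if ( sdom->dom->pool_migrating )
          continue;

      /* ... normal credit accounting for this domain ... */
  }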

Caveats:
* prv->lock does not guarantee mutual exclusion between the (per-pool)
  instances of the same scheduler in different pools

<rant>
The general locking policy, set against what the comments actually
document, is a nightmare. I know that we have some advanced
data-structure folks here, but intuitively reasoning about when
specific things are atomic and mutually excluded is a pain in the
scheduler / cpupool code; see the issue with the separate prv->locks
above.

E.g. cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay:
* cpupool_unassign_cpu unlocks cpupool_lock
* sets up the continuation calling cpupool_unassign_cpu_helper
* cpupool_unassign_cpu_helper locks cpupool_lock
* intuitively, one would expect both to see a consistent snapshot, so
  dropping the lock in the middle looks like a bad idea
* also communicating continuation-local state through global variables
  mandates that only a single global continuation can be pending

* reading cpu outside of the lock protection in
  cpupool_unassign_cpu_helper also smells
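
For the record, the pattern I am complaining about looks roughly like
this (heavily condensed and paraphrased from xen/common/cpupool.c;
names and details from memory, so treat it as a sketch):

  static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
  {
      spin_lock(&cpupool_lock);
      /* ... sanity checks, mark the cpu as moving, stash it in a
       * global so the continuation can find it again ... */
      spin_unlock(&cpupool_lock);          /* lock dropped here */

      /* continuation runs later, on the cpu being removed */
      return continue_hypercall_on_cpu(cpu, cpupool_unassign_cpu_helper, c);
  }

  static long cpupool_unassign_cpu_helper(void *info)
  {
      /* reads the globally stashed cpu before taking the lock */
      spin_lock(&cpupool_lock);            /* re-acquired here */
      /* ... finish the move, relying on state set up by the first
       * half under a different acquisition of the lock ... */
      spin_unlock(&cpupool_lock);
      return 0;
  }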
</rant>

Despite the rant, it is amazing to see the ability to move running
things around through this remote continuation trick! In my (ancient)
balancer experiments I added hypervisor threads just for side-stepping
this issue.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@xxxxxxx
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany
Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551


