Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split

To: "Przywara, Andre" <Andre.Przywara@xxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Stephan Diestelhorst <stephan.diestelhorst@xxxxxxx>
Date: Wed, 2 Feb 2011 09:39:06 -0500
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
In-reply-to: <4D483599.1060807@xxxxxxx>
References: <4D41FD3A.5090506@xxxxxxx> <AANLkTi=ppBtb1nhdfbhGZa0Rt6kVyopdS3iJPr5fVA1x@xxxxxxxxxxxxxx> <4D483599.1060807@xxxxxxx>
Hi folks,
  long time no see. :-)

On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
> I asked Stephan Diestelhorst for help and after I convinced him that 
> removing credit and making SEDF the default again is not an option he 
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We 
> will work further on this tomorrow.

Andre and I have been looking into this further, in particular
sanity-checking the invariant

prv->weight >= sdom->weight * sdom->active_vcpu_count

each time someone tweaks the active vcpu count. This happens only in
__csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
to observe the broken invariant when splitting cpupools.
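
For reference, the check we dropped into the two acct functions looks
roughly like this (a sketch of our local debug instrumentation only,
not upstream code; the helper name is made up):

  /* Sanity check: csched_acct() relies on the per-pool weight sum
   * covering every active vcpu of every active domain in that pool.
   * Call with prv->lock held, right after active_vcpu_count has been
   * updated in __csched_vcpu_acct_start() or
   * __csched_vcpu_acct_stop_locked(). */
  static inline void csched_check_weight(struct csched_private *prv,
                                         struct csched_dom *sdom)
  {
      ASSERT(prv->weight >= sdom->weight * sdom->active_vcpu_count);
  }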

We have the following theory of what happens:
* some vcpus of a particular domain are currently in the process of
  being moved to the new pool

* some are still left on the old pool (vcpus_old) and some are already
  in the new pool (vcpus_new)

* we now have vcpus_old->sdom == vcpus_new->sdom and, following from this,
  * vcpus_old->sdom->weight == vcpus_new->sdom->weight
  * vcpus_old->sdom->active_vcpu_count == vcpus_new->sdom->active_vcpu_count

* active_vcpu_count thus does not reflect the actual split of the
  vcpus between the two pools (it may be the sum, or count only the
  old or only the new ones; it does not matter which)

* however, sched_old != sched_new, and thus 
  * sched_old->prv != sched_new->prv
  * sched_old->prv->weight != sched_new->prv->weight

* the prv->weight field hence sees the incremental move of VCPUs
  (through modifications in *acct_start and *acct_stop_locked)

* if at any point during this half-way migration the scheduler runs
  csched_acct, it erroneously checks its per-pool prv->weight against
  an active_vcpu_count that spans both pools
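
To put numbers on the theory (a made-up example; weights and vcpu
counts are purely illustrative, and we assume this is the only domain
with active vcpus in the old pool):

  sdom->weight            = 256
  sdom->active_vcpu_count = 4          (shared between old and new pool)
  prv_old->weight         = 2 * 256 = 512   (only two vcpus still
                                             accounted in the old pool)

  Accounting on the old pool then effectively tests

      prv->weight >= sdom->weight * sdom->active_vcpu_count
      512         >= 256 * 4 = 1024

  which is exactly the broken invariant we observe.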

Workarounds / fixes (none tried):
* disable scheduler accounting while half-way migrating a domain
  (dom->pool_migrating flag and then checking in csched_acct)
* temporarily split the sdom structures while migrating to account for
  transient split of vcpus
* synchronously disable all vcpus, migrate and then re-enable
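
A very rough sketch of the first idea (pool_migrating would be a new
flag on struct domain, set for the duration of the move; the
surrounding loop is paraphrased from memory, not a literal patch):

  /* in csched_acct(), while walking the active domains of this pool
   * with prv->lock held */
  list_for_each_safe( iter_sdom, next_sdom, &prv->active_sdom )
  {
      sdom = list_entry(iter_sdom, struct csched_dom, active_sdom_elem);

      /* Domain is half-way through a cpupool move: its
       * active_vcpu_count spans two pools, so skip it this round
       * instead of tripping over the weight invariant. */
      if ( sdom->dom->pool_migrating )
          continue;

      /* ... normal credit accounting for this domain ... */
  }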

Caveats:
* prv->lock does not guarantee mutual exclusion between the (per-pool)
  instances of the same scheduler in different pools

<rant>
The general locking policy, set against what the comments actually
document, is a nightmare. I know that we have some advanced
data-structure folks here, but intuitively reasoning about when
specific things are atomic and mutually excluded is a pain in the
scheduler / cpupool code; see the issue with the separate prv->locks
above.

E.g. cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay:
* cpupool_unassign_cpu unlocks cpupool_lock
* sets up the continuation calling cpupool_unassign_cpu_helper
* cpupool_unassign_cpu_helper locks cpupool_lock
* intuitively, one would expect both to see a consistent snapshot, so
  dropping the lock in the middle looks like a bad idea
* also communicating continuation-local state through global variables
  mandates that only a single global continuation can be pending

* reading cpu outside of the lock protection in
  cpupool_unassign_cpu_helper also smells
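
For the record, the pattern I am complaining about looks roughly like
this (heavily condensed and paraphrased from xen/common/cpupool.c;
names and details from memory, so treat it as a sketch):

  static int cpupool_unassign_cpu(struct cpupool *c, unsigned int cpu)
  {
      spin_lock(&cpupool_lock);
      /* ... sanity checks, mark the cpu as moving, stash it in a
       * global so the continuation can find it again ... */
      spin_unlock(&cpupool_lock);          /* lock dropped here */

      /* continuation runs later, on the cpu being removed */
      return continue_hypercall_on_cpu(cpu, cpupool_unassign_cpu_helper, c);
  }

  static long cpupool_unassign_cpu_helper(void *info)
  {
      /* reads the globally stashed cpu before taking the lock */
      spin_lock(&cpupool_lock);            /* re-acquired here */
      /* ... finish the move, relying on state set up by the first
       * half under a different acquisition of the lock ... */
      spin_unlock(&cpupool_lock);
      return 0;
  }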
</rant>

Despite the rant, it is amazing to see the ability to move running
things around through this remote continuation trick! In my (ancient)
balancer experiments I added hypervisor threads just for side-stepping
this issue.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@xxxxxxx
Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany
Geschaeftsfuehrer: Alberto Bozzo u. Andrew Bowd
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551


