Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
To: "Przywara, Andre" <Andre.Przywara@xxxxxxx>
Subject: Re: [Xen-devel] Hypervisor crash(!) on xl cpupool-numa-split
From: Stephan Diestelhorst <stephan.diestelhorst@xxxxxxx>
Date: Wed, 2 Feb 2011 09:39:06 -0500
Cc: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
In-reply-to: <4D483599.1060807@xxxxxxx>
References: <4D41FD3A.5090506@xxxxxxx> <AANLkTi=ppBtb1nhdfbhGZa0Rt6kVyopdS3iJPr5fVA1x@xxxxxxxxxxxxxx> <4D483599.1060807@xxxxxxx>
Hi folks,
long time no see. :-)
On Tuesday 01 February 2011 17:32:25 Andre Przywara wrote:
> I asked Stephan Diestelhorst for help and after I convinced him that
> removing credit and making SEDF the default again is not an option he
> worked together with me on that ;-) Many thanks for that!
> We haven't come to a final solution but could gather some debug data.
> I will simply dump some data here, maybe somebody has got a clue. We
> will work further on this tomorrow.
Andre and I have been looking through this further, in particular sanity
checking the invariant
prv->weight >= sdom->weight * sdom->active_vcpu_count
each time someone tweaks the active vcpu count. This happens only in
__csched_vcpu_acct_start and __csched_vcpu_acct_stop_locked. We managed
to observe the broken invariant when splitting cpupools.
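For reference, the check boils down to the following; this is a reduced,
self-contained restatement with stand-in structures (the real
csched_private / csched_dom in xen/common/sched_credit.c carry many more
fields), not the actual debug instrumentation we used:

#include <stdio.h>

/* Reduced stand-ins: prv_sketch plays csched_private (per pool),
 * sdom_sketch plays csched_dom (per domain). */
struct prv_sketch  { unsigned int weight; };
struct sdom_sketch { unsigned int weight; unsigned int active_vcpu_count; };

/* prv->weight is supposed to cover every active vcpu of every domain
 * in the pool, so for any single domain the inequality must hold.
 * We hooked a check along these lines into the two functions named
 * above. */
static void check_invariant(const struct prv_sketch *prv,
                            const struct sdom_sketch *sdom)
{
    if (prv->weight < sdom->weight * sdom->active_vcpu_count)
        printf("invariant broken: %u < %u * %u\n",
               prv->weight, sdom->weight, sdom->active_vcpu_count);
}

int main(void)
{
    /* One domain, weight 256, two active vcpus, single pool: fine. */
    struct sdom_sketch sdom = { .weight = 256, .active_vcpu_count = 2 };
    struct prv_sketch  prv  = { .weight = 512 };

    check_invariant(&prv, &sdom);
    return 0;
}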
We have the following theory of what happens:
* some vcpus of a particular domain are currently in the process of
being moved to the new pool
* some are still left on the old pool (vcpus_old) and some are already
in the new pool (vcpus_new)
* we now have vcpus_old->sdom = vcpus_new->sdom and following from this
* vcpus_old->sdom->weight = vcpus_new->sdom->weight
* vcpus_old->sdom->active_vcpu_count = vcpus_new->sdom->active_vcpu_count
* active_vcpu_count thus does not represent the separation of the
actual vcpus (it may be the sum, or only the old or the new ones; it
does not matter)
* however, sched_old != sched_new, and thus
* sched_old->prv != sched_new->prv
* sched_old->prv->weight != sched_new->prv->weight
* the prv->weight field hence sees the incremental move of VCPUs
(through modifications in *acct_start and *acct_stop_locked)
* if the scheduler runs csched_acct at any point during this half-way
migration, it checks the invariant against the wrong active_vcpu_count
(see the sketch after this list)
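The sketch below walks through that scenario with the same reduced
stand-in structures as before; the acct_start/acct_stop helpers are ours
and only approximate what __csched_vcpu_acct_start and
__csched_vcpu_acct_stop_locked do to the accounting state:

#include <stdio.h>

/* Reduced stand-ins for csched_private (per pool) and csched_dom
 * (per domain); real definitions live in xen/common/sched_credit.c. */
struct prv_sketch  { unsigned int weight; };
struct sdom_sketch { unsigned int weight; unsigned int active_vcpu_count; };

/* Rough approximation of __csched_vcpu_acct_start / _stop_locked. */
static void acct_start(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    sdom->active_vcpu_count++;
    prv->weight += sdom->weight;
}

static void acct_stop(struct prv_sketch *prv, struct sdom_sketch *sdom)
{
    sdom->active_vcpu_count--;
    prv->weight -= sdom->weight;
}

int main(void)
{
    struct prv_sketch  old_pool = { 0 }, new_pool = { 0 };
    struct sdom_sketch sdom = { .weight = 256, .active_vcpu_count = 0 };
    int i;

    /* Domain with four vcpus, all active in the old pool. */
    for (i = 0; i < 4; i++)
        acct_start(&old_pool, &sdom);

    /* Half-way through the pool migration: two vcpus already moved.
     * The single, shared sdom keeps counting all of them, while each
     * pool's prv->weight only reflects its own share. */
    for (i = 0; i < 2; i++) {
        acct_stop(&old_pool, &sdom);
        acct_start(&new_pool, &sdom);
    }

    /* csched_acct's check against weight * active_vcpu_count now
     * fails in BOTH pools: 512 < 256 * 4. */
    printf("old pool: prv->weight=%u, needed=%u\n",
           old_pool.weight, sdom.weight * sdom.active_vcpu_count);
    printf("new pool: prv->weight=%u, needed=%u\n",
           new_pool.weight, sdom.weight * sdom.active_vcpu_count);
    return 0;
}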
Workarounds / fixes (none tried):
* disable scheduler accounting while half-way migrating a domain
(a dom->pool_migrating flag, checked in csched_acct; sketched below)
* temporarily split the sdom structures while migrating to account for
transient split of vcpus
* synchronously disable all vcpus, migrate and then re-enable
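To make the first option a bit more concrete, a sketch of the shape it
could take; pool_migrating is a purely hypothetical flag and none of
this has been tried, let alone compiled against the tree:

/* Hypothetical sketch only: dom->pool_migrating does not exist. */
struct domain_sketch {
    unsigned int pool_migrating;     /* set while the domain is being
                                        moved between cpupools */
};

struct sdom_sketch {
    struct domain_sketch *dom;
    unsigned int weight;
    unsigned int active_vcpu_count;
};

/* csched_acct would skip domains whose accounting is transiently
 * inconsistent instead of tripping over the stale counts. */
static int acct_should_skip(const struct sdom_sketch *sdom)
{
    return sdom->dom->pool_migrating != 0;
}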
Caveats:
* prv->lock does not guarantee mutual exclusion between (same)
schedulers of different pools
<rant>
The general locking policy versus what the comments actually document
is a nightmare. I know that we have some advanced data-structure folks
here, but intuitively reasoning about which operations are atomic and
mutually excluded in the scheduler / cpupool code is a pain; see the
issue with the separate prv->locks above.
E.g. the cpupool_unassign_cpu and cpupool_unassign_cpu_helper interplay
(condensed sketch after the rant):
* cpupool_unassign_cpu unlocks cpupool_lock
* sets up the continuation calling cpupool_unassign_cpu_helper
* cpupool_unassign_cpu_helper locks cpupool_lock
* intuitively, one would think that both should see a consistent
snapshot, hence dropping the lock in the middle is a bad idea
* also communicating continuation-local state through global variables
mandates that only a single global continuation can be pending
* reading cpu outside of the lock protection in
cpupool_unassign_cpu_helper also smells
</rant>
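For readers without the source at hand, the flow complained about above
looks roughly like this; a heavily condensed, stand-alone paraphrase
(the real code in xen/common/cpupool.c has far more error handling, and
the stub lock and *_sketch names are placeholders):

#include <stdio.h>

/* Stub lock standing in for cpupool_lock, just to mark the regions. */
static int cpupool_lock_sketch;
static void pool_lock(void)   { cpupool_lock_sketch = 1; }
static void pool_unlock(void) { cpupool_lock_sketch = 0; }

static int moving_cpu = -1;          /* continuation state in a global:
                                        only ONE pending move possible */

static long cpupool_unassign_cpu_helper_sketch(void *unused)
{
    int cpu = moving_cpu;            /* read outside the lock: smells */

    (void)unused;
    pool_lock();                     /* re-acquired: the state seen here
                                        may differ from what the first
                                        half saw before unlocking */
    printf("helper: finish removing cpu %d\n", cpu);
    pool_unlock();
    return 0;
}

static void cpupool_unassign_cpu_sketch(int cpu)
{
    pool_lock();
    printf("start removing cpu %d from its pool\n", cpu);
    moving_cpu = cpu;
    pool_unlock();                   /* <-- lock dropped in the middle */

    /* In Xen this is a continuation via continue_hypercall_on_cpu();
     * calling the helper directly keeps the sketch self-contained. */
    cpupool_unassign_cpu_helper_sketch(NULL);
}

int main(void)
{
    cpupool_unassign_cpu_sketch(3);
    return 0;
}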
Despite the rant, it is amazing to see that running things can be moved
around through this remote continuation trick! In my (ancient) balancer
experiments I added hypervisor threads just for side-stepping this
issue...
Stephan
--
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelhorst@xxxxxxx
Tel. +49 (0)351 448 356 719
Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany
Managing directors: Alberto Bozzo and Andrew Bowd
Registered office: Dornach, municipality of Aschheim, district of Munich
Commercial register: Munich, HRB No. 43632, WEEE Reg. No. DE 12919551