Re: [Xen-devel] [Patch 0/6] xen: cpupool support

To:	George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Subject:	Re: [Xen-devel] [Patch 0/6] xen: cpupool support
From:	Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date:	Wed, 22 Apr 2009 10:19:23 +0200
Cc:	"xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Organization:	Fujitsu Technology Solutions

Hi George,

thanks for the overall positive feedback :-)

George Dunlap wrote:
> Juergen,
> 
> Thanks for doing this work.  Overall things look like they're going in
> the right direction.
> 
> However, it's a pretty big change, and I'd like to hear some more
> opinions.  So to facilitate discussion on the list, would you please
> send out another e-mail with:
> 1. A description of the motivation for this work (the basic problem
> you're trying to solve)

The basic problem we ran into was a weakness of the current credit scheduler.
We wanted to be able to pin multiple domains to a subset of the physical
processors and hoped the scheduler would still schedule the domains according
to their weights. Unfortunately this was not the case.

Our motivation to do this is the license model we are using for our BS2000 OS.
The customer buys a specific processing power for which he pays a monthly fee.
He might use multiple BS2000 domains, but the overall consumable BS2000 power
is restricted according to his license by allowing BS2000 to run only on a
subset of the physical processors. On other processors other domains are
allowed to run.

As pinning the BS2000 domains to the processor subset was not working, we
thought of two possible solutions:
- fix the credit scheduler to support our request
- introduce cpupools, so that BS2000 gets its own scheduler without
  explicit pinning of cpus.

Fixing the scheduler to support weights correctly in case of cpu pinning
seemed to be a complex task with minor benefit for others. The cpupool
approach seemed to be the better solution having more general use cases:
- solution for our problem
- better scalability of the scheduler for large cpu numbers
- potential base for NUMA systems
- support of "software partitions" with more flexibility than hardware
  partitions
- easy grouping of domains

> 2. A description overall of what cpu pools does

The idea is to have multiple pools of cpus, each pool having its own
scheduler. Each physical cpu is a member of at most one pool. A pool can have
multiple cpus assigned to it. A domain is assigned to a pool on creation and
can then run only on the physical cpus assigned to that pool.
Domains can be moved from one pool to another, cpus can be removed from or
added to a pool.
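
As an illustration only (this is not code from the patch), the membership
rules above can be modelled in a few lines of Python. The class and method
names are made up for this sketch, and the scheduler is just a string.

```python
class CpuPool:
    """Toy model of one cpupool: a set of cpus plus a scheduler name."""

    def __init__(self, pool_id, scheduler):
        self.pool_id = pool_id
        self.scheduler = scheduler  # selected once, at pool creation
        self.cpus = set()
        self.domains = set()


class System:
    """Tracks which pool every cpu and every domain belongs to."""

    def __init__(self, ncpus):
        self.cpu_pool = {cpu: None for cpu in range(ncpus)}  # cpu -> pool
        self.dom_pool = {}                                   # domain -> pool

    def add_cpu(self, pool, cpu):
        # A physical cpu is a member of at most one pool.
        if self.cpu_pool[cpu] is not None:
            raise ValueError("cpu %d already belongs to a pool" % cpu)
        self.cpu_pool[cpu] = pool
        pool.cpus.add(cpu)

    def remove_cpu(self, pool, cpu):
        if self.cpu_pool[cpu] is not pool:
            raise ValueError("cpu %d is not in this pool" % cpu)
        pool.cpus.discard(cpu)
        self.cpu_pool[cpu] = None

    def create_domain(self, dom, pool):
        # A domain is assigned to a pool on creation.
        self.dom_pool[dom] = pool
        pool.domains.add(dom)

    def move_domain(self, dom, new_pool):
        self.dom_pool[dom].domains.discard(dom)
        self.create_domain(dom, new_pool)

    def allowed_cpus(self, dom):
        # A domain may run only on the cpus of its own pool.
        return set(self.dom_pool[dom].cpus)
```

In this model, moving a domain changes its set of allowed cpus without any
per-vcpu pinning being involved.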

The scheduler of each pool is selected at pool creation. Changing the scheduling
parameters of a pool affects only the domains of this pool. Each scheduler "sees"
only the cpus of its own pool (e.g. each pool with credit scheduler has its
own master cpu, its own load balancing, ...).

On system boot Pool-0 is created as the default pool. By default all
physical processors are assigned to Pool-0; the number of cpus in Pool-0
can be reduced via a boot parameter.
Domain 0 is always assigned to Pool-0 and can't be moved to another pool.

Cpus not assigned to any pool can run only the idle domain.
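
The boot-time rules can be sketched in the same toy style. This is a model
for discussion, not the patch: the function names are invented, and two
details are assumptions of the sketch only, namely that the boot parameter
is a cpu count and that Pool-0 receives the lowest-numbered cpus.

```python
IDLE_DOMAIN = "idle"


def boot_pools(ncpus, pool0_ncpus=None):
    """Return a cpu -> pool id map after boot (None means unassigned).

    pool0_ncpus stands in for the boot parameter; by default every
    cpu goes to Pool-0 (pool id 0).
    """
    if pool0_ncpus is None:
        pool0_ncpus = ncpus
    if not 1 <= pool0_ncpus <= ncpus:
        raise ValueError("Pool-0 needs between 1 and ncpus cpus")
    # Assumption of this sketch: Pool-0 gets the lowest-numbered cpus.
    return {cpu: (0 if cpu < pool0_ncpus else None) for cpu in range(ncpus)}


def may_move_domain(domid, target_pool):
    """Domain 0 always stays in Pool-0; other domains may move."""
    return not (domid == 0 and target_pool != 0)


def may_run(domain, domain_pool, cpu, cpu_to_pool):
    """Only the idle domain may run on a cpu that is in no pool."""
    if domain == IDLE_DOMAIN:
        return True
    pool = cpu_to_pool[cpu]
    return pool is not None and pool == domain_pool
```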

There were several design decisions to make:
- Idle domain handling: either keep the current solution (1 idle domain
  with a pinned vcpu for each physical processor) or use one idle domain per
  pool. I've chosen the first variant as this solution seemed to require fewer
  changes (see discussion below for this topic).
- Use an existing hypercall for cpupool control or introduce a new one. Again
  I did not want to change too much code, so I used the domctl hypercall (other
  scheduler related stuff is handled via this hypercall, too).
- Handling of special case "continue_hypercall_on_cpu": This function is used
  to execute a domain 0 hypercall (or parts of it) on a specific physical
  processor, e.g. for microcode updates of Intel cpus. With domain 0 residing
  in Pool-0 not running on all physical processors this is a problem. I had
  either to find a general solution for this problem keeping the semantics of
  continue_hypercall_on_cpu, or to eliminate the need for this function by
  changing each case where this function is used. I preferred the general
  solution (see again discussion below).

The main functional support is in the hypervisor, of course. Here are the
main changes I've made:
- I added a cpupool.c source to handle all cpupool operations
- To be able to support multiple scheduler incarnations some static global
  variables had to be allocated from heap for each scheduler. Each physical
  processor now has a percpu cpupool pointer, and the cpupool structure
  contains the scheduler reference.
  Most changes in the scheduler are related to the elimination of the global
  variables.
- At domain creation a cpupool id has to be specified. It may be NONE for
  special domains like the idle domain.
- References to cpu_online_mask had to be replaced by the cpu-mask of the
  cpupool in some places.
- continue_hypercall_on_cpu had to be modified to be cpupool aware. See below
  for more details.
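
To make the data layout concrete, here is a rough Python model of the
relationship between processor, cpupool and scheduler instance described in
the list above. It is a sketch, not the hypervisor structures: all names
(SchedulerInstance, CpuPool, per_cpu_cpupool, assign_cpu, scheduler_for) are
hypothetical, and the "private" dict merely stands in for whatever state
used to live in static global variables.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerInstance:
    """One incarnation of a scheduler, with its own private state."""
    name: str
    # Stand-in for the former static globals: one copy per instance.
    private: dict = field(default_factory=dict)


@dataclass
class CpuPool:
    pool_id: int
    sched: SchedulerInstance               # the scheduler reference
    cpu_mask: set = field(default_factory=set)


# Stand-in for the percpu cpupool pointer: cpu number -> CpuPool.
per_cpu_cpupool = {}


def assign_cpu(cpu, pool):
    pool.cpu_mask.add(cpu)
    per_cpu_cpupool[cpu] = pool


def scheduler_for(cpu):
    """Find the scheduler via the cpu's pool; None if cpu is in no pool."""
    pool = per_cpu_cpupool.get(cpu)
    return None if pool is None else pool.sched
```

Two pools that both use the credit scheduler then hold two independent
instances, so state set in one (a master cpu, for example) is invisible to
the other.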

> 3. A description of any quirky corner cases you ran into, how you
> solved them, and why you chose the way you did

George, you've read my patch quite well! The corner cases are exactly the
topics you are mentioning below. :-)

> Here are some examples for #3 I got after spending a couple of hours
> looking at your patch:
> * The whole "cpu borrowing" thing

As mentioned above, the semantics of continue_hypercall_on_cpu are problematic
with cpupools. Without the cpupools this function pins the vcpu performing the
hypercall temporarily to the specified physical processor and removes that
pinning after the sub-function specified as a parameter has been completed.
With cpupools it is no longer possible to just pin a vcpu to any physical
processor as this processor might be out of reach for the scheduler.
First I thought it might be possible to use on_selected_cpus instead, but the
sub-functions used with continue_hypercall_on_cpu sometimes access guest
memory. It would be possible to allocate a buffer and copy the guest memory
to this buffer, of course, but this would have required changing all users of
continue_hypercall_on_cpu, which I wanted to avoid.
The solution I've chosen expands the idea of pinning the vcpu to a processor
temporarily by adding this processor to a cpupool temporarily, if necessary.
It is a little bit more complicated than the vcpu pinning, because after
completion of the sub-function on the borrowed processor this processor has to
be returned to its original cpupool. This is possible only if the vcpu
executing the hypercall is no longer running on the processor to be returned.
Things are rather easy if the borrowed processor was not assigned to a
cpupool. It can be assigned to the current pool and unassigned afterwards
quite easily.
If the processor to be borrowed is assigned to an active cpupool however,
the processor must first be unassigned from this pool. This could leave the
pool without any processor resulting in strange behaviour. As the need for
continuing a hypercall on processors outside the current cpupool seems to be
a rare event, I've chosen to suspend all domains in the cpupool from which
the processor is borrowed until the processor is returned.
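
The borrowing sequence can be summarised with a small Python model. It is
only meant to show the order of the steps; Pool and continue_on_cpu are
names invented for this sketch, suspension is reduced to a flag, and the
requirement that the calling vcpu has left the borrowed processor before it
is returned is not modelled.

```python
class Pool:
    def __init__(self, name, cpus):
        self.name = name
        self.cpus = set(cpus)
        self.suspended = False   # True while all its domains are suspended


def continue_on_cpu(cur_pool, pools, cpu, func):
    """Run func(cpu), borrowing cpu into cur_pool if it is out of reach."""
    if cpu in cur_pool.cpus:
        return func(cpu)         # plain temporary pinning is sufficient

    # Find the pool currently owning the cpu (None if it is unassigned).
    owner = next((p for p in pools if cpu in p.cpus), None)
    if owner is not None:
        owner.suspended = True   # suspend the domains of the lending pool
        owner.cpus.remove(cpu)
    cur_pool.cpus.add(cpu)       # borrow
    try:
        return func(cpu)
    finally:
        # Return the cpu. In reality this may happen only once the vcpu
        # executing the hypercall no longer runs on it (not modelled).
        cur_pool.cpus.remove(cpu)
        if owner is not None:
            owner.cpus.add(cpu)
            owner.suspended = False
```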

> * Dealing with the idle domain

I've chosen to stay with one global idle domain instead of per-cpupool idle
domains for two main reasons:
- I felt uneasy about changing a central concept of the hypervisor
- Assigning a processor to or unassigning it from a cpupool with multiple
  idle domains seemed to be more complex. Switching the scheduler on a
  processor seems to be a bad idea as long as any non-idle vcpu is running on
  that processor. If the idle vcpus are cpupool specific as well, things
  are becoming really ugly. Either you have a vcpu running outside its
  related scheduler, or the current vcpu referenced by the per-cpu pointer
  "current" is invalid for a short period of time, which is even worse.
This led to the solution of one idle domain and the idle vcpus changing
between schedulers.
Generally the idle domain plays a critical role whenever a processor is
assigned to or unassigned from a cpupool. The critical operation is changing
from one scheduler to another. At this time only an idle vcpu is allowed to
be active on the processor.
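
The rule for the critical operation can be written down as a short Python
sketch. Vcpu, Pcpu and switch_scheduler are hypothetical names used only
here, and schedulers are represented by plain strings.

```python
class Vcpu:
    def __init__(self, name, is_idle=False):
        self.name = name
        self.is_idle = is_idle
        self.scheduler = None


class Pcpu:
    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        # One idle vcpu of the single global idle domain per processor.
        self.idle_vcpu = Vcpu("idle.%d" % cpu_id, is_idle=True)
        self.current = self.idle_vcpu
        self.scheduler = None


def switch_scheduler(pcpu, new_sched):
    """Change the scheduler of a processor; legal only while it idles."""
    if not pcpu.current.is_idle:
        raise RuntimeError("non-idle vcpu active on cpu %d" % pcpu.cpu_id)
    pcpu.scheduler = new_sched
    # The idle vcpu follows its processor to the new scheduler, so
    # "current" stays valid during the whole switch.
    pcpu.idle_vcpu.scheduler = new_sched
```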

> * Why you expose allocating and freeing of vcpu and pcpu data in
> the sched_ops structure

This is related to the supported operations on cpupools.
Switching a processor between cpupools requires changing the scheduler
responsible for this processor. And this requires a change of the pcpu
scheduler data. Without an interface for allocating/freeing pcpu
scheduler specific data it would be impossible to switch schedulers.
The same applies to the vcpu scheduler data, but this is related to
moving a domain from one cpupool to another. Again the scheduler has to
be changed, but this time for all the vcpus of the moved domain.
Without the capability to move a domain to another cpupool allocating
and freeing vcpu data would still be necessary for switching processors
(the idle vcpu of the switched processor is changing the scheduler as well),
but it would not have to be exposed to sched_ops.
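
A toy Python version of such an interface may help the discussion. The hook
names (alloc_pdata, free_pdata, alloc_vdata, free_vdata) and the helpers
switch_cpu and move_domain are placeholders chosen for this sketch and are
not claimed to match the patch; the "live" counter exists only to make
allocations visible.

```python
class ToySched:
    """Scheduler exposing alloc/free hooks for pcpu and vcpu data."""

    def __init__(self, name):
        self.name = name
        self.live = 0            # number of outstanding allocations

    def alloc_pdata(self, cpu):
        self.live += 1
        return {"owner": self.name, "cpu": cpu}

    def free_pdata(self, pdata):
        self.live -= 1

    def alloc_vdata(self, vcpu):
        self.live += 1
        return {"owner": self.name, "vcpu": vcpu}

    def free_vdata(self, vdata):
        self.live -= 1


def switch_cpu(cpu, pdata, old_ops, new_ops):
    """Hand a processor to another scheduler by swapping its pcpu data."""
    new = new_ops.alloc_pdata(cpu)   # allocate before giving up the old
    old = pdata.get(cpu)
    pdata[cpu] = new
    if old is not None:
        old_ops.free_pdata(old)


def move_domain(vcpus, vdata, old_ops, new_ops):
    """Move a domain between pools: swap vcpu data of all its vcpus."""
    fresh = {v: new_ops.alloc_vdata(v) for v in vcpus}
    for v in vcpus:
        old = vdata.get(v)
        vdata[v] = fresh[v]
        if old is not None:
            old_ops.free_vdata(old)
```

Allocating the new data before freeing the old is a choice of this sketch:
a failed allocation would then leave the previous state untouched.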

> 
> Some of these I'd like people to be able to discuss who don't have the time
> / inclination to spend looking at the patch (which could use a lot
> more comments).
> 
> As for me: I'm happy with the general idea of the patch (putting cpu
> pools in underneath the scheduler, and allowing pools to have
> different schedulers).  I think this is a good orthogonal to the new
> scheduler.  I'm not too keen on the whole "cpu borrowing" thing; it
> seems like there should be a cleaner solution to the problem.  Overall
> the patches need more comments.  I have some coding specific comments,
> but I'll save those until the high-level things have been discussed.

Thanks again for the feedback!

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
