This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.

To: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
Subject: Re: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
From: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Date: Thu, 16 Apr 2009 11:27:57 +0100
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Thu, 16 Apr 2009 03:28:29 -0700
In-reply-to: <0A882F4D99BBF6449D58E61AAFD7EDD61042528F@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <de76405a0904090858g145f07cja3bd7ccbd6b30ce9@xxxxxxxxxxxxxx> <49DE415F.3060002@xxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61036A60D@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <49DF708F.6070102@xxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61036AB17@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <de76405a0904150856p2419ad7ndfbbb26d0fbb2c6b@xxxxxxxxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61042528F@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
2009/4/16 Tian, Kevin <kevin.tian@xxxxxxxxx>:
> Could you elaborate more about what's being simplified compared to
> generic gang scheduling? I used to be scared by complexity to have
> multiple vcpus sync in and sync out, especially with other
> heterogeneous VMs (w/o gang scheduling requirement). It's possibly
> simpler if all VMs in that system are hyper-pair based... :-)

(I've only done some thought experiments re gang scheduling, so take
that into account as you read this description.)

The general gang-scheduling problem is this: you have P processors and
N vms, each of which has up to P vcpus.  Each vm v may have a
different number of vcpus (v.vcpu <= P), and at any one time a
different number of those may be runnable (v.runnable <= v.vcpu <=
P); this can change at any time.  The goal is to maximize the number
of processors doing useful work (and thus maximize throughput).
If you have a VM with 4 vcpus, but it's only using 2, you have a
choice: do I run it on 2 cores, and let another VM use the other 2
cores?  Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
wake up they can run immediately?  If you do the former, then when one
of the other vcpus wakes up you have to quickly preempt someone else;
if you do the latter, you risk leaving two cores idle for the entire
timeslice, effectively throwing away the processing power.  The
general problem is likely NP-complete, and really obnoxious even to
find good heuristics for.

In the case of HT gang-scheduling, the problem is significantly constrained:
* The pairs of processors to look at are constrained: each logical cpu
has a pre-defined sibling.
* The size of the gang we're trying to schedule is constrained: only
2; and each gang-scheduled vcpu has its pre-defined HT sibling as
well.
* If only one sibling of a HT pair is active, the other one isn't
wasted; the active thread will get more processing power.  So we don't
have that risky choice.
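To make the first constraint concrete, here's a tiny sketch (not Xen code; the helper name and the adjacent-id assumption are mine): under the common enumeration where HT siblings get adjacent logical cpu ids (0/1, 2/3, ...), the pre-defined sibling is just the id with the low bit flipped.  A real scheduler would read this from the cpu topology rather than assume it.

```c
#include <assert.h>

/* Hypothetical helper, assuming siblings have adjacent logical ids.
 * Real code should consult the topology reported by the hardware. */
static inline int ht_sibling(int cpu)
{
    return cpu ^ 1;   /* 0<->1, 2<->3, 4<->5, ... */
}
```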

So there are really only a handful of new cases to consider:
* A HT vcpu pair wakes up or comes to the front of the queue when
another HT vcpu pair is running.
 + Simple: order normally.  If this pair is a higher priority
(whatever that means) than the running pair, preempt the running pair
(i.e., preempt the vcpus on both logical cpus).
* A non-HT vcpu becomes runnable (or comes to the front of the
runqueue) when a HT vcpu pair is on a pair of threads
 + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
is finished.
 + If the non-HT vcpu priority is higher, we need to decide whether to
wait longer or whether to preempt both threads.  This may depend on
whether there are other non-HT vcpus waiting to run, and what their
priorities are.
* A HT vcpu pair becomes runnable when non-HT vcpus are running on the threads
 + Deciding whether to wait or preempt both threads will depend on the
relative weights of both.
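The common thread in the cases above is that a higher-priority entity arriving at a pair of threads has to preempt both of them or wait.  A toy sketch (the names and the scalar "prio" are invented for illustration, not taken from any Xen scheduler):

```c
#include <assert.h>

enum action { WAIT, PREEMPT_BOTH };

/* Illustrative only: decide what an entity waking up does when it
 * finds the HT pair of threads occupied. */
static enum action on_wakeup(int incoming_prio, int running_prio)
{
    /* Lower or equal priority: queue normally and wait. */
    if (incoming_prio <= running_prio)
        return WAIT;
    /* Higher priority: taking the cpus means preempting both logical
     * threads, whether the incoming entity is a pair or a single vcpu
     * displacing a pair.  (Whether it's *worth* doing may also depend
     * on what else is waiting, which this sketch doesn't model.) */
    return PREEMPT_BOTH;
}
```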

These decisions are a lot easier to deal with than the full
gang-scheduling problem (AFAICT).  One can imagine, for instance, that
each HT pair of logical cpus could share one runqueue.  A vcpu HT pair
would be put on the runqueue as a single entity.  When it reached the
top of the queue, both threads would preempt whatever they were
running and each execute one vcpu of the pair (idling if that vcpu was
idle).  Otherwise, each thread would take a single non-pair vcpu and
execute it.
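A rough model of that per-pair runqueue idea (purely illustrative; the struct and function names are invented): a pair entry claims both threads at once, idling a thread whose vcpu is idle, while a single vcpu claims one thread and leaves the sibling free to dequeue the next entry.

```c
#include <assert.h>

#define NO_VCPU (-1)

/* One entry on the runqueue shared by an HT sibling pair. */
struct rq_entry {
    int is_pair;
    int vcpu0, vcpu1;   /* vcpu ids, or NO_VCPU for idle */
};

/* Decide what each thread runs when 'head' reaches the queue front. */
static void pick(const struct rq_entry *head, int *thread0, int *thread1)
{
    if (head->is_pair) {
        *thread0 = head->vcpu0;   /* both threads switch together    */
        *thread1 = head->vcpu1;   /* may be NO_VCPU -> sibling idles */
    } else {
        *thread0 = head->vcpu0;   /* one thread takes the single vcpu */
        *thread1 = NO_VCPU;       /* sibling dequeues the next entry  */
    }
}
```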

At any rate, that's a step further down the road.  First we need to
address basic credits, latency, and load-balancing. (Implementation
e-mails to come in a bit.)

