This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and inter

To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Subject: RE: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
From: "Tian, Kevin" <kevin.tian@xxxxxxxxx>
Date: Fri, 17 Apr 2009 18:02:40 +0800
Accept-language: en-US
Acceptlanguage: en-US
Cc: Jeremy Fitzhardinge <jeremy@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Fri, 17 Apr 2009 03:05:32 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <de76405a0904160327h29d138d1pfe6e4c2b4f4eaaac@xxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <de76405a0904090858g145f07cja3bd7ccbd6b30ce9@xxxxxxxxxxxxxx> <49DE415F.3060002@xxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61036A60D@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <49DF708F.6070102@xxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61036AB17@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <de76405a0904150856p2419ad7ndfbbb26d0fbb2c6b@xxxxxxxxxxxxxx> <0A882F4D99BBF6449D58E61AAFD7EDD61042528F@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <de76405a0904160327h29d138d1pfe6e4c2b4f4eaaac@xxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: Acm+fgYyTSSbJs4JSLG8Y1hd3lZ78wAuuQJQ
Thread-topic: [Xen-devel] [RFC] Scheduler work, part 1: High-level goals and interface.
>From: George Dunlap
>Sent: 2009年4月16日 18:28
>2009/4/16 Tian, Kevin <kevin.tian@xxxxxxxxx>:
>> Could you elaborate more about what's being simplified compared to
>> generic gang scheduling? I used to be scared by complexity to have
>> multiple vcpus sync in and sync out, especially with other
>> heterogeneous VMs (w/o gang scheduling requirement). It's possibly
>> simpler if all VMs in that system are hyper-pair based... :-)
>(I've only done some thought experiments re gang scheduling, so take
>that into account as you read this description.)

Thanks for writing this down.

>The general gang scheduling problem is that you have P processors, and
>N vms, each of which have up to M processors.  Each vm may have a
>different number of processors N.vcpu < P, and at any one time there
>may be a different number of processors runnable N.runnable < N.vcpu <
>P; this may change at any time.  The general problem is to maximize
>the number of used processors P (and thus maximize the throughput).
>If you have a VM with 4 vcpus, but it's only using 2, you have a
>choice: do I run it on 2 cores, and let another VM use the other 2
>cores?  Or do I "reserve" 4 cores for it, so that if the other 2 vcpus
>wake up they can run immediately?  If you do the former, then if one
>of the other vcpus wakes up you have to quickly preempt someone else;
>if not, you risk leaving the two cores idle for the entire timeslice,
>effectively throwing away the processing power.  The whole problem is
>likely to be NP-complete, and really obnoxious to have good heuristics

This is related to the definition of gang scheduling. In my memory
gang scheduling comes from parallel computing requirement where
massive inter-threads communications/synchronizations exists and
one thread in blocking also impacts the rest. That normally requires
a 'gang' concept that once one thread is to be scheduled, all other
threads in same group are schdeduled in all together; on the other 
hand context switches are minimized to keep threads always in 
ready state. This idea is very application specific and thus normally
not find way into generic market.

Based on your whole write-up, you don't consider how to ensure 
vcpus of one VM scheduled in sync at given time point. Instead,
your model is to ensure exclusive usage on a thread-pair within
a given quantum window. It's nothing related to 'gang scheduling'
but it's OK for us to call it 'simplified gang scheduling' in this context
since the keypoint in this discussion is whether we can do some
optimization on HT, not strictly limited to gang scheduling itself. :-)

>In the case of HT gang-scheduling, the problem is 
>significantly constrained:
>* The pairs of processors to look at is constrianed: each logical cpu
>has a pre-defined sibling.
>* The quantity we're trying to gang-schedule is significantly
>constrained: only 2; and each gang-scheduled vcpu has its pre-defined
>HT sibling as well.

Then I assume only VMs with <=2 vcpus are considered here

>* If only one sibling of a HT pair is active, the other one isn't
>wasted; the active thread will get more processing power.  So we don't
>have that risky choice.

May or may not. Be in mind that why HT is introduced is because one
thread normally includes lots of stall cycles and thus stalled cycles
can be utilized to drive another thread. whether more processing power
is gained is very workload specific.

>So there are really only a handful of new cases to consider:
>* A HT vcpu pair wakes up or comes to the front of the queue when
>another HT vcpu pair is running.
> + Simple: order normally.  If this pair is a higher priority
>(whatever that means) than the running pair, preempt the running pair
>(i.e., preempt the vcpus on both logical cpus).

"vcpu pair"? your below descriptions are all based on vcpu pair. How do
you define the status (blocked, runnable, running) of a pair? If it's AND
of individual status of two vcpus, it's too restrictive to have a runnable
status. If it's OR operation, you may then need consider how to define
a sane priorioty (is one-vcpu-ready-the-other-block with higher credits
considered higher priority than two-vcpu-ready with lower credits? etc.)

>* A non-HT vcpu becomes runnable (or comes to the front of the
>runqueue) when a HT vcpu pair is on a pair of threads
> + If the non-HT vcpu priority is lower, wait until the HT vcpu pair
>is finished.
> + If the non-HT vcpu priority is higher, we need to decide whether to
>wait longer or whether to preempt both threads.  This may depends on
>whether there are other non-HT vcpus waiting to run, and what their
>priority is.
>* A HT vcpu pair becomes runnable when non-HT vcpus are 
>running on the threads
> + Deciding whether to wait or preempt both threads will depend on the
>relative weights of both.

Above all looks adding much complexity to a common single-vcpu-based

>These decisions are a lot easier to deal with than the full
>gang-scheduling problem (AFAICT).  One can imagine, for instance, that
>each HT pair could share one runqueue.  A vcpu HT pair would be put on
>the runqueue as an individual entity.  When it reached the top of the
>queue, both threads would preempt what was on them and run the pair of
>threads (or idle if the vcpu was idle).  Otherwise, each thread would
>take a single non-pair vcpu and execute it.

That may bring undesired effect. The more cross-cpu talks you add
to scheduler, less efficient could scheduler be due to locks on both

>At any rate, that's a step further down the road.  First we need to
>address basic credits, latency, and load-balancing. (Implementation
>e-mails to come in a bit.)

Yup. First implementation should take commonness, simpleness
and overall efficiency as major goals.

Last, any effort toward better HT efficiency is welcomed to me. However
we need be careful enough to avoid introducing unnecessary complexity
into scheduler. I'm educated that the more complex (or flexible options)
the scheduler is, more unexpected side-effects may occur in other areas.
By far I'm not convinced that above idea could lead to a more fair system
either to HT vcpu pairs or to non-HT vcpus. I agree with Ian's idea that
topology is exposed in partition case where scheduler doesn't require
change, and for general case we just need HT accounting to ensure
fairness which is simple enhancement. :-)

Xen-devel mailing list
<Prev in Thread] Current Thread [Next in Thread>