Sorry that this mail has been so long in coming. I'm still trying to
work out kinks in a new algorithm, but I think it's important to keep
people up-to-date on where I am now.
This is actually rather long. I'd most like feedback on
the last part; but I think a lot of people will wonder why I'm trying
something totally new instead of trying to fix what's already there.
So if you only have a little time to give, please skip to the new
description. However, if you have more time, and are interested in a
more thorough description of the problem, and/or are prone to wonder
about ways of solving the problem that don't require a complete
re-write, read the whole thing.
The sections are:
* Describing what an ideal scheduler would do
* Describing at a high level what credit1 did, and what effects that
had on latency-sensitive workloads
* [future email] Describing a new proposal for how to account for
credits, and asking for input / suggestions / criticism.
* WWISD? (What Would an Ideal Scheduler Do?)
First, let's review our overall goal: When VMs are competing for cpu
time, they should get time in proportion to their weight.
This isn't too hard in the case of cpu-bound processes, which will
consume as much cpu as is given to them. But many workloads, even if
given an entire core to themselves, would only use some percentage of
that. (This is probably why the workload was virtualized in the first
place.) So we want to give VMs either their "fair share" or whatever
they're happy with, whichever is less. And we want to divide the
remaining portion by weight among VMs that do want more.
For a concrete example, suppose that we have a box with 2 cores, and 4
single-vcpu VMs with the following weights:
a: 35
b: 35
c: 65
d: 65
And suppose that each VM, if run by itself, would consume the
following percentages of the cpu:
a: 50%
b: 40%
c: 90%
d: 40%
Then an ideal scheduler would give the following percentages when they
are all run together on a 2-core system:
a: 42%
b: 40%
c: 78%
d: 40%
I leave it as an exercise to the reader to work out how I got these
numbers. But note that b and d both got the amount of cpu time they
wanted (40%), and the remaining 120% (200% total minus the 80% that b
and d use) is split, according to weight, between a and c
(i.e., 35/65 == 42/78).
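(For the curious, here is a quick throwaway C program, nothing to do
with the actual scheduler code, that reproduces those numbers by
repeatedly splitting whatever cpu is left over among the
still-unsatisfied VMs according to weight, capping each VM at what it
actually wants:

#include <stdio.h>

#define NVMS 4

int main(void)
{
    /* Weights and stand-alone demands from the example above */
    const char *name[NVMS] = { "a", "b", "c", "d" };
    double weight[NVMS]    = { 35, 35, 65, 65 };
    double want[NVMS]      = { 50, 40, 90, 40 };  /* % of a cpu */
    double give[NVMS]      = { 0 };
    int    capped[NVMS]    = { 0 };
    double left = 200.0;                          /* two cores */
    int i, again = 1;

    while (again && left > 1e-9)
    {
        double wsum = 0, used = 0;

        again = 0;
        for (i = 0; i < NVMS; i++)
            if (!capped[i])
                wsum += weight[i];
        if (wsum == 0)
            break;

        for (i = 0; i < NVMS; i++)
        {
            double share;

            if (capped[i])
                continue;
            share = left * weight[i] / wsum;
            if (give[i] + share >= want[i])
            {
                /* VM is happy with less than its weighted share:
                 * cap it and redistribute the surplus next pass. */
                share = want[i] - give[i];
                capped[i] = 1;
                again = 1;
            }
            give[i] += share;
            used += share;
        }
        left -= used;
    }

    for (i = 0; i < NVMS; i++)
        printf("%s: %.0f%%\n", name[i], give[i]);
    return 0;
}

Running it prints a: 42%, b: 40%, c: 78%, d: 40%.)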
We don't need our scheduler to match this "ideal" exactly, but it
should be within a few percentage points.
Our ideal scheduler would also allow latency-sensitive VMs, which will
only run for a short time and then halt again, to run before
longer-running cpu-intensive VMs. So if b is a network-bound
workload, and d is a video-playing workload, we want them to be
scheduled as soon as they become runnable (or very shortly
afterwards). VM 'a' may also be a network-bound workload if run on an
idle system. In this case, we don't want to run it whenever it wants
to run (since we only want it to get 42% of the cpu), but we want to
schedule it in such a way that it can still utilize all its 42%, which
probably means letting it run soon after it becomes runnable a decent
amount of the time.
(Comments on my portrayal of the ideal scheduler are welcome.)
* Credit1, and why it doesn't work well for video
At a basic level, credit1 has two main priorities: UNDER and OVER.
Credits are handed out every 30ms according to weight. VMs that have
positive credits are classed as UNDER priority, and those with
negative credits are classed as OVER.
Credit1 has a strong probabilistic element. Every 10ms a "tick" timer
fires. When the tick happens, a full 10ms of credit is subtracted
from the currently running VM. It relies on an element of randomness
to spread this charge appropriately over all active VMs.
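In rough pseudo-C, the tick path amounts to something like the
following (a simplified sketch of my own; the names and the credit
unit are made up for illustration, not the actual identifiers in
sched_credit.c):

#define TICK_MS         10
#define CREDITS_PER_MS 100      /* arbitrary unit, for illustration */

enum prio { PRIO_BOOST, PRIO_UNDER, PRIO_OVER };

struct vcpu_acct {
    int credit;
    enum prio prio;
    int active;
};

/* Fires every TICK_MS on each physical cpu; 'curr' is whichever vcpu
 * happens to be running when the timer goes off. */
static void tick(struct vcpu_acct *curr)
{
    /* Charge a full tick, however long the vcpu actually ran. */
    curr->credit -= TICK_MS * CREDITS_PER_MS;
    curr->prio = (curr->credit > 0) ? PRIO_UNDER : PRIO_OVER;
}

(I'm glossing over exactly where the real code recomputes the
priority; for this discussion all that matters is that positive
credit means UNDER and negative credit means OVER.)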
However, if we divide credits over all VMs, we have the following
problem: if some VMs are idle, the "active" VMs will spend most of
their time in the OVER state, while the mostly idle VMs will spend
all of their time in the UNDER state, accumulating more and more
credits. So credit1 attempts to
distinguish between two kinds of workloads:
* "active" workloads, which are competing for CPU, and need to have
credit accounting done.
* "inactive" workloads, which use minimal cpu, and need no credit
accounting.
The rules for determining whether a VM is "active" or "inactive" are
as follows:
* If a VM accumulates more than 30ms worth of credit, it is marked as
"inactive". Its credits are taken away and it is set to "UNDER"
priority.
* If a VM is interrupted by a tick, it is marked as active. It is not
given any credits immediately, but will be given a share at the next
accounting period (which happens every 30ms). Again, this relies on
an element of probability: the more often a VM runs, the higher the
probability that it will be interrupted by a tick and marked "active".
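Continuing the same toy types from the sketch above, the
active/inactive transitions boil down to roughly this (again my
reconstruction, not the real code):

#define CREDIT_CAP (30 * CREDITS_PER_MS)   /* ~30ms worth of credit */

/* Called from the tick when a vcpu is caught running. */
static void tick_mark_active(struct vcpu_acct *v)
{
    /* No credit handed out now; the vcpu gets a weighted share at
     * the next 30ms accounting pass. */
    v->active = 1;
}

/* Called from the 30ms accounting pass, for each vcpu. */
static void acct(struct vcpu_acct *v, int weighted_share)
{
    if (v->active)
        v->credit += weighted_share;

    if (v->credit > CREDIT_CAP)
    {
        /* Accumulated more than 30ms of credit: stop doing
         * accounting for this vcpu. */
        v->active = 0;
        v->credit = 0;
        v->prio   = PRIO_UNDER;
    }
}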
In addition to the "active/inactive" distinction, credit1 has a BOOST
priority meant for "interactive" processes, which wake up and block on
a regular basis. BOOST is the highest priority, so any VM running in
BOOST will run before other VMs without having to wait on the runqueue
behind VMs with UNDER priority. The conditions for entering and
leaving BOOST are as follows:
* If a VM wakes up from being blocked and is in the UNDER state
(either active or inactive), it is marked BOOST priority.
* If a VM is interrupted by a tick, it loses BOOST.
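Or, in the same toy pseudo-C:

/* Wakeup path: a vcpu that wakes from blocking while UNDER gets to
 * jump the queue by running at BOOST. */
static void on_wake(struct vcpu_acct *v)
{
    if (v->prio == PRIO_UNDER)
        v->prio = PRIO_BOOST;
}

/* Tick path: getting caught running by the 10ms tick ends the boost. */
static void on_tick_deboost(struct vcpu_acct *v)
{
    if (v->prio == PRIO_BOOST)
        v->prio = PRIO_UNDER;
}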
(I was not involved in the design of credit1; all of this is inferred
from the code itself, combined with my recent experience trying to
approximate the ideal scheduler.)
A couple of observations about this algorithm. It really seems to be
designed for a relatively small number of VMs that are cpu
"burners". If a VM is cpu-intensive, uses up all of its credits, and
its workload isn't real-time or latency-sensitive, then this algorithm
works great. Each cpu-intensive VM will get overall an amount of time
proportional to its weight.
However, if a VM uses some percentage of the cpu between 0 and its
"fair share", this algorithm doesn't work that well. Every cpu that
doesn't use up its credits on a regular basis will eventually get hit
by a "tick", which means it will be set to "active" with no credits,
causing it to go 10ms OVER credit. It will then dig its way out of
OVER, and gradually accumulate credits until it has enough to be
considered "inactive", starting the cycle again. The amount of time
spent in "active" vs "inactive" is probabilistic and depends on how
much the VM runs. So it's worst for workloads like network or video,
which consume non-negligible cpu time, but not as much as they could.
The BOOST mechanism does allow latency-sensitive VMs to run
immediately during the "inactive" phase of this cycle. But during the
"active" phase, the VM starts out digging its way out of OVER; it
runs after all UNDER VMs instead of before them. This introduces long
latencies and interferes with the workload significantly during this
phase.
Furthermore, even cpu-hog HVM VMs block on qemu on a regular basis,
and so (at least some percentage of the time) get marked as BOOST and
run for up to 10ms (until the next tick), blocking latency-sensitive
VMs which are also in BOOST.
Finally, when latency-sensitive workloads are using more than their
"fair share" and need to be accounted for, the rough granularity of
accounting (being debited 10ms of credit at random intervals) works against
these workloads: VMs will spend long periods in UNDER (and thus also
in BOOST), but will then spend long periods in OVER receiving little
cpu time. This is even more so if there are several competing
"latency-sensitive" workloads that wake up often, because at any time
some random subset will be in BOOST. This makes performance highly
unpredictable and erratic.
This is a summary of observed problems with the current credit system.
The next e-mail will cover some ways we might try to modify the
current system which fail for various reasons, and a proposal for a
new way of doing things.
I'd like to comment briefly on the Boost Credit Scheduler, posted
earlier this year by Naoki Nishiguchi. The Boost scheduler was based
on the current credit scheduler, but added a "boost reservation"
feature. It allowed an administrator to assign a VM a guaranteed
minimum "reservation" of cpu time. While a VM was running underneath
that minimum, it ran at BOOST priority. Multiple competing VMs
running at BOOST priority were interleaved at a 1ms granularity to
ensure low latency.
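In the same sort of sketch (my reconstruction of the idea, not
Naoki's actual patch; the names and the 30ms period are just for
illustration), the reservation check boils down to something like:

#define RESV_PERIOD_MS  30      /* accounting period, for illustration */
#define BOOST_SLICE_MS   1      /* BOOST vcpus round-robin at 1ms */

struct resv_acct {
    int reserved_pct;           /* admin-specified minimum, e.g. 20 */
    int used_ms;                /* cpu time consumed this period */
};

/* A vcpu still below its reserved share of the period runs at BOOST. */
static int below_reservation(const struct resv_acct *v)
{
    return v->used_ms * 100 < v->reserved_pct * RESV_PERIOD_MS;
}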
The main downside of this approach is that it requires the user /
administrator to specify ahead of time how much a given VM will need
to do its job appropriately. If you give the VM a 20% reservation,
and the video decoder needs 25% for some reason, you're likely to drop
frames again. Furthermore, if your VM has a rogue process that starts
burning cpu, you may eat through your 20% "boost" period doing useless
spinning, and the rest of the time you'll be back where you started
(going into the normal over/under cycle).
I think the reservation is a useful feature for highly over-committed
boxes, where even VMs with high weight may be cpu-starved. But I
think that we should be able to make most things "just work" by
default on boxes where there's plenty of cpu time to go around,
without resorting to this kind of user intervention.
(As always, comments welcome.)