I've been working on upstreaming some patches in our XenServer
patchqueue relating to credit1. One set of patches deals with an
issue I call "concurrency hazard" (which I shall define below). As
part of the upstreaming process, I did extensive testing to determine
which of the patches helped, and how much. I've decided to upstream
one of the patches and shelve the other two. I'm writing this e-mail
to explain the nature of the original problem, the results of my
testing, and why I chose the one I did, so that community members can
make informed review decisions.
So, let's start with the basics. What is concurrency hazard?
Operating systems typically use "spinning" primitives to do
synchronization between processors. Spinlocks are the most common
example, but there are variations on those (e.g., queued/ticketed
spinlocks), as well as other techniques, such as sending an IPI to
another processor asking it to do something, and spinning waiting for
it to be done.
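To make the kind of primitive we're talking about concrete, here is a
minimal test-and-set spinlock using C11 atomics. This is purely
illustrative -- it is not Xen's or Windows' actual implementation:

    #include <stdatomic.h>

    /* Minimal test-and-set spinlock, for illustration only. */
    typedef struct {
        atomic_flag locked;
    } spinlock_t;

    static inline void spin_lock(spinlock_t *l)
    {
        /* Busy-wait until the holder releases the lock.  If the
         * holder's (v)cpu isn't actually running, this loop burns
         * cycles without making any progress. */
        while ( atomic_flag_test_and_set_explicit(&l->locked,
                                                  memory_order_acquire) )
            ;
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }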
When moving to a virtualized environment, not all vcpus of a VM are
scheduled all of the time. It's possible for a vcpu to grab a spinlock
and immediately be preempted, for example. In that case, if another
vcpu attempts to grab the lock, it may spend its whole timeslice (up to
30ms in the default scheduler) just spinning, waiting for the lock.
This is not only time that could be used by another VM for useful
processing; in our analyses, it's not uncommon for the vcpu holding the
lock to be on the same cpu as, but at a lower priority than, the vcpu
trying to grab the lock. In that case, the time spent spinning actively
prevents the vcpu holding the lock from doing what it needs to do and
releasing the lock.
To give you an idea of the potential problem, consider the following
setup:
* 4-core box
* Several VMs, each with 4 vcpus, running Windows
* Install the Windows DDK, and build the example drivers, timing how
long it takes.
This "ddk-build" test is similar to the Linux kernbench (or
kernel-build) test: it involves a lot of disk I/O, file creation and
destruction, and process creation and destruction, involving a number
of locks.
One measure is the average time to complete a build. Since this alone
doesn't give you an idea of overall system performance, I also
calculate a "throughput" metric. The throughput of an individual VM is
the somewhat arbitrary measure of "number of ddk builds per 100,000
seconds".
The advantage of this is that we can sum the throughput of individual
VMs, and see what the total throughput of the system is as we increase
the number of VMs. In an ideal system, the throughput of 1 VM maxing
out at least one resource (either cpu or disk) would be equal to the
throughput of 2 VMs both maxing out the same physical resource.
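As a purely illustrative aside, the aggregate throughput is just the
sum of 100000/t over the individual build times; something like the
following (the times used here are hypothetical):

    #include <stdio.h>

    /* Sum of "builds per 100,000 seconds" over a set of per-VM build
     * times; illustration only. */
    static double aggregate_tput(const double *times, int n)
    {
        double tput = 0.0;
        int i;

        for ( i = 0; i < n; i++ )
            tput += 100000.0 / times[i];

        return tput;
    }

    int main(void)
    {
        double times[] = { 190.0, 190.0 }; /* hypothetical 2-VM times */

        printf("aggregate tput = %.0f\n", aggregate_tput(times, 2));
        return 0;
    }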
If only a single VM is running, and the disk cache is warm, the time
to build on my system averaged 71 seconds -- an aggregate throughput
of 1402 (==100000/71). In an ideal system, if we add a second VM,
we'd expect the average build time to be no more than about 140
seconds. However, what we actually get is 190 seconds, and the
aggregate throughput drops to 1049 (==(100000/t1)+(100000/t2), where
t1 and t2 are the times of the individual runs). Things continue to
get worse as we add VMs:
Base (no patches)
VMs | Ave time | Tput
  1 |      71s | 1402
  2 |     190s | 1049
  3 |     371s |  805
  4 |     559s |  665
(For completeness, the full setup for this and the other tests:
* Each VM has Windows Server 2008 R2, 1G of RAM on a 32G box
* Before starting testing, 8 VMs are booted. Tests are run with the
specified number of VMs, with the rest of them idling.
* Each test is run 6 times. The first time is to warm up the guest's
disk cache; the result of this run is discarded. The remaining 5 runs
are used to calculate the average throughput / run time reported here.)
Previous analysis had shown that the VMs spent an increasingly
significant chunk of their time in spinlock routines -- in other
words, concurrency hazard.
To address this issue, we pursued two lines. The first line was to
try pure scheduling changes, with no changes to the guest. The main
point of these was to try to keep vcpus from the same VM off of the
same physical processor. Our analysis showed that it was not uncommon
for two vcpus from the same VM to end up on the same processor. In
that case, it was a common situation for v0 to grab a spinlock, then
be preempted by v1, which would then try to grab the spinlock. v1
would then spin for a full timeslice waiting for the lock, whereas if
it had just blocked and let v0 run, it could have gotten the lock in
short order.
Two patches we came up with were:
* "pick": Add knowledge of domains to cpu_pick(), so that cpu_pick()
would try to avoid putting vcpus from the same VM on the same
processor if possible.
* "push": In the course of chaotic load balancing, vcpus often end up
on the same cpu anyway. Periodically go through and "push" vcpus
which have become co-located to different cpus.
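To make the "pick" idea a little more concrete, here is a rough sketch
of the kind of check involved. This is not the actual patch: the types
and the runs_same_domain() helper below are simplified stand-ins rather
than Xen's real interfaces, and the real code would fall back to the
normal placement logic rather than to pcpu 0.

    #define NR_PCPUS 4

    typedef struct { int domain_id; } vcpu_t;

    /* Hypothetical helper: is another vcpu of this domain currently
     * running on pcpu 'cpu'? */
    extern int runs_same_domain(int cpu, int domain_id);

    static int pick_cpu_avoiding_siblings(const vcpu_t *v)
    {
        int cpu;

        /* Prefer a pcpu that isn't already running a vcpu of the same
         * domain, so a guest's lock holder and lock spinner don't end
         * up time-sharing a single core. */
        for ( cpu = 0; cpu < NR_PCPUS; cpu++ )
            if ( !runs_same_domain(cpu, v->domain_id) )
                return cpu;

        /* Every pcpu already runs a sibling vcpu; fall back. */
        return 0;
    }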
The second line of pursuit was paravirtualization. For Windows 2003 /
XP, we patched the spinlock routines to spin for a fixed number of
cycles, then make a SCHEDOP_yield hypercall. The Viridian extensions
include a hypervisor callback for spinlocks that spin too long; we
re-directed this callback to SCHEDOP_yield as well.
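Schematically, the guest-side change looks something like the sketch
below. This is not the actual Windows patch: the lock representation
and spin threshold are illustrative, and I'm using the Linux-style PV
hypercall spelling, HYPERVISOR_sched_op(SCHEDOP_yield, NULL), for the
yield.

    #define SPIN_THRESHOLD 1024   /* arbitrary, for illustration */

    /* HYPERVISOR_sched_op() and SCHEDOP_yield come from the guest's
     * Xen hypercall headers. */
    static void pv_spin_lock(volatile int *lock)
    {
        for ( ; ; )
        {
            unsigned int i;

            /* Spin for a bounded number of attempts first. */
            for ( i = 0; i < SPIN_THRESHOLD; i++ )
                if ( !__sync_lock_test_and_set(lock, 1) )
                    return;               /* got the lock */

            /* Still contended: give up the pcpu so the (possibly
             * preempted) lock holder can run, rather than burning
             * the rest of our timeslice spinning. */
            HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
        }
    }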
The existing implementation of "yield" was simply to call into the
scheduler again. Unfortunately, for the credit scheduler, the result
is almost always that it picks the same vcpu to run again. After all,
that vcpu has the highest priority; it's the one that should run. So
yield is effectively a no-op.
So the next patch tested is:
* "yield": actually implement yield in the credit scheduler. If a
vcpu yields, and there are lower-priority vcpus on the runqueue, put
it behind one of the lower-priority vcpus. The runqueue is sorted by
priority every 30ms, so the longest this inversion can last is 30ms.
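Schematically, the runqueue-insertion side of that looks something
like the sketch below. This is just the idea, not the patch itself: it
borrows the list helpers and csched_vcpu field names from
sched_credit.c, but omits the yield-flag plumbing and locking.

    /* Sketch: insert a yielding vcpu behind the next lower-priority
     * vcpu on the runqueue, instead of ahead of it as usual. */
    static void runq_insert_yielding(struct list_head *runq,
                                     struct csched_vcpu *svc)
    {
        struct list_head *iter;

        list_for_each ( iter, runq )
        {
            const struct csched_vcpu *cur =
                list_entry(iter, struct csched_vcpu, runq_elem);

            if ( cur->pri < svc->pri )
            {
                /* Found a lower-priority vcpu: step past it so the
                 * yielder lands just behind it. */
                iter = iter->next;
                break;
            }
        }

        /* list_add_tail() inserts before 'iter'; if no lower-priority
         * vcpu was found, 'iter' is the list head and this appends at
         * the tail of the runqueue. */
        list_add_tail(&svc->runq_elem, iter);
    }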
The results when adding all of these together were impressive:
Pick,Push,Yield
VMs | Ave time | Tput
  1 |      74s | 1402
  2 |     129s | 1540
  3 |     196s | 1527
  4 |     267s | 1492
Note that the average runtime for 2 VMs is *less* than twice the
number for 1 VM -- the total system throughput is higher, as the
resources are being used more efficiently.
Also note, however, that the throughput for a single VM is lower -- by
almost 4%. So adding these mechanisms can help systems suffering from
concurrency hazard, but they also have a negative performance impact
on systems without concurrency hazard.
So to minimize impact, I did some tests to find out which patch was
helping, and how much. The pick and push patches both made
significant contributions when applied alone. However, the yield
patch outshone them both:
Yield
VMs | Ave time | Tput
  1 |      72s | 1389
  2 |     132s | 1514
  3 |     198s | 1508
  4 |     271s | 1470
The impact of the yield patch on single-VM throughput was much
smaller -- less than 1%. The throughput for higher numbers of VMs was
lower than with all three patches combined, but only by a marginal
amount -- less than 2% in all cases.
So of these three patches, I believe that the best option is to add
only the yield patch.
I'm attaching the other two patches to this e-mail for posterity, as
well as a gnumeric spreadsheet with all of the combinations
tested. (For those reading it, "ppy" is "pick+push+yield".)
I'll be sending the "yield" patch series shortly.
-George
scheduler.cpu_pick-avoids-redundancy.patch
scheduler.push-redundant-vcpus.patch
038.kodo3.4cpu.trunk-patch-test.summary.gnumeric