In the interest of openness (as well as in the interest of taking
advantage of all the smart people out there), I'm posting a very early
design prototype of the credit2 scheduler. We've had a lot of
contributors to the scheduler recently, so I hope that those with
interest and knowledge will take a look and let me know what they
think at a high level.
This first e-mail will discuss the overall goals: the target "sweet
spot" use cases to consider, measurable goals for the scheduler, and
the target interface / features. This is for general comment.
The subsequent e-mail(s?) will include some specific algorithms and
changes currently in consideration, as well as some bleeding-edge
patches. This will be for people who have a specific interest in the
details of the scheduling algorithms.
Please feel free to comment / discuss / suggest improvements.
1. Design targets
We have three general use cases in mind: server consolidation, virtual
desktop providers, and clients (e.g. XenClient).
For servers, our target "sweet spot" for which we will optimize is a
system with 2 sockets, 4 cores each socket, and SMT (16 logical cpus).
Ideal performance is expected to be reached at about 80% total system
cpu utilization; but the system should function reasonably well up to
a utilization of 800% (i.e., a load of 8).
For virtual desktop systems, we will have a large number of
interactive VMs with a lot of shared memory. Most of these will be
single-vcpu, or at most 2 vcpus.
For client systems, we expect to have 3-4 VMs (including dom0).
Systems will probably have a single socket with 2 cores and SMT (4
logical cpus). Many VMs will be using PCI pass-through to access
network, video, and audio cards. They'll also be running video and
audio workloads, which are extremely latency-sensitive.
2. Design goals
For each of the target systems and workloads above, we have some
high-level goals for the scheduler:
* Fairness. In this context, we define "fairness" as the ability to
get cpu time proportional to weight.
We want to try to make this true even for latency-sensitive
workloads such as networking, where long scheduling latency can
reduce throughput, and thus reduce the total amount of cpu time the
VM can effectively use.
* Good scheduling for latency-sensitive workloads.
To the degree we are able, we want this to be true even for
workloads which use a significant amount of cpu: that is, my audio
shouldn't break up if I start a cpu-hog process in the VM playing
the audio.
* HT-aware.
Running on a logical processor with an idle peer thread is not the
same as running on a logical processor with a busy peer thread. The
scheduler needs to take this into account when deciding "fairness"
(see the sketch after this list).
* Power-aware.
Using as many sockets / cores as possible can increase the total
cache size available to VMs, and thus (in the absence of inter-VM
sharing) increase total computing power; but keeping multiple
sockets and cores powered up also increases the electrical power
used by the system. We want a configurable way to balance
maximizing processing power against minimizing electrical power.
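
To make the HT-awareness goal concrete, here is a rough sketch of one
way the accounting could work. This is purely illustrative: the names
and the 70% busy-sibling discount below are my own assumptions, not
credit2 code. The idea is simply to charge a vcpu less for time it ran
while its hyperthread sibling was busy, since it got less effective
throughput out of the core:

#include <stdio.h>

/* Sketch only: CREDITS_PER_MS and the 70% busy-sibling discount are
 * made-up numbers for illustration, not credit2 constants. */
#define CREDITS_PER_MS        100
#define HT_BUSY_SIBLING_PCT    70

struct vcpu_acct {
    int credits;
};

/* Charge v for ms milliseconds of run time.  If its hyperthread
 * sibling was busy, v got less effective work done per ms, so a
 * "fair" accounting charges it proportionally less. */
static void burn_credits(struct vcpu_acct *v, int ms, int sibling_busy)
{
    int charge = ms * CREDITS_PER_MS;

    if (sibling_busy)
        charge = charge * HT_BUSY_SIBLING_PCT / 100;

    v->credits -= charge;
}

int main(void)
{
    struct vcpu_acct a = { .credits = 10000 }, b = { .credits = 10000 };

    burn_credits(&a, 10, 0);  /* ran 10ms alone on the core */
    burn_credits(&b, 10, 1);  /* ran 10ms sharing the core */
    printf("a: %d  b: %d\n", a.credits, b.credits);  /* a: 9000  b: 9300 */
    return 0;
}

The right discount factor is of course workload-dependent; 70% is just
a placeholder.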
3. Target interface:
The target interface will be similar to credit1:
* The basic unit is the VM "weight". When competing for cpu
resources, VMs will get a share of the resources proportional to their
weight. (e.g., two cpu-hog workloads with weights of 256 and 512 will
get 33% and 67% of the cpu, respectively; this arithmetic is worked
through in the sketch after this list.)
* Additionally, we will be introducing a "reservation" or "floor".
(I'm open to name changes on this one.) This will be a minimum
amount of cpu time that a VM can get if it wants it.
For example, one could give dom0 a "reservation" of 50%, but leave the
weight at 256. No matter how many other VMs run with a weight of 256,
dom0 will be guaranteed to get 50% of one cpu if it wants it.
* The "cap" functionality of credit1 will be retained.
This is a maximum amount of cpu time that a VM can get: i.e., a VM
with a cap of 50% will only get half of one cpu, even if the rest of
the system is completely idle.
* We will also have an interface to the cpu-power-vs-electrical-power
tradeoff.
This is yet to be defined. At the hypervisor level, it will probably
be a number representing the "badness" of powering up extra cpus /
cores. At the tools level, there will probably be the option of
either specifying the number directly, or of using one of two or
three pre-defined values {power, balance, green/battery}.
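
To make the interface above concrete, here is a hypothetical sketch of
the per-VM parameters and the weight arithmetic from the example above
(256 vs 512 -> 33% vs 67%). All of the names and the enum values are
my own illustrations, not a proposed ABI:

#include <stdio.h>

/* Hypothetical rendering of the power-vs-performance knob: pre-defined
 * settings mapping to a "badness" number for powering up extra
 * cpus / cores.  The values are illustrative only. */
enum power_pref { PREF_POWER = 0, PREF_BALANCE = 50, PREF_GREEN = 100 };

/* Hypothetical per-VM scheduling parameters mirroring the proposed
 * interface; the field names are mine, not credit2's. */
struct sched_params {
    unsigned int weight;  /* relative share when competing */
    unsigned int floor;   /* "reservation": guaranteed min % of one cpu */
    unsigned int cap;     /* maximum % of one cpu (0 = uncapped) */
};

/* Share of the cpu a cpu-hog VM i gets when all n VMs are competing:
 * proportional to weight. */
static double fair_share_pct(const struct sched_params *vms, int n, int i)
{
    unsigned int total = 0;
    int j;

    for (j = 0; j < n; j++)
        total += vms[j].weight;
    return 100.0 * vms[i].weight / total;
}

int main(void)
{
    struct sched_params vms[] = {
        { .weight = 256, .floor = 0, .cap = 0 },
        { .weight = 512, .floor = 0, .cap = 0 },
    };
    int i;

    for (i = 0; i < 2; i++)
        printf("vm%d: %.0f%% of the cpu\n", i, fair_share_pct(vms, 2, i));
    /* vm0: 33%  vm1: 67% */
    return 0;
}

One natural way to combine the three parameters would be to satisfy
each VM's floor first, divide the remaining time by weight, and clip
the result at each VM's cap; the sketch above only shows the weight
step.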