xen-devel

Re: [Xen-devel] Questions regarding Xen Credit Scheduler

To: Gaurav Dhiman <dimanuec@xxxxxxxxx>
Subject: Re: [Xen-devel] Questions regarding Xen Credit Scheduler
From: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Date: Mon, 12 Jul 2010 12:05:33 +0100
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Mon, 12 Jul 2010 04:07:56 -0700
In-reply-to: <AANLkTim2BYie1fZS00YO23XOZB3KRv8JFXmptbt9I-rp@xxxxxxxxxxxxxx>
References: <AANLkTiloe7jgMO49i72sF0MDFmuHJJeysEBb0oLVNono@xxxxxxxxxxxxxx> <AANLkTikh504vP27XP1SXtNANv2h1Z42RNDgEzRMjI-BK@xxxxxxxxxxxxxx> <AANLkTim2BYie1fZS00YO23XOZB3KRv8JFXmptbt9I-rp@xxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
On Sat, Jul 10, 2010 at 10:21 AM, Gaurav Dhiman <dimanuec@xxxxxxxxx> wrote:
> 1. __runq_insert: Insert according to credit as well as priority.
> Current code just looks at the priority, which is very coarse.
[snip]
> 3. csched_runq_sort: Sort according to credit.

I think these (both related to a sorted runqueue) are probably good
ideas.  The main thing to pay attention to is overhead: run some tests
with long runqueues, and see if there's any performance degradation.
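
Just to make the idea concrete, here's a rough, self-contained sketch of
the kind of ordering I mean -- this is *not* the actual Xen code; the
struct, the list handling, and the field names are simplified stand-ins
for csched_vcpu and the per-cpu runqueue:

#include <stdio.h>

struct vcpu_ent {
    int pri;                 /* higher value = higher priority */
    int credit;              /* remaining credit */
    struct vcpu_ent *next;
};

/* Keep the queue sorted by priority first, then by remaining credit. */
static void runq_insert(struct vcpu_ent **head, struct vcpu_ent *svc)
{
    struct vcpu_ent **pp = head;

    while ( *pp != NULL )
    {
        struct vcpu_ent *cur = *pp;

        if ( svc->pri > cur->pri ||
             (svc->pri == cur->pri && svc->credit > cur->credit) )
            break;
        pp = &cur->next;
    }

    svc->next = *pp;
    *pp = svc;
}

int main(void)
{
    struct vcpu_ent a = { 0, 50, NULL }, b = { 0, 120, NULL };
    struct vcpu_ent *runq = NULL;

    runq_insert(&runq, &a);
    runq_insert(&runq, &b);      /* same priority, more credit: goes first */
    printf("head credit = %d\n", runq->credit);   /* prints 120 */
    return 0;
}

In the real __runq_insert() this would just add a credit comparison to
the existing walk over the runqueue; the long-runqueue tests are about
making sure even that stays cheap.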

> 2. __runq_tickle: Tickle the CPU even if the new VCPU has same
> priority but higher amount of credits left. Current code just looks at
> the priority.
[snip]
> 5. csched_schedule: Always call csched_load_balance. In the
> csched_load_balance and csched_runq_steal functions, change the logic
> to grab a VCPU with higher credit. Current code just works on
> priority.

I'm much more wary of these ideas.  The problem here is that runqueue
tickling and load balancing aren't free -- IPIs can be expensive,
especially if your VMs are running with hardware virtualization.  In
fact, with the current scheduler you get a sort of n^2 effect: each
scheduling decision may end up poking other pcpus' runqueues, and the
rate of scheduling decisions itself grows with the number of runnable
vcpus, so the time the system spends doing IPIs for load balancing
grows with the square of the number of schedulable entities.  In
addition, frequent migration will reduce cache effectiveness and
increase congestion on the memory bus.

I presume you want to do this to decrease the latency?  Lee et al [1]
actually found that *decreasing* the cpu migrations of their soft
real-time workload led to an overall improvement in quality.  The
paper doesn't delve deeply into why, but it seems reasonable to
conclude that although the vcpus may have been able to start their
task sooner (although even that's not guaranteed -- it may have taken
longer to migrate than to get to the front of the runqueue), they
ended their task later, presumably due to cpu stalls on cacheline
misses and so on.

I think a much better approach would be:
* To have effective long-term placement, if possible: i.e., distribute
latency-sensitive vcpus across different pcpus so they don't compete
with each other
* If two latency-sensitive vcpus do end up sharing a cpu, use shorter
time-slices (see the sketch below).

But I think those need more research; it would be better to put that
effort into the new scheduler.
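
To make the second point a bit more concrete, here's a toy sketch of the
"shorter time-slices" idea.  The function and the numbers are entirely
made up for illustration -- how a vcpu gets classified as
latency-sensitive in the first place is exactly the part that needs
research:

/* Hypothetical: choose a timeslice for a pcpu based on how many
 * latency-sensitive vcpus are competing on its runqueue. */
#define TSLICE_DEFAULT_MS  30
#define TSLICE_SHORT_MS     5

static int pick_timeslice_ms(int nr_latency_sensitive_on_runq)
{
    /* Two or more latency-sensitive vcpus sharing a cpu: trade a little
     * throughput for responsiveness by shortening the slice. */
    if ( nr_latency_sensitive_on_runq >= 2 )
        return TSLICE_SHORT_MS;

    return TSLICE_DEFAULT_MS;
}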

> 4. csched_acct: If credit of a VCPU crosses 300, then set it to 300,
> not 0. I am still not sure why the VCPU is being marked as inactive?
> Can't I just update the credit and let it be active?

Did you read the whitepaper that I linked to, and/or watch my
presentation?  It has a lot of information about the logic behind the
algorithm: specifically, the tendency of this "credit-like" class of
algorithms toward credit divergence.  Please read it and let me know if
you have any questions.

The active / inactive distinction has to do with who gets credits.  If
you just divided credits equally among everyone, then eventually the VMs
that weren't using their credits would accumulate a lot (or be capped, as
you're suggesting).  Because we allow VMs to use "extra" cpu time if
it's not being used, those VMs will by definition burn more credits
than they earn, and will tend to drift further and further negative.

So what credit1 does is assume that all workloads fall into two
categories: "active" VMs, which consume as much cpu as they can, and
"inactive" (or "I/O-bound") VMs, which use almost no cpu.  "Inactive"
VMs essentially run at BOOST priority, and run whenever they want to.
Then the credit for each timeslice is divided among the "active" VMs.
This way the ones that are consuming cpu don't get too far behind.
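
As a toy model of that accounting step -- not the real csched_acct();
the names and numbers here are made up for illustration -- dividing each
accounting period's credit among only the active vcpus, weighted by
their weight, looks roughly like this:

#include <stdio.h>

struct toy_vcpu {
    int active;     /* 1 = active (cpu-hungry), 0 = inactive (I/O-bound) */
    int weight;     /* proportional-share weight */
    int credit;     /* accumulated credit */
};

/* Hand out one accounting period's worth of credit to the active vcpus,
 * in proportion to their weights; inactive vcpus earn nothing. */
static void acct_period(struct toy_vcpu *v, int n, int credit_per_period)
{
    int total_weight = 0;

    for ( int i = 0; i < n; i++ )
        if ( v[i].active )
            total_weight += v[i].weight;

    if ( total_weight == 0 )
        return;

    for ( int i = 0; i < n; i++ )
        if ( v[i].active )
            v[i].credit += credit_per_period * v[i].weight / total_weight;
}

int main(void)
{
    struct toy_vcpu v[3] = {
        { 1, 256, 0 },   /* active */
        { 1, 256, 0 },   /* active */
        { 0, 256, 0 },   /* inactive: earns nothing */
    };

    acct_period(v, 3, 300);
    printf("credits: %d %d %d\n", v[0].credit, v[1].credit, v[2].credit);
    return 0;
}

The divergence problem shows up when a vcpu burns more than its share
every period: its credit only ever drifts downward, which is what the
reset condition has to deal with.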

The problem of course, is that most server workloads fall in the
middle: they spend a significant time processing, but also a
significant time waiting for more network packets.

I looked at the idea of "capping" credit, as you say; but when I worked
out the algorithms by hand, the steady state was that all the VMs sat at
their cap all the time, which screwed up other aspects of the algorithm.
Credits need to be thrown away; my proposal was to divide the credits by
2, rather than setting them to 0.  That should be a good middle ground.
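
In code terms the change at the reset point is tiny.  A sketch -- the
macro and helper are illustrative names, not the real ones, and this is
not the actual patch:

/* Applied when a vcpu's credit crosses the upper bound. */
#define CREDIT_CAP 300

static int reset_credit(int credit)
{
    if ( credit <= CREDIT_CAP )
        return credit;

    /* return 0;           <- what credit1 does today                  */
    /* return CREDIT_CAP;  <- the capping idea: steady state is every
                               VM sitting at the cap                   */
    return credit / 2;     /* proposed: throw half away instead of all */
}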

These things are actually really subtle.  I've spent hours and hours
with pencil-and-paper, working out different algorithms by hand, to
see exactly what effect the different changes would have.  I even
wrote a discrete event simulator, to make the process a bit faster.
(But of course, to understand why things look the way they do, you
still have to trace through the algorithm manually).  If you're really
keen, I can tar it up and send it to you. :-)

So in summary:
* Please do post a sort-runq-by-credit patch, preferably along with
some benchmarks showing a lack of performance impact
* I don't think increased load-balancing is the right approach.  It
won't scale, and probably won't even improve throughput.  I wouldn't
approve these changes without significant large-scale testing.
* I think the "reset condition" could use revising.  My sense is that
leaving credit at the cap isn't the best approach.  If you can convince
me that it works OK (including test results as well as graphs of
consumed credit, &c), I'll consider it; but I think dividing in half
will be better.  I'll post a patch later today.
* Let me know if you want my hacked-up scheduler simulator to play with. :-)

Thanks again for your help,
 -George

[1] Min Lee, A. S. Krishnakumar, P. Krishnan, Navjot Singh, Shalini
Yajnik, "Supporting Soft Real-Time Tasks in the Xen Hypervisor", VEE
2010, Pittsburgh, PA, March 17-19, 2010.


> Do you think, these ideas make sense? Am I missing out on something?
>
>> Regarding specific things:
>>
>> One thing you didn't catch is that credits before 4.0 are debited
>> probabilistically, a full 10ms at a time, by the very timer tick that
>> moves a vcpu from "inactive" to "active"; so when you make the switch
>> from "active" to "inactive", you don't start out at 0, but at -10ms.
>
> Yes, I noticed this; point 4 above tries to address this. As I
> mentioned above, I am not sure why it is being marked inactive in
> first place?
>
>> It turns out that's not only bad for latency-sensitive processes, but
>> it's also a security bug; so there's a patch in 4.0 (not sure whether
>> it's been backported to 3.4) to do accurate accounting based on RDTSC
>> reads instead of probabilistic-based accounting based on timer ticks.
>
> Yes, I have seen Xen 4.0 code; it does deterministic accounting by
> recording the amount of time spent on the CPU by a VCPU.
>
>> So the answer to #3 is:
>> * The "accurate credit" patch is in 4.0, maybe 3.4.  That should help 
>> somewhat.
>> * I have a patch that will change the "reset condition"; I'm
>> considering submitting it.  I'd appreciate testing / feedback.  (I'll
>> send this in a separate e-mail.)
>
> Please do send this.
>
>> * There is no patch yet that will fix the sort-by-priority, but it
>> should be simple and straightforward to implement.  I'll support
>> putting it in once I'm reasonably convinced that it helps and doesn't
>> hurt too much.  If you were to help out with the implementation and
>> testing, that will happen a lot faster. :--)
>
> I am trying to implement the ideas I mentioned above.  Your feedback
> would be very helpful.
>
> Thanks,
> -Gaurav
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel