Re: [Xen-devel] Cpu pools discussion

To:	George Dunlap <dunlapg@xxxxxxxxx>
Subject:	Re: [Xen-devel] Cpu pools discussion
From:	Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date:	Tue, 28 Jul 2009 07:40:54 +0200
Cc:	xen-devel@xxxxxxxxxxxxxxxxxxx, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Organization:	Fujitsu Technology Solutions

George Dunlap wrote:
> Keir (and community),
> 
> Any thoughts on Juergen Gross' patch on cpu pools?
> 
> As a reminder, the idea is to allow "pools" of cpus that would have
> separate schedulers.  Physical cpus and domains can be moved from one
> pool to another only by an explicit command.  The main purpose Fujitsu
> seems to have is to allow a simple machine "partitioning" that is more
> robust than using simple affinity masks.  Another potential advantage
> would be the ability to use different schedulers for different
> purposes.
> 
> For my part, it seems like they should be OK.  The main thing I don't
> like is the ugliness related to continue_hypercall_on_cpu(), described
> below.
> 
> Juergen, could you remind us what the advantages of pools in the
> hypervisor are, versus just having affinity masks (with maybe some sugar
> in the toolstack)?

Sure.

Our main reason for introducing pools was the inability of the current
scheduler(s) to schedule domains according to their weights while restricting
the domains to a subset of the physical processors via pinning.
I think it is virtually impossible to find a general solution to this
problem without some sort of pooling (if somebody proves me wrong here,
I'll gladly take this "perfect" scheduler instead of pools :-) ).

So while pools were originally motivated by missing functionality, they
bring some further benefits:
+ the possibility to use different schedulers for different domains on the same
  machine (remember the discussion about bcredit?). Zhigang has already posted
  a request for this feature.
+ less lock contention on huge machines with many processors
+ pools could be a good base for NUMA-aware scheduling policies
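
To make the partitioning idea concrete, here is a minimal sketch in C of the bookkeeping that pools imply. It assumes a toy machine with 4 cpus, 2 domains and 2 pools. All names (struct cpupool, pool_move_cpu(), pool_move_domain(), domain_may_run_on(), and the scheduler labels) are hypothetical and are not taken from the patch. The model rests on three rules:
+ every cpu and every domain belongs to exactly one pool
+ membership only changes through an explicit move
+ each pool's scheduler only ever sees its own cpus, so weights can be honoured without per-vcpu pinning

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of cpu pools; all names are illustrative only. */
#define NR_CPUS    4
#define NR_DOMAINS 2
#define NR_POOLS   2

struct cpupool {
    const char *sched;   /* each pool has its own scheduler instance */
};

static struct cpupool pools[NR_POOLS] = {
    { "credit" },        /* pool 0 (label is illustrative) */
    { "sedf" },          /* pool 1 (label is illustrative) */
};

static int cpu_pool[NR_CPUS];        /* pool owning each physical cpu */
static int domain_pool[NR_DOMAINS];  /* pool each domain belongs to */

/* Explicit admin command: move a physical cpu to another pool. */
static void pool_move_cpu(int cpu, int pool)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    assert(pool >= 0 && pool < NR_POOLS);
    cpu_pool[cpu] = pool;
}

/* Explicit admin command: move a domain to another pool. */
static void pool_move_domain(int dom, int pool)
{
    assert(dom >= 0 && dom < NR_DOMAINS);
    assert(pool >= 0 && pool < NR_POOLS);
    domain_pool[dom] = pool;
}

/* A domain is only ever scheduled on cpus of its own pool, so each pool's
 * scheduler balances weights over a fixed cpu set, without pinning. */
static bool domain_may_run_on(int dom, int cpu)
{
    return cpu_pool[cpu] == domain_pool[dom];
}

/* Scheduler responsible for a given domain. */
static const char *domain_scheduler(int dom)
{
    return pools[domain_pool[dom]].sched;
}

/* Number of cpus a pool's scheduler distributes among its domains. */
static int pool_cpu_count(int pool)
{
    int n = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_pool[cpu] == pool)
            n++;
    return n;
}

/* Example partition: cpus 0-1 and dom0 in pool 0, cpus 2-3 and dom1 in pool 1. */
static void setup_example(void)
{
    pool_move_cpu(0, 0);
    pool_move_cpu(1, 0);
    pool_move_cpu(2, 1);
    pool_move_cpu(3, 1);
    pool_move_domain(0, 0);
    pool_move_domain(1, 1);
}
```

With the example partition, dom0 can run on cpus 0 and 1 only. Moving cpu 3 into pool 0 with an explicit command widens that set without touching any domain's affinity.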

> 
> Re the ugly part of the patch, relating to continue_hypercall_on_cpu():
> 
> Domains are assigned to a pool, so
> if continue_hypercall_on_cpu() is called for a cpu not in the domain's
> pool, you can't just run it normally.  Juergen's solution (IIRC) was to
> pause all domains in the other pool, temporarily move the cpu in
> question to the calling domain's pool, finish the hypercall, then move
> the cpu in question back to the other pool.
> 
> Since there are a lot of antecedents in that, let's take an example:
> 
> Two pools; Pool A has cpus 0 and 1, pool B has cpus 2 and 3.
> 
> Domain 0 is running in pool A, domain 1 is running in pool B.
> 
> Domain 0 calls "continue_hypercall_on_cpu()" for cpu 2.
> 
> Cpu 2 is in pool B, so Juergen's patch:
>  * Pauses domain 1
>  * Moves cpu 2 to pool A
>  * Finishes the hypercall
>  * Moves cpu 2 back to pool B
>  * Unpauses domain 1
> 
> That seemed a bit ugly to me, but I'm not familiar enough with the use
> cases or the code to know if there's a cleaner solution.
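
The five steps in George's example can be sketched as follows. This is a toy model in C, not code from the patch:
+ cpu and domain membership are plain arrays
+ pausing is modelled as a per-domain counter
+ continue_on_cpu_via_pool(), adjust_pool_pause() and record_state() are hypothetical names

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of George's example; names are illustrative only. */
#define NR_CPUS    4
#define NR_DOMAINS 2

static int cpu_pool[NR_CPUS]       = { 0, 0, 1, 1 };  /* pool A = 0, pool B = 1 */
static int domain_pool[NR_DOMAINS] = { 0, 1 };        /* dom0 in A, dom1 in B */
static int pause_count[NR_DOMAINS];                   /* > 0 means paused */

typedef void (*hypercall_fn)(int cpu, void *arg);

/* Pause (delta = +1) or unpause (delta = -1) every domain of a pool. */
static void adjust_pool_pause(int pool, int delta)
{
    for (int d = 0; d < NR_DOMAINS; d++)
        if (domain_pool[d] == pool)
            pause_count[d] += delta;
}

static void continue_on_cpu_via_pool(int caller, int cpu, hypercall_fn fn, void *arg)
{
    int home  = domain_pool[caller];
    int owner = cpu_pool[cpu];

    if (owner == home) {          /* cpu already usable by the caller */
        fn(cpu, arg);
        return;
    }
    adjust_pool_pause(owner, +1); /* pause the other pool's domains (dom1) */
    cpu_pool[cpu] = home;         /* move the cpu into the caller's pool */
    fn(cpu, arg);                 /* finish the hypercall on that cpu */
    cpu_pool[cpu] = owner;        /* move the cpu back to its pool */
    adjust_pool_pause(owner, -1); /* unpause the other pool's domains */
}

/* Example callback: records the state observed while running on the cpu
 * (tied to this example's dom1). */
static int seen_pool = -1;
static int seen_dom1_pause = -1;

static void record_state(int cpu, void *arg)
{
    (void)arg;
    seen_pool = cpu_pool[cpu];
    seen_dom1_pause = pause_count[1];
}
```

A pause counter is used rather than a flag so that a domain already paused for some other reason stays paused after the borrowed cpu is returned.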

Some thoughts on this topic:

The continue_hypercall_on_cpu() function is needed on x86 for loading new
microcode into the processor. The source buffer of the new microcode is
located in dom0 memory, so dom0 has to run on the physical processor the new
code is loaded into (otherwise the buffer wouldn't be accessible).
We could avoid continue_hypercall_on_cpu() entirely by copying the microcode
into a hypervisor buffer and using on_selected_cpus() instead.
The other users (cpu hotplug and acpi_enter_sleep) would have to switch to other
solutions as well.
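
Here is a minimal sketch of that alternative. It assumes the update can be split into a copy step and a per-cpu apply step, and the cpus are only simulated:
+ run_on_selected_cpus() is a hypothetical stand-in for on_selected_cpus(), whose real signature is not reproduced here. It simply calls the function once per selected cpu; the real one interrupts the other cpus.
+ this_cpu stands in for the current cpu number.
+ malloc()/memcpy() stand in for a hypervisor allocation and copy_from_guest().

The point is that the per-cpu step reads only the hypervisor-owned copy, so dom0 never has to run on the target cpu.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4

static unsigned char loaded[NR_CPUS][16];  /* simulated per-cpu microcode */
static int this_cpu = -1;                  /* simulated current cpu number */

/* Hypothetical stand-in for on_selected_cpus(): runs func sequentially
 * "on" every cpu whose bit is set in mask. */
static void run_on_selected_cpus(unsigned int mask, void (*func)(void *), void *info)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (mask & (1u << cpu)) {
            this_cpu = cpu;
            func(info);
        }
    }
    this_cpu = -1;
}

struct ucode_buf {
    unsigned char *data;   /* hypervisor-owned copy of the image */
    size_t len;
};

/* Per-cpu step: only touches the hypervisor buffer, never dom0 memory. */
static void apply_ucode(void *info)
{
    const struct ucode_buf *ub = info;
    memcpy(loaded[this_cpu], ub->data, ub->len);
}

/* Copy from guest (dom0) memory first, then apply on the selected cpus. */
static int update_microcode(const unsigned char *guest_buf, size_t len, unsigned int mask)
{
    struct ucode_buf ub;

    if (len == 0 || len > sizeof(loaded[0]))
        return -1;
    ub.data = malloc(len);            /* stands in for a hypervisor allocation */
    if (ub.data == NULL)
        return -1;
    memcpy(ub.data, guest_buf, len);  /* stands in for copy_from_guest() */
    ub.len = len;
    run_on_selected_cpus(mask, apply_ucode, &ub);
    free(ub.data);
    return 0;
}
```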

BTW: continue_hypercall_on_cpu() exists on x86 only, and it isn't really much
better than my usage of it:
- remember old pinning state of current vcpu
- pin it temporarily to the cpu it should continue on
- continue the hypercall
- remove temporary pinning
- re-establish old pinning (if any)
Pretty much the same as my solution above ;-)
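
The pinning sequence listed above can be sketched the same way. This is again a toy model with hypothetical names:
+ struct vcpu here is a simplified stand-in, not Xen's.
+ set_affinity() fakes migration by moving the vcpu to the lowest allowed cpu.
+ The last two steps collapse into one call, because the saved mask is simply reinstated.

```c
#include <assert.h>

#define NR_CPUS 4

/* Hypothetical, simplified vcpu; not Xen's struct vcpu. */
struct vcpu {
    unsigned int affinity;  /* bitmask of cpus the vcpu may run on */
    int processor;          /* cpu it currently runs on */
};

typedef long (*cont_fn)(void *arg);

/* Simulated migration: if the current cpu is no longer allowed, move the
 * vcpu to the lowest-numbered allowed cpu. */
static void set_affinity(struct vcpu *v, unsigned int mask)
{
    assert(mask & ((1u << NR_CPUS) - 1));
    v->affinity = mask;
    if (mask & (1u << v->processor))
        return;
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (mask & (1u << cpu)) {
            v->processor = cpu;
            return;
        }
    }
}

static long continue_on_cpu_via_pinning(struct vcpu *v, int cpu, cont_fn fn, void *arg)
{
    unsigned int saved = v->affinity;  /* remember old pinning state */
    long rc;

    set_affinity(v, 1u << cpu);        /* pin temporarily to the target cpu */
    rc = fn(arg);                      /* continue the hypercall there */
    set_affinity(v, saved);            /* drop temporary pin, restore old one */
    return rc;
}

/* Example continuation: reports which cpu the vcpu is running on. */
static long report_cpu(void *arg)
{
    return ((const struct vcpu *)arg)->processor;
}
```

Structurally this is the same save, redirect, run, restore pattern as the pool-based approach, applied to one vcpu's affinity instead of a cpu's pool membership.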

So I would suggest eliminating continue_hypercall_on_cpu() completely if you
feel uneasy about my solution.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel