xen-devel

Re: [Xen-devel] Dom0 losing interrupts???

To: George Dunlap <George.Dunlap@xxxxxxxxxxxxx>
Subject: Re: [Xen-devel] Dom0 losing interrupts???
From: Juergen Gross <juergen.gross@xxxxxxxxxxxxxx>
Date: Mon, 14 Feb 2011 12:46:56 +0100
Cc: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxxxx>
Organization: Fujitsu Technology Solutions
On 02/14/11 12:21, George Dunlap wrote:
> My sense is that:
> * Pinning N vcpus to N-M pcpus (where M is a significant fraction of
> N) is just a really bad idea; it would be better just not to do that.

I just wanted to make sure the interrupts are not lost due to the cpupool
operation itself.
So I tried with an extreme configuration and was proved right :-)

> It would be ideal if somehow when dom0's cpu pool shrinks, it
> automatically offlines an appropriate number of vcpus; but it
> shouldn't be difficult for an administrator to do that themselves.

I've sent a patch for the cpupool-numa-split case, which will always remove a
significant number of physical cpus for dom0.
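
To illustrate the manual step discussed above (offlining dom0 vcpus after its
cpupool shrinks), here is a minimal sketch. It assumes dom0 is a Linux guest
that exposes the usual /sys/devices/system/cpu/cpuN/online files and that
vcpu 0 should stay online; the function names are hypothetical and are not
part of Xen or of the patch mentioned here. It only builds the commands, it
does not run them.

```python
from typing import List


def vcpus_to_offline(n_vcpus: int, n_pcpus: int) -> List[int]:
    """Return the vcpu indices to take offline so that at most
    n_pcpus vcpus remain online (assumption: vcpu 0 is always kept,
    and the highest-numbered vcpus are removed first)."""
    if n_vcpus < 1 or n_pcpus < 1:
        raise ValueError("need at least one vcpu and one pcpu")
    keep = min(n_vcpus, n_pcpus)
    # Count down from the last vcpu to the first one beyond 'keep'.
    return list(range(n_vcpus - 1, keep - 1, -1))


def offline_commands(n_vcpus: int, n_pcpus: int) -> List[str]:
    """Build the shell commands an administrator could run in dom0
    (assumption: the standard Linux CPU hotplug sysfs files exist)."""
    return [
        f"echo 0 > /sys/devices/system/cpu/cpu{v}/online"
        for v in vcpus_to_offline(n_vcpus, n_pcpus)
    ]


if __name__ == "__main__":
    # Example: dom0 has 8 vcpus but its pool now has only 4 physical cpus.
    for cmd in offline_commands(8, 4):
        print(cmd)
```

With 8 vcpus and a pool of 4 physical cpus, the sketch lists vcpus 7 down to
4 for offlining, leaving one online vcpu per remaining physical cpu.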

> * On average, a vcpu shouldn't have to wait more than 60ms or so for
> an interrupt.  It seems like there's a non-negligible possibility that
> there's some kind of bug in the interrupt delivery and handling,
> either on the Xen side or the Linux side (or as Jan pointed out, a bug
> in the driver).  In that case, doing something in the scheduler isn't
> actually fixing the problem, it's just making it less likely to
> happen.  (NB that we've had intermittent failures in the xen.org
> testing infrastructure with what looks like it might be missed
> interrupts as well -- and those weren't on heavily loaded boxes.)

Any idea what I could do to help? Our larger test machines are not just
idling, but I could use one from time to time without much trouble.
It's rather easy for me to reproduce the problem; OTOH, it should be easy for
others with a reasonably large machine, too.

> * Even if it is ultimately a scheduler bug, understanding exactly what
> the scheduler is doing and why is key to making a proper fix.  It's
> possible that there's just a simple quirk in the algorithm, such that
> a general fix will make everything work better without needing to
> introduce a special case for hardware interrupts.
> * I'm not opposed in principle to a mechanism which will prioritize
> vcpus awaiting hardware interrupts.  But I am wary of guessing what
> the problem is and then introducing a patch without proper root-cause
> analysis.  Even if it seems to fix the immediate problem, it may
> simply be masking the real problem, and may also cause problems of its
> own.  Behavior of the scheduler is hard enough to understand already,
> and every special case makes it even harder.

I absolutely agree!


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@xxxxxxxxxxxxxx
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
