WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
 
   
 

xen-devel

RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN

To: Christoph Egger <Christoph.Egger@xxxxxxx>
Subject: RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Date: Fri, 20 Feb 2009 10:53:11 +0800
Accept-language: en-US
Cc: "Frank.Vanderlinden@xxxxxxx" <Frank.Vanderlinden@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Ke, Liping" <liping.ke@xxxxxxxxx>, Gavin Maltby <Gavin.Maltby@xxxxxxx>
Delivery-date: Thu, 19 Feb 2009 18:54:32 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <200902191725.32556.Christoph.Egger@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <C5BF30B3.2C2B%keir.fraser@xxxxxxxxxxxxx> <200902181905.55015.Christoph.Egger@xxxxxxx> <E2263E4A5B2284449EEBD0AAB751098401C7AAC7A0@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <200902191725.32556.Christoph.Egger@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcmSrsG4ljM7MNXKRZuaXsBGozdGlgAS9YIQ
Thread-topic: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> Ok, here is a different interpretation of what is correctable and
> uncorrectable. Uncorrectable in your interpretation means neither hardware
> nor software can do anything.
> Uncorrectable in my interpretation means the hardware can't correct it,
> but software may have more information and correct it.

Yes. Maybe "fatal" is a more appropriate name here.

> 
>> The main reason we need these flags is that several steps are required for
>> MCA handling. For example, when multiple MCEs happen on multiple CPUs,
>> first each CPU checks its own severity, and second we need to find the CPU
>> with the highest severity and take action. For example, CPU A may get
>> unrecoverable while CPU B gets recoverable; they will check the information
>> and the result, and the final solution will be unrecoverable.
> 
> I brought up an example of a broken memory page for my argumentation,
> you bring up a broken CPU for your argumentation.
> 
> We need to find a common denominator to compare.
> 
> If a CPU is completely broken and you are on UP, then game is over. Not
> even a reboot can help. On a SMP system, offline the CPU and inform Dom0.

Sorry, I didn't get the relationship between the flags and the comparison of
the two examples :$
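
To make the purpose of the flags concrete, the two-step handling described in the quoted text above (each CPU checks its own severity, then the most severe result decides the action) could be sketched roughly as below. This is only an illustration: the enum values and the function name are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical severity levels, ordered from least to most severe.
 * These names are illustrative only; they are not the flags in the patch. */
enum mce_severity {
    MCE_SEV_NONE = 0,
    MCE_SEV_RECOVERABLE,
    MCE_SEV_UNRECOVERABLE
};

/* Step 2 of the handling: given the severity each CPU computed for itself
 * in step 1, return the most severe one. The final action is then chosen
 * from this single combined result. */
enum mce_severity mce_worst_severity(const enum mce_severity *per_cpu,
                                     size_t ncpus)
{
    enum mce_severity worst = MCE_SEV_NONE;
    size_t i;

    for (i = 0; i < ncpus; i++) {
        if (per_cpu[i] > worst)
            worst = per_cpu[i];
    }
    return worst;
}
```

With CPU A reporting unrecoverable and CPU B reporting recoverable, the combined result is unrecoverable, which matches the example in the quoted text.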

>> Currently we only plan to support these two types. Do you have plans for
>> other recovery actions? And would such an action be done better in Dom0
>> than in Xen?
> 
> Yes!! Solaris maintains a list of broken pages which is even persistent
> across reboot when the serial number of the DIMM didn't change.
> For doing page offlining properly, SUN should design a hypercall allowing
> the Dom0 to give Xen this list as early as possible at boot time.

We have a patch to support page offlining (sent as an RFC to the mailing
list), and it already exports a hypercall for Dom0 to ask Xen to offline pages
(this is for proactive action on CE errors from Dom0). Also, as Frank
suggested, we will add a hypercall for Dom0 to get a page's offline status, so
it should be OK.

> Further, with our Shanghai CPU, we can disable certain parts of its L3
> cache. Instead of offlining that broken CPU completely, just disable the
> broken part of it. The registers for this are in PCI config space.
> Since Xen delegates PCI access to Dom0, Dom0 can do that.

Sorry, I am not familiar with Shanghai, but I'm a bit surprised that when an
error happens in the cache, we would transfer control to Dom0 and wait for
Dom0's MCA handler to take action to disable the cache. That is a really long
code path. Per my understanding, if there is an issue in the cache, we should
clear/disable the cache ASAP to avoid more severe results, so this is an
extreme example of a case where Xen should handle the MCA. Or maybe I missed
something important in this feature?

BTW, I want to clarify that this patch is for #MC handling (i.e. the
"uncorrected" error in your terms). For hardware-correctable errors (i.e.
"correctable"), Xen will do nothing but pass them to Dom0 as a vIRQ, as our
previous patch
(http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00970.html )
showed, because a CE will not impact the system. So if the "cache index
disable" is meant to disable part of the cache after too many CEs (Correctable
Errors) as a proactive action, I think we are on the same page.
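
As a rough sketch of the split described above (#MC handled by Xen itself, corrected errors only passed on to Dom0 as a vIRQ), the routing decision looks something like this. The type and function names are hypothetical and do not correspond to real Xen symbols; the sketch shows only the policy, not the actual handler code.

```c
/* Hypothetical event classes: an uncorrected error is reported through the
 * #MC exception, a corrected error (CE) is found via CMCI or polling. */
enum mca_event {
    MCA_EVENT_MC_EXCEPTION,
    MCA_EVENT_CORRECTED
};

/* Hypothetical outcomes of the routing decision. */
enum mca_action {
    MCA_ACTION_HANDLE_IN_XEN,        /* Xen takes the recovery action itself */
    MCA_ACTION_FORWARD_VIRQ_TO_DOM0  /* Xen only notifies Dom0 via a vIRQ    */
};

/* Decide who acts on an MCA event. Corrected errors do not impact the
 * system, so they are only forwarded; any proactive step (such as offlining
 * a page) is then left to Dom0. */
enum mca_action mca_route(enum mca_event ev)
{
    if (ev == MCA_EVENT_MC_EXCEPTION)
        return MCA_ACTION_HANDLE_IN_XEN;
    return MCA_ACTION_FORWARD_VIRQ_TO_DOM0;
}
```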

I attached two foils that are part of our Xen summit presentation. Page 1 is
mainly for #MC handling; page 2 is for CE handling (through CMCI or polling).
Page 1 is described clearly in the patch. Page 2 is what our previous patch
did.

Thanks
-- Yunhong Jiang

> 
> Christoph
> 
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632

Attachment: MCA.pdf
Description: MCA.pdf

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel