   
 
 
 
   
 

xen-devel

RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN

To: Christoph Egger <Christoph.Egger@xxxxxxx>
Subject: RE: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Date: Thu, 19 Feb 2009 17:13:18 +0800
Accept-language: en-US
Cc: "Frank.Vanderlinden@xxxxxxx" <Frank.Vanderlinden@xxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>, "Ke, Liping" <liping.ke@xxxxxxxxx>, Gavin Maltby <Gavin.Maltby@xxxxxxx>
Delivery-date: Thu, 19 Feb 2009 01:14:22 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <200902181905.55015.Christoph.Egger@xxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <C5BF30B3.2C2B%keir.fraser@xxxxxxxxxxxxx> <4999A94D.5020500@xxxxxxx> <E2263E4A5B2284449EEBD0AAB751098401C79FB2D2@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <200902181905.55015.Christoph.Egger@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcmR85zERjIxrd+yQJGzyhjk5bJZjwAerW9g
Thread-topic: [Xen-devel] Re: [RFC] RAS(Part II)--MCA enalbing in XEN
xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> On Tuesday 17 February 2009 07:41:29 Jiang, Yunhong wrote:
>> I think the major differences are: a) how to handle the #MC, i.e.
>> reset the system, decide which components are impacted, and take recovery
>> actions such as page offlining; b) how to handle errors that impact a
>> guest. As for other items like log/telemetry, I think our implementation
>> does not differ much from the current implementation.
> 
> The hardware doesn't know what recovery actions the software can take.
> If page A is faulty, and software maintains a copy in page B, then
> software can turn an uncorrectable error into a correctable one.
> If the hardware is aware of that copy (memory mirroring done by the memory
> controller), then the hardware itself turns the uncorrectable error
> into a correctable one and reports a correctable error.
> 
> Therefore, I don't see why flags other than correctable and uncorrectable
> are needed at all.

Christoph, thanks for your reply.

I think recoverable means the VMM/OS can take a recovery action such as page
offlining, while unrecoverable means the VMM/OS can't do anything and we have
to reboot. The main reason we need these flags is that MCA handling requires
several steps. For example, when multiple MCEs hit multiple CPUs, first each
CPU checks its own severity, and second we need to find the most severe CPU
and take action accordingly. For example, CPU A may get unrecoverable while
CPU B gets recoverable; they will check the information and the result, and
the final result will be unrecoverable.
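
The two-step check described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the proposed Xen code: the
enum, its values, and the function name mce_worst_severity are all hypothetical,
and the only property that matters is that severities are ordered so the worst
per-CPU verdict wins.

```c
#include <stddef.h>

/* Hypothetical severity levels (not taken from Xen), ordered so that a
 * larger value means a worse error. */
enum mce_severity {
    MCE_SEV_NONE = 0,
    MCE_SEV_CORRECTED,
    MCE_SEV_RECOVERABLE,    /* VMM/OS can act, e.g. offline a page */
    MCE_SEV_UNRECOVERABLE   /* nothing can be done, reboot */
};

/* Step 1 is assumed to have happened already: each CPU has classified its
 * own error and stored the verdict in per_cpu[i].
 * Step 2: scan all verdicts and return the most severe one. */
static enum mce_severity
mce_worst_severity(const enum mce_severity *per_cpu, size_t ncpus)
{
    enum mce_severity worst = MCE_SEV_NONE;
    size_t i;

    for (i = 0; i < ncpus; i++) {
        if (per_cpu[i] > worst)
            worst = per_cpu[i];
    }
    return worst;
}
```

With CPU A reporting MCE_SEV_UNRECOVERABLE and CPU B reporting
MCE_SEV_RECOVERABLE, the combined result is MCE_SEV_UNRECOVERABLE, which
matches the example above.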

> 
> 
> After some thinking about taking some quick actions, I can
> agree to it if it meets the condition below. Be aware, error analysis
> is highly CPU vendor and even CPU family/model specific. Doing a
> complete analysis as Solaris does blows Xen up a *lot*.

I haven't checked the Solaris code, so can Gavin or Frank give us more
information? At least currently it will not be large AFAIK, and if we do need
model-specific support (I don't know of such a requirement now, and I suppose
it will not be common if it exists; please correct me if I'm wrong), dom0 can
inform Xen about it.
 
> 
> Therefore, a *cheap* error analysis must be enough to figure out
> if recovery actions like page offlining or CPU offlining
> are *obviously* the only right thing to do.

Currently we only plan to support these two types. Do you have plans for other
recovery actions? And would such actions be better done in Dom0 than in Xen?

Thanks
-- Yunhong Jiang

> 
> If this is not the case, then let Dom0 decide what to do.
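
A rough sketch of the policy discussed in this thread (do a cheap analysis in
Xen, act only when page offlining or CPU offlining is obviously right, and
otherwise leave the decision to Dom0) might look like the code below. Every
name here is hypothetical: struct mce_summary, its fields, and
mce_quick_triage are invented for illustration and do not correspond to
existing Xen interfaces. How the fields would be derived from the MCA bank
registers is vendor and model specific and is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical outcome of the quick triage. */
enum mce_action {
    MCE_ACT_PAGE_OFFLINE,
    MCE_ACT_CPU_OFFLINE,
    MCE_ACT_DEFER_TO_DOM0
};

/* Hypothetical, already-decoded summary of one error record. */
struct mce_summary {
    bool memory_error;   /* error is confined to a memory page */
    bool addr_valid;     /* record carries a usable physical address */
    bool cpu_error;      /* error is confined to one CPU */
};

/* Cheap check: pick an action only when exactly one of the two supported
 * recovery actions clearly applies; anything ambiguous goes to Dom0. */
static enum mce_action mce_quick_triage(const struct mce_summary *s)
{
    if (s->memory_error && s->addr_valid && !s->cpu_error)
        return MCE_ACT_PAGE_OFFLINE;
    if (s->cpu_error && !s->memory_error)
        return MCE_ACT_CPU_OFFLINE;
    return MCE_ACT_DEFER_TO_DOM0;
}
```

The design point is the fall-through: the function never tries to be clever
about mixed or incomplete records, so the in-hypervisor logic stays small and
the detailed analysis remains in Dom0.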

> 
> Christoph
> 
> 
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
