

Re: [Xen-devel] RFC: MCA/MCE concept

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] RFC: MCA/MCE concept
From: "Christoph Egger" <Christoph.Egger@xxxxxxx>
Date: Wed, 30 May 2007 09:45:50 +0200
Cc: Jan Beulich <jbeulich@xxxxxxxxxx>
In-reply-to: <465D4190.76E4.0078.0@xxxxxxxxxx>
Organization: AMD / OSRC
References: <200705291732.46709.Christoph.Egger@xxxxxxx> <465D4190.76E4.0078.0@xxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: KMail/1.9.6
On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >case I) - Xen receives an MCE from the CPU
> >
> >1) Xen MCE handler figures out if error is a correctable error (CE)
> >    or uncorrectable error (UE)
> >2a) error == CE:
> >     Xen notifies Dom0 if Dom0 installed an MCA event handler
> >     for statistical purpose
> >2b) error == UE and UE impacts Xen or Dom0:
> A very important aspect here is how you want to classify what impact an
> uncorrectable has - generally, I can see very few situations where you
> could confine the impact to a sub-portion of the system (i.e. a single
> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> system, the question just is how likely it is that you can get a meaningful
> message out (to screen, serial, or logs) that can help analyze the problem
> afterwards. If it is somewhat likely, then dom0 should be involved,
> otherwise Xen should just shut down the system.

The best help here comes from HW features for handling errors.
AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.

CPUs such as the SPARC feature data poisoning. That would be the
most useful technique to use here.

Maybe this line:

> >     Xen does some self-healing

should be this:

            Xen *tries* to do some self-healing
> >         and notifies Dom0 on success if Dom0 installed MCA event handler
> >         or Xen panics on failure

The first implementation can just panic here. The self-healing will be
implemented and improved over time.
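
To make steps 1), 2a) and 2b) concrete, here is a minimal sketch of the
decision logic, not a proposal for the actual handler. The VAL and UC bit
positions are the architectural MCi_STATUS bits; everything else
(mce_classify, mce_decide_xen, the action codes, the self-heal callback)
is a hypothetical name invented for illustration and does not exist in Xen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural bits of an x86 MCi_STATUS register. */
#define MCI_STATUS_VAL (1ULL << 63) /* register holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61) /* error was not corrected by hardware */

enum mce_class { MCE_NONE, MCE_CE, MCE_UE };

/* Hypothetical outcome codes; none of these names exist in Xen. */
enum mce_action { ACT_IGNORE, ACT_NOTIFY_DOM0, ACT_PANIC };

/* Step 1: classify the error from the bank status. */
static enum mce_class mce_classify(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MCE_NONE;
    return (status & MCI_STATUS_UC) ? MCE_UE : MCE_CE;
}

/*
 * Steps 2a/2b for errors that hit Xen or Dom0.
 * try_self_heal is a hypothetical hook; it returns true on success.
 */
static enum mce_action mce_decide_xen(uint64_t status, bool dom0_has_handler,
                                      bool (*try_self_heal)(uint64_t))
{
    switch (mce_classify(status)) {
    case MCE_CE:
        /* 2a: report to Dom0 for statistics only. */
        return dom0_has_handler ? ACT_NOTIFY_DOM0 : ACT_IGNORE;
    case MCE_UE:
        /* 2b: try to self-heal, panic on failure. */
        if (try_self_heal == NULL || !try_self_heal(status))
            return ACT_PANIC;
        return dom0_has_handler ? ACT_NOTIFY_DOM0 : ACT_IGNORE;
    default:
        return ACT_IGNORE;
    }
}

/* First implementation: self-healing always fails, so every UE panics. */
static bool self_heal_none(uint64_t status)
{
    (void)status;
    return false;
}
```

Passing self_heal_none gives the "just panic" behaviour of a first
implementation; a later version would only need to swap the callback.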

> >2c)  error == UE and UE impacts DomU:
> >      In case of Dom0 installed MCA event handler:
> >          Xen notifies Dom0 and Dom0 tells Xen whether
> >              to also notify DomU and/or does some operations
> >              on the DomU (case II)
> >       In case Dom0 did not install MCA event handler,
> >           Xen notifies DomU
> >3a) DomU is a PV guest:
> >       if DomU installed MCA event handler, it gets notified to perform
> >          self-healing
> >       if DomU did not install MCA event handler, notify Dom0 to do
> >          some operations on DomU (case II)
> >       if neither DomU nor Dom0 installed MCA event handlers,
> >          then Xen kills DomU
> >3b) DomU is a HVM guest:
> >       if DomU features a PV driver then behave as in 3a)
> What significance do pv drivers have here? Or do you mean a pv MCA
> driver?

Yes, I mean the pv MCA driver.

> >       if DomU enabled MCA/MCE via MSR, inject MCE into guest
> >       if DomU did not enable MCA/MCE via MSR, notify Dom0
> >            to do some operations on DomU (case II)
> >       if neither DomU enabled MCA/MCE nor Dom0 installed
> >            MCA event handler, Xen kills DomU
> Injecting an MCE to a hvm guest seems at least questionable. It can't
> really do anything about it (it doesn't even know the real topology of the
> system it's running on, so addresses stored in MSRs are meaningless -
> either you allow them to be read untranslated [in which case the guest
> cannot make sense of them] or you do translation for the guest [in which
> case it might make assumptions about co-locality of other nearby pages
> which will be wrong]).

Yes, Xen should do the translation for the guest. The assumptions must
be fixed then. I know that's easier said than done.
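
For reference, the DomU side of steps 3a) and 3b) could be sketched
roughly as follows. This only illustrates the order of the checks; the
struct, the enum values and domu_ue_action are hypothetical names, and
"mce_enabled_via_msr" stands in for whatever check Xen would really do
on the guest's MCA/MCE MSR state.

```c
#include <stdbool.h>

enum guest_type { GUEST_PV, GUEST_HVM };

/* Hypothetical per-DomU state, for illustration only. */
struct domu_mca_state {
    enum guest_type type;
    bool pv_mca_handler;      /* PV MCA event handler / pv MCA driver installed */
    bool mce_enabled_via_msr; /* HVM guest enabled MCA/MCE via MSR */
};

enum domu_action {
    DOMU_NOTIFY_PV,  /* notify the guest's PV MCA handler */
    DOMU_INJECT_MCE, /* inject an MCE into the HVM guest */
    DOMU_ASK_DOM0,   /* let Dom0 operate on the DomU (case II) */
    DOMU_KILL        /* nobody can handle it: Xen kills the DomU */
};

static enum domu_action domu_ue_action(const struct domu_mca_state *d,
                                       bool dom0_has_handler)
{
    /* 3a, and 3b when the HVM guest has a pv MCA driver. */
    if (d->pv_mca_handler)
        return DOMU_NOTIFY_PV;

    /* 3b: HVM guest that enabled MCA/MCE itself. */
    if (d->type == GUEST_HVM && d->mce_enabled_via_msr)
        return DOMU_INJECT_MCE;

    /* Fall back to Dom0 if it installed a handler, else kill the DomU. */
    return dom0_has_handler ? DOMU_ASK_DOM0 : DOMU_KILL;
}
```

The address translation discussed above would have to happen before the
DOMU_INJECT_MCE path delivers anything to the guest; that part is not
shown here.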

> Doing this to a pv domU for purely notification purposes (where the guest
> knows it's running virtualized) is clearly a different matter.

Yes, I agree with you here. The general idea behind informing a DomU
is to let its own fault management handle the error. It is always better to 
let it kill a screen saver process and keep the word processor running than
killing the whole guest. The DomU should crash itself if it thinks that's the

> >case II) - Xen receives Dom0 instructions via hypercall
> >
> >There are different reasons, why Xen should do something.
> >
> >   - Dom0 got enough CEs so that UEs are very likely to happen in order
> >      to "circumvent" UEs.
> >   - Possible operations on a DomU
> >        - save/restore DomU
> >        - (live-)migrate DomU to a different physical machine
> >        - etc.
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.

Yes, they are heavy-weight operations. Do you have any ideas about what
Dom0 can do?

The idea here is that Dom0's fault management helps guests
survive as well as possible.
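
As a rough illustration of the "Dom0 got enough CEs" trigger in case II,
Dom0's fault management could keep a counter per memory region or bank
and act once when it reaches a limit. This is only a sketch under
assumptions: CE_THRESHOLD, struct ce_tracker and ce_record are made-up
names, and the value 16 is arbitrary, not a recommendation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit; a real policy would be tunable by the admin. */
#define CE_THRESHOLD 16u

/* One tracker per monitored unit (e.g. a DIMM or MCA bank). */
struct ce_tracker {
    uint32_t count;
};

/*
 * Record one correctable error.
 * Returns true exactly once, on the CE that reaches the threshold,
 * so the caller can start a save/restore or migration a single time.
 */
static bool ce_record(struct ce_tracker *t)
{
    if (t->count < UINT32_MAX)
        t->count++; /* saturate instead of wrapping around */
    return t->count == CE_THRESHOLD;
}
```

When ce_record() returns true, Dom0 would issue the hypercall asking Xen
to operate on the affected DomU.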


AMD Saxony, Dresden Germany
Operating System Research Center

