This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] RFC: MCA/MCE concept

To: "Christoph Egger" <Christoph.Egger@xxxxxxx>
Subject: Re: [Xen-devel] RFC: MCA/MCE concept
From: "Jan Beulich" <jbeulich@xxxxxxxxxx>
Date: Wed, 30 May 2007 09:19:12 +0200
Cc: xen-devel@xxxxxxxxxxxxxxxxxxx
Delivery-date: Wed, 30 May 2007 00:16:34 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: <200705291732.46709.Christoph.Egger@xxxxxxx>
References: <200705291732.46709.Christoph.Egger@xxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
>case I) - Xen receives an MCE from the CPU
>1) Xen MCE handler figures out if error is a correctable error (CE)
>    or uncorrectable error (UE)
>2a) error == CE:
>     Xen notifies Dom0 if Dom0 installed an MCA event handler
>     for statistical purpose
>2b) error == UE and UE impacts Xen or Dom0:

A very important aspect here is how you want to classify the impact of an
uncorrectable error. Generally, I can see very few situations where you
could confine the impact to a sub-portion of the system (i.e. a single domU,
dom0, or Xen). The general rule, in my opinion, must be to halt the system;
the only question is how likely it is that you can get a meaningful message
out (to screen, serial, or logs) that can help analyze the problem afterwards.
If it is somewhat likely, then dom0 should be involved; otherwise Xen should
just shut down the system.
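
As a rough illustration of that rule, a minimal C sketch follows. It is not
Xen code: the names `ue_action` and `choose_ue_action` are hypothetical, and
the assumption that dom0 is only involved when it has registered an MCA event
handler is borrowed from the quoted proposal rather than stated here.

```c
#include <stdbool.h>

/* Hypothetical outcomes for an uncorrectable error (UE). */
enum ue_action {
    UE_INVOLVE_DOM0_THEN_HALT, /* let dom0 try to log a message, then halt */
    UE_XEN_SHUTDOWN            /* Xen shuts the system down on its own */
};

/*
 * Sketch of the rule above: a UE always ends with the system going down.
 * The only decision is whether dom0 gets a chance to report it first.
 * Both inputs are assumptions made for illustration.
 */
enum ue_action choose_ue_action(bool dom0_handler_installed,
                                bool message_likely)
{
    if (dom0_handler_installed && message_likely)
        return UE_INVOLVE_DOM0_THEN_HALT;
    return UE_XEN_SHUTDOWN;
}
```

Note that neither branch attempts to confine the error to a single domain;
how `message_likely` would actually be estimated is left open.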

>     Xen does some self-healing
>         and notifies Dom0 on success if Dom0 installed MCA event handler
>         or Xen panics on failure
>2c)  error == UE and UE impacts DomU:
>      In case of Dom0 installed MCA event handler:
>          Xen notifies Dom0 and Dom0 tells Xen whether
>              to also notify DomU and/or does some operations
>              on the DomU (case II)
>       In case Dom0 did not install MCA event handler,
>           Xen notifies DomU
>3a) DomU is a PV guest:
>       if DomU installed MCA event handler, it gets notified to perform
>          self-healing
>       if DomU did not install MCA event handler, notify Dom0 to do
>          some operations on DomU (case II)
>       if neither DomU nor Dom0 installed an MCA event handler,
>          then Xen kills DomU
>3b) DomU is a HVM guest:
>       if DomU features a PV driver then behave as in 3a)

What significance do pv drivers have here? Or do you mean a pv MCA

>       if DomU enabled MCA/MCE via MSR, inject MCE into guest
>       if DomU did not enable MCA/MCE via MSR, notify Dom0
>            to do some operations on DomU (case II)
>       if neither DomU enabled MCA/MCE nor Dom0 installed an
>            MCA event handler, Xen kills DomU

Injecting an MCE to a hvm guest seems at least questionable. It can't really
do anything about it (it doesn't even know the real topology of the system
it's running on, so addresses stored in MSRs are meaningless - either you
allow them to be read untranslated [in which case the guest cannot make
sense of them] or you do translation for the guest [in which case it might
make assumptions about co-locality of other nearby pages which will be
Doing this to a pv domU for purely notification purposes (where the guest
knows it's running virtualized) is clearly a different matter.
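
For reference, step 3 of the quoted proposal could be written roughly as the
decision function below. This is a sketch of the proposal as quoted, not of
any existing Xen interface, and it does not address the objections above; the
struct, enum, and function names are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical actions Xen could take for a UE that impacts a DomU. */
enum domu_ue_action {
    DOMU_NOTIFY_GUEST, /* PV-style notification to the guest's MCA handler */
    DOMU_INJECT_MCE,   /* inject an MCE into an HVM guest */
    DOMU_NOTIFY_DOM0,  /* let Dom0 operate on the DomU (case II) */
    DOMU_KILL          /* nobody can handle it: Xen kills the DomU */
};

/* Hypothetical per-guest state relevant to steps 3a) and 3b). */
struct domu_mca_state {
    bool is_hvm;                /* false: PV guest, true: HVM guest */
    bool has_pv_driver;         /* HVM guest with a PV driver */
    bool mca_handler_installed; /* guest registered an MCA event handler */
    bool mce_enabled_via_msr;   /* HVM guest enabled MCA/MCE via MSR */
};

enum domu_ue_action proposed_domu_action(const struct domu_mca_state *d,
                                         bool dom0_handler_installed)
{
    /* 3b) an HVM guest with a PV driver is treated like a PV guest (3a). */
    bool pv_path = !d->is_hvm || d->has_pv_driver;

    if (pv_path && d->mca_handler_installed)
        return DOMU_NOTIFY_GUEST;
    if (!pv_path && d->mce_enabled_via_msr)
        return DOMU_INJECT_MCE;
    if (dom0_handler_installed)
        return DOMU_NOTIFY_DOM0;
    return DOMU_KILL;
}
```

Written this way, the `DOMU_INJECT_MCE` branch is the one questioned above,
while `DOMU_NOTIFY_GUEST` corresponds to the pure notification case.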

>case II) - Xen receives Dom0 instructions via hypercall
>There are different reasons why Xen should do something.
>   - Dom0 got enough CEs so that UEs are very likely to happen in order
>      to "circumvent" UEs.
>   - Possible operations on a DomU
>        - save/restore DomU
>        - (live-)migrate DomU to a different physical machine
>        - etc.

Very heavy-weight operations, which I think are unlikely to succeed if
you already suspect the system's going to suffer a UE soon.

