

[Xen-devel] RFC: MCA/MCE concept

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] RFC: MCA/MCE concept
From: "Christoph Egger" <Christoph.Egger@xxxxxxx>
Date: Tue, 29 May 2007 17:32:46 +0200
Organization: AMD / OSRC

The current MCA/MCE support in Xen is minimal: Xen dumps the error and panics.

In the concept I propose here, there are two situations in which Xen has to
react:
I) Xen receives an MCE from the CPU, and
II) Xen receives instructions from Dom0 via hypercall.

The term "self-healing" below is used in the sense of using the most appropriate
technique(s) to handle an error, such as MPR,
online-spare RAM, or killing/restarting of impacted processes,
to prevent crashes of whole guests or the whole machine.

case I) - Xen receives an MCE from the CPU

1) Xen MCE handler figures out whether the error is a correctable error (CE)
    or an uncorrectable error (UE)
2a) error == CE:
     Xen notifies Dom0 if Dom0 installed an MCA event handler
     for statistical purposes
2b) error == UE and UE impacts Xen or Dom0:
     Xen does some self-healing
         and notifies Dom0 on success if Dom0 installed MCA event handler
         or Xen panics on failure
2c) error == UE and UE impacts DomU:
     In case Dom0 installed an MCA event handler:
         Xen notifies Dom0 and Dom0 tells Xen whether
             to also notify DomU and/or to do some operations
             on the DomU (case II)
     In case Dom0 did not install an MCA event handler,
         Xen notifies DomU
3a) DomU is a PV guest:
       if DomU installed an MCA event handler, it gets notified to perform self-healing
       if DomU did not install MCA event handler, notify Dom0 to do
          some operations on DomU (case II)
       if neither DomU nor Dom0 installed an MCA event handler,
          then Xen kills DomU
3b) DomU is a HVM guest:
       if DomU features a PV driver then behave as in 3a)
       if DomU enabled MCA/MCE via MSR, inject MCE into guest
       if DomU did not enable MCA/MCE via MSR, notify Dom0
            to do some operations on DomU (case II)
       if DomU did not enable MCA/MCE and Dom0 did not install
            an MCA event handler, Xen kills DomU
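
To make the flow of case I easier to follow, here is a minimal sketch of the
decision logic in C. It is only one possible reading of steps 1) to 3b), not
existing Xen code. All type, field and function names (mce_ctx,
mce_first_action, mce_domu_action, the ACT_* values) are hypothetical, and
ACT_NONE for a CE without a Dom0 handler is my assumption, since the list
does not say what happens in that case.

```c
#include <stdbool.h>

/* Hypothetical names; they only mirror steps 1) to 3b) above. */
enum mce_severity { MCE_CE, MCE_UE };
enum mce_target   { TARGET_XEN, TARGET_DOM0, TARGET_DOMU };
enum guest_kind   { GUEST_PV, GUEST_HVM };

enum mce_action {
    ACT_NONE,         /* CE and Dom0 has no handler (assumption) */
    ACT_NOTIFY_DOM0,  /* Dom0 gets an event; for DomU errors it decides (case II) */
    ACT_SELF_HEAL,    /* Xen heals, then notifies Dom0 or panics on failure */
    ACT_NOTIFY_DOMU,  /* event to the DomU's MCA event handler */
    ACT_INJECT_MCE,   /* HVM guest that enabled MCA/MCE via MSR */
    ACT_KILL_DOMU
};

struct mce_ctx {
    enum mce_severity severity;
    enum mce_target   target;
    enum guest_kind   kind;
    bool dom0_handler;  /* Dom0 installed an MCA event handler */
    bool domu_handler;  /* PV guest, or HVM guest with a PV driver, has one */
    bool domu_mce_msr;  /* HVM guest enabled MCA/MCE via MSR */
};

/* Steps 3a/3b: what to do once the error is to be passed to the DomU. */
enum mce_action mce_domu_action(const struct mce_ctx *c)
{
    if (c->domu_handler)
        return ACT_NOTIFY_DOMU;
    if (c->kind == GUEST_HVM && c->domu_mce_msr)
        return ACT_INJECT_MCE;
    if (c->dom0_handler)
        return ACT_NOTIFY_DOM0;   /* Dom0 operates on the DomU (case II) */
    return ACT_KILL_DOMU;
}

/* Steps 1 to 2c: first reaction of Xen after classifying the error. */
enum mce_action mce_first_action(const struct mce_ctx *c)
{
    if (c->severity == MCE_CE)                       /* 2a */
        return c->dom0_handler ? ACT_NOTIFY_DOM0 : ACT_NONE;
    if (c->target != TARGET_DOMU)                    /* 2b */
        return ACT_SELF_HEAL;
    if (c->dom0_handler)                             /* 2c, Dom0 decides */
        return ACT_NOTIFY_DOM0;
    return mce_domu_action(c);                       /* 2c, no Dom0 handler */
}
```

For example, a UE that hits an HVM DomU which enabled MCE via MSR, on a system
where Dom0 installed no handler, yields ACT_INJECT_MCE; the same error with no
handler anywhere yields ACT_KILL_DOMU.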

case II) - Xen receives Dom0 instructions via Hypercall

There are different reasons why Xen should do something.

   - Dom0 has received enough CEs that UEs are very likely to happen, and it acts
      in order to "circumvent" the UEs.
   - Possible operations on a DomU
        - save/restore DomU
        - (live-)migrate DomU to a different physical machine
        - etc.
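
How Dom0 decides that it "got enough CEs" is a policy question that this
concept leaves open. Purely as an illustration, a Dom0-side policy could count
CEs within a time window and trigger an operation from the list above once a
threshold is reached. The sketch below assumes exactly that; ce_tracker,
ce_record, CE_THRESHOLD and CE_WINDOW_SEC are hypothetical names and values,
not part of any existing interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define CE_THRESHOLD  3u       /* hypothetical: CEs per window that trigger action */
#define CE_WINDOW_SEC 86400u   /* hypothetical: window length in seconds */

/* Per-resource CE counter (e.g. one per memory page or DIMM). */
struct ce_tracker {
    uint64_t     window_start;  /* time of the first CE in the current window */
    unsigned int count;         /* CEs seen in the current window */
};

/*
 * Record one CE reported at time now_sec.
 * Returns true when the threshold is reached, i.e. when Dom0 should
 * instruct Xen to act before a UE happens.
 */
bool ce_record(struct ce_tracker *t, uint64_t now_sec)
{
    /* Start a new window on the first CE or when the old window expired. */
    if (t->count == 0 || now_sec - t->window_start >= CE_WINDOW_SEC) {
        t->window_start = now_sec;
        t->count = 0;
    }
    t->count++;
    return t->count >= CE_THRESHOLD;
}
```

With these values, three CEs on the same resource within one day make
ce_record() return true, while a CE that arrives after the window has expired
starts counting from one again.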

Some details


When an MCE occurs, all the stuff above should NOT happen within the
handler, because when an MCE happens within the MCE handler, the CPU
enters shutdown state. So the mail topic "NMI deferal on i386" may be related.
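
One common way to keep the handler short is to let it only copy the error
telemetry into a small buffer and to do everything else later from a safe
context. The following is a minimal sketch of that idea, assuming one
producer (the MCE handler on one CPU) and one consumer (the deferred
context). The names mce_record, mce_log, mce_log_push and mce_log_pop and the
buffer size are hypothetical; how the deferred context gets scheduled is not
shown.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MCE_LOG_SLOTS 8u   /* hypothetical size, must be a power of two */

/* Raw telemetry copied out of the MCA bank registers. */
struct mce_record {
    uint64_t     status;
    uint64_t     addr;
    unsigned int bank;
};

struct mce_log {
    struct mce_record slot[MCE_LOG_SLOTS];
    atomic_uint head;   /* next slot to write, only advanced by the handler */
    atomic_uint tail;   /* next slot to read, only advanced by the consumer */
};

/* Called from the MCE handler: copy only, no notification, no healing. */
bool mce_log_push(struct mce_log *log, const struct mce_record *rec)
{
    unsigned int head = atomic_load_explicit(&log->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&log->tail, memory_order_acquire);

    if (head - tail == MCE_LOG_SLOTS)
        return false;                       /* buffer full, record not stored */
    log->slot[head % MCE_LOG_SLOTS] = *rec;
    atomic_store_explicit(&log->head, head + 1, memory_order_release);
    return true;
}

/* Called later, outside the handler, to process the stored records. */
bool mce_log_pop(struct mce_log *log, struct mce_record *out)
{
    unsigned int tail = atomic_load_explicit(&log->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&log->head, memory_order_acquire);

    if (head == tail)
        return false;                       /* nothing pending */
    *out = log->slot[tail % MCE_LOG_SLOTS];
    atomic_store_explicit(&log->tail, tail + 1, memory_order_release);
    return true;
}
```

The handler would call mce_log_push() and return; the deferred context would
loop over mce_log_pop() and run the notification and self-healing steps for
each record.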

Notifying guests

Above I am talking about an MCA event handler. What I actually mean is a way to
inform the guest that something happened.
I chose the term "MCA event handler" because I think using the event
mechanism fits best for this purpose.

HVM guests with no "MCA PV driver" can enable/disable certain types
of errors. They can even control whether they want to get an exception or do polling.
I would prefer to always inject exceptions into the HVM guest. A HVM guest
can't prevent that it always sees exceptions, but I do not know whether guests behave
correctly when they assume that they get all or certain errors via polling.

Guests which already feature fault management to a certain level when
running non-virtualized can easily re-use this capability to decode the
error telemetry and handle the error in the virtualized case.
Thus forwarding/injecting the error into a guest will only require the 
translation of the physical/virtual address reported by the HW into
guest physical/guest virtual addresses. The error code itself needs no translation.
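
As an illustration of the address translation step, here is a minimal sketch
that rewrites a machine address reported by the HW into a guest physical
address. It assumes 4 KiB pages and a simple per-guest table of
(machine frame, guest frame) pairs. The names frame_map and
machine_to_guest_phys are hypothetical, and a real implementation would use
the hypervisor's own frame mapping structures instead of a linear search.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* One entry of a hypothetical per-guest machine-to-guest frame table. */
struct frame_map {
    uint64_t mfn;   /* machine frame number */
    uint64_t gfn;   /* guest physical frame number */
};

/*
 * Translate the machine address maddr into a guest physical address.
 * Returns false if the frame does not belong to this guest, in which
 * case the error must not be forwarded to it.
 */
bool machine_to_guest_phys(const struct frame_map *map, size_t n,
                           uint64_t maddr, uint64_t *gaddr)
{
    uint64_t mfn    = maddr >> PAGE_SHIFT;
    uint64_t offset = maddr & ((UINT64_C(1) << PAGE_SHIFT) - 1);

    for (size_t i = 0; i < n; i++) {
        if (map[i].mfn == mfn) {
            /* Same offset within the page, different frame number. */
            *gaddr = (map[i].gfn << PAGE_SHIFT) | offset;
            return true;
        }
    }
    return false;
}
```

Only the address field of the error record is rewritten this way before the
error is forwarded or injected; the offset within the page stays the same.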


IMO, only Xen should use the HW features such as online-spare RAM, which
was introduced in AMD K8 RevF. The HW features should never be visible
to any DomUs in order to reduce complexity in Xen. Software-only techniques
such as MPR are ok in all guests. Only Dom0 can tell Xen to do something
using HW features.


AMD Saxony, Dresden Germany
Operating System Research Center

Legal Information:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address):
   Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent:
   AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC:
   Dr. Hans-R. Deppe, Thomas McCoy
