This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] RFC: MCA/MCE concept

>>> "Christoph Egger" <Christoph.Egger@xxxxxxx> 30.05.07 09:45 >>>
>On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
>> >case I) - Xen receives an MCE from the CPU
>> >
>> >1) Xen MCE handler figures out whether the error is a correctable error (CE)
>> >    or an uncorrectable error (UE)
>> >2a) error == CE:
>> >     Xen notifies Dom0 if Dom0 installed an MCA event handler
>> >     for statistical purposes
>> >2b) error == UE and UE impacts Xen or Dom0:
>> A very important aspect here is how you want to classify what impact an
>> uncorrectable error has - generally, I can see very few situations where you
>> could confine the impact to a sub-portion of the system (i.e. a single
>> domU, dom0, or Xen). The general rule in my opinion must be to halt the
>> system, the question just is how likely it is that you can get a meaningful
>> message out (to screen, serial, or logs) that can help analyze the problem
>> afterwards. If it is somewhat likely, then dom0 should be involved,
>> otherwise Xen should just shut down the system.
>Here you can best help out using HW features to handle errors.
>AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.
>CPUs such as the SPARC feature Data Poisoning. That would be the
>most handy technique that can be used here.

But that assumes the error is recoverable (i.e. no other data got
corrupted). You still haven't clarified how you intend to determine the
impact an uncorrectable error had.
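
To make the open question concrete, here is a minimal sketch of the
top-level triage as discussed so far: step 2a) of the proposal for
correctable errors, and the "halt unless a useful message can get out"
rule for uncorrectable ones. It is written in Python purely for brevity.
Every name in it (`Severity`, `Action`, `triage`, `report_likely`) is made
up for illustration and is not a Xen interface. In particular,
`report_likely` is simply passed in, because how to determine it is exactly
what has not been clarified. What happens to a CE when Dom0 has no handler
is not stated in the proposal, so `NO_NOTIFICATION` is an assumption.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "CE"
    UNCORRECTABLE = "UE"


class Action(Enum):
    NOTIFY_DOM0_STATS = "notify dom0 for statistics"
    NO_NOTIFICATION = "nothing to notify"        # assumption, not in the proposal
    INVOLVE_DOM0_THEN_HALT = "let dom0 report, then halt"
    HALT = "Xen shuts the system down"


def triage(severity, dom0_handler_installed, report_likely):
    """Pick an action for a machine check seen by Xen (illustrative only)."""
    if severity is Severity.CORRECTABLE:
        # Step 2a): Dom0 is told only if it registered an MCA event handler.
        if dom0_handler_installed:
            return Action.NOTIFY_DOM0_STATS
        return Action.NO_NOTIFICATION
    # Uncorrectable: the default is to halt. Dom0 gets involved only when
    # there is a reasonable chance of getting a meaningful message out.
    if report_likely:
        return Action.INVOLVE_DOM0_THEN_HALT
    return Action.HALT
```

Note that no branch of the uncorrectable path lets the system continue:
confining the damage to one domain would need an input that nobody has
described yet.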

>> >3a) DomU is a PV guest:
>> >       if DomU installed MCA event handler, it gets notified to perform
>> >          self-healing
>> >       if DomU did not install MCA event handler, notify Dom0 to do
>> >          some operations on DomU (case II)
>> >       if neither DomU nor Dom0 installed an MCA event handler,
>> >          then Xen kills DomU
>> >3b) DomU is a HVM guest:
>> >       if DomU features a PV driver then behave as in 3a)
>> What significance do pv drivers have here? Or do you mean a pv MCA
>> driver?
>Yes, I mean the pv MCA driver.
>> >       if DomU enabled MCA/MCE via MSR, inject MCE into guest
>> >       if DomU did not enable MCA/MCE via MSR, notify Dom0
>> >            to do some operations on DomU (case II)
>> >       if DomU did not enable MCA/MCE and Dom0 did not install
>> >            an MCA event handler, Xen kills DomU
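
Read as a decision procedure, cases 3a) and 3b) above amount to roughly
the following. This is only a sketch in Python; `Guest`,
`dispatch_domu_error` and the returned strings are hypothetical names, not
Xen interfaces. The order in which an HVM guest's pv MCA driver and its
MSR setting are checked is an assumption taken from the order of the list.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    is_hvm: bool
    # PV guest: MCA event handler installed. HVM guest: pv MCA driver loaded.
    pv_mca_handler: bool = False
    # Only meaningful for HVM guests: MCA/MCE enabled through the MSRs.
    mce_enabled_via_msr: bool = False


def dispatch_domu_error(guest, dom0_handler_installed):
    """Decide who deals with an error that hit a DomU (illustrative only)."""
    if guest.pv_mca_handler:
        # 3a), and 3b) with a pv MCA driver: the guest heals itself.
        return "notify_domU"
    if guest.is_hvm and guest.mce_enabled_via_msr:
        # 3b): the guest asked for machine checks, so one is injected.
        return "inject_mce"
    if dom0_handler_installed:
        # Fall back to Dom0 acting on the DomU (case II).
        return "notify_dom0"
    # Nobody is able to handle the error.
    return "kill_domU"
```

For example, `dispatch_domu_error(Guest(is_hvm=True), False)` ends in
`"kill_domU"`, while the same guest with `mce_enabled_via_msr=True` gets an
injected MCE regardless of what Dom0 registered.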
>> Injecting an MCE to a hvm guest seems at least questionable. It can't
>> really do anything about it (it doesn't even know the real topology of the
>> system it's running on, so addresses stored in MSRs are meaningless -
>> either you allow them to be read untranslated [in which case the guest
>> cannot make sense of them] or you do translation for the guest [in which
>> case it might make assumptions about co-locality of other nearby pages
>> which will be wrong]).
>Yes, Xen should do the translation for the guest. The assumptions must
>be fixed then. I know that's easier said than done.

Exactly - you are proposing to fix all possible OSes, including sufficiently old
ones. That's impossible. And I can't even see why an OS intended to run on
native hardware would care to try to deal with virtualization aspects like this.
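
The co-locality problem can be shown with a toy example. The sketch below
is Python for brevity. The table `P2M` and its frame numbers are invented,
the helper names are hypothetical, and a 4 KiB page size is assumed. It
translates a faulting machine address into the guest's view and then checks
whether the machine frames next to the faulting one belong to the guest at
all.

```python
PAGE_SHIFT = 12                      # assumes 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Invented guest-frame -> machine-frame table for one guest. The guest
# frames are contiguous, the machine frames behind them are not.
P2M = {0x100: 0x7A310, 0x101: 0x01C44, 0x102: 0x5530F}


def machine_to_guest(maddr, p2m):
    """Translate a machine address to a guest address, or None if not owned."""
    m2p = {mfn: gfn for gfn, mfn in p2m.items()}
    gfn = m2p.get(maddr >> PAGE_SHIFT)
    if gfn is None:
        return None
    return (gfn << PAGE_SHIFT) | (maddr & PAGE_MASK)


def machine_neighbours_owned(maddr, p2m):
    """Report whether the machine frames below and above are the guest's."""
    mfn = maddr >> PAGE_SHIFT
    owned = set(p2m.values())
    return (mfn - 1 in owned, mfn + 1 in owned)


faulting = 0x01C44ABC                # machine address as reported by hardware
guest_view = machine_to_guest(faulting, P2M)        # 0x101abc
neighbours = machine_neighbours_owned(faulting, P2M)  # (False, False)
```

With the translated address the guest would conclude that guest frames
0x100 and 0x102 sit next to the faulty memory, while the machine frames
that really are adjacent do not belong to it at all. An OS written for
native hardware has no reason to expect that.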

