On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >case I) - Xen receives an MCE from the CPU
> >
> >1) Xen MCE handler figures out if the error is a correctable error (CE)
> > or an uncorrectable error (UE)
> >2a) error == CE:
> > Xen notifies Dom0 if Dom0 installed an MCA event handler
> > for statistical purposes
> >2b) error == UE and UE impacts Xen or Dom0:
>
> A very important aspect here is how you want to classify what impact an
> uncorrectable error has - generally, I can see very few situations where you
> could confine the impact to a sub-portion of the system (i.e. a single
> domU, dom0, or Xen). The general rule in my opinion must be to halt the
> system, the question just is how likely it is that you can get a meaningful
> message out (to screen, serial, or logs) that can help analyze the problem
> afterwards. If it is somewhat likely, then dom0 should be involved,
> otherwise Xen should just shut down the system.
Here, hardware features are the best help for handling errors.
AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.
CPUs such as the SPARC feature data poisoning, which would be the
handiest technique to use here.
Maybe this line:
> > Xen does some self-healing
should be this:
Xen *tries* to do some self-healing
> > and notifies Dom0 on success if Dom0 installed MCA event handler
> > or Xen panics on failure
The first implementation can just panic here. The self-healing will be
implemented and improved over time.
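
To make steps 1 to 2b concrete, the decision could look roughly like the
sketch below. It is only an illustration: every name in it (mce_decide, the
enums, the "healed" flag) is made up and does not exist in Xen, and deciding
what a UE actually impacts is the hard part that this skips.

```c
#include <stdbool.h>

/* All names here are hypothetical; nothing like this exists in Xen. */
enum mce_kind   { MCE_CORRECTABLE, MCE_UNCORRECTABLE };
enum mce_target { TARGET_XEN, TARGET_DOM0, TARGET_DOMU };

enum mce_action {
    ACT_NONE,         /* nobody registered for the event */
    ACT_NOTIFY_DOM0,  /* CE statistics, or self-healing succeeded */
    ACT_PANIC,        /* UE hit Xen or Dom0 and could not be healed */
    ACT_HANDLE_DOMU   /* UE confined to a DomU: continue with step 2c */
};

struct mce_event {
    enum mce_kind kind;
    enum mce_target target;
};

/*
 * Steps 1, 2a and 2b. 'healed' is the outcome of a self-healing attempt;
 * a first implementation would always pass false and therefore panic.
 */
enum mce_action mce_decide(const struct mce_event *ev,
                           bool dom0_has_handler, bool healed)
{
    if (ev->kind == MCE_CORRECTABLE)                    /* 2a */
        return dom0_has_handler ? ACT_NOTIFY_DOM0 : ACT_NONE;

    if (ev->target == TARGET_DOMU)                      /* 2c, not handled here */
        return ACT_HANDLE_DOMU;

    if (!healed)                                        /* 2b, failure */
        return ACT_PANIC;

    return dom0_has_handler ? ACT_NOTIFY_DOM0 : ACT_NONE; /* 2b, success */
}
```

Keeping the self-healing result as an input means the panic-only first
version and a later self-healing version share the same decision code.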
> >2c) error == UE and UE impacts DomU:
> > In case of Dom0 installed MCA event handler:
> > Xen notifies Dom0 and Dom0 tells Xen whether
> > to also notify DomU and/or does some operations
> > on the DomU (case II)
> > In case Dom0 did not install MCA event handler,
> > Xen notifies DomU
> >3a) DomU is a PV guest:
> > if DomU installed MCA event handler, it gets notified to perform
> > self-healing
> > if DomU did not install MCA event handler, notify Dom0 to do
> > some operations on DomU (case II)
> > if neither DomU nor Dom0 installed an MCA event handler,
> > then Xen kills DomU
> >3b) DomU is a HVM guest:
> > if DomU features a PV driver then behave as in 3a)
>
> What significance do pv drivers have here? Or do you mean a pv MCA
> driver?
Yes, I mean the pv MCA driver.
>
> > if DomU enabled MCA/MCE via MSR, inject MCE into guest
> > if DomU did not enable MCA/MCE via MSR, notify Dom0
> > to do some operations on DomU (case II)
> > if neither DomU enabled MCA/MCE nor Dom0 installed an
> > MCA event handler, Xen kills DomU
>
> Injecting an MCE to a hvm guest seems at least questionable. It can't
> really do anything about it (it doesn't even know the real topology of the
> system it's running on, so addresses stored in MSRs are meaningless -
> either you allow them to be read untranslated [in which case the guest
> cannot make sense of them] or you do translation for the guest [in which
> case it might make assumptions about co-locality of other nearby pages
> which will be wrong]).
Yes, Xen should do the translation for the guest. The assumptions must
be fixed then. I know that's easier said than done.
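
As a rough illustration of the translation step, here is a sketch in C. The
guest_map structure and maddr_to_gaddr() are made up for this mail and are
not Xen interfaces; the linear scan only stands in for whatever reverse
lookup a real implementation would use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_OFFSET_MASK ((UINT64_C(1) << SKETCH_PAGE_SHIFT) - 1)

/* Hypothetical per-guest table: index = guest frame, value = machine frame. */
struct guest_map {
    const uint64_t *gfn_to_mfn;
    size_t nr_pages;
};

/*
 * Translate a machine address (as reported by the hardware) into the
 * guest's physical address space. Returns false if the page does not
 * belong to this guest, in which case no address should be reported to it.
 */
bool maddr_to_gaddr(const struct guest_map *map, uint64_t maddr,
                    uint64_t *gaddr)
{
    uint64_t mfn = maddr >> SKETCH_PAGE_SHIFT;

    for (size_t gfn = 0; gfn < map->nr_pages; gfn++) {
        if (map->gfn_to_mfn[gfn] == mfn) {
            *gaddr = ((uint64_t)gfn << SKETCH_PAGE_SHIFT) |
                     (maddr & SKETCH_PAGE_OFFSET_MASK);
            return true;
        }
    }
    return false;
}
```

The table also shows the co-locality problem: guest frames 0 and 1 can map
to machine frames that are nowhere near each other, so the guest must not
conclude anything about neighbouring pages from a translated address.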
> Doing this to a pv domU for purely notification purposes (where the guest
> knows it's running virtualized) is clearly a different matter.
Yes, I agree with you here. The general idea behind informing a DomU
is to let its own fault management handle the error. It is always better to
let it kill a screen saver process and keep the word processor running than
to kill the whole guest. The DomU should crash itself if it thinks that's
best.
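
The per-guest part of the proposal (steps 3a and 3b) boils down to a small
decision function. Again this is only a sketch with invented names
(domu_decide, struct domu_state); it assumes Xen has already decided that
the DomU is the one to be notified.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum domu_action {
    DOMU_NOTIFY_PV,      /* deliver event to the guest's pv MCA handler */
    DOMU_INJECT_MCE,     /* HVM guest that enabled MCA/MCE itself */
    DOMU_DEFER_TO_DOM0,  /* let Dom0 operate on the guest (case II) */
    DOMU_KILL            /* nobody can handle it */
};

struct domu_state {
    bool is_hvm;
    bool has_pv_mca_handler; /* PV guest handler, or pv MCA driver in HVM */
    bool mce_enabled;        /* only meaningful for HVM guests */
};

enum domu_action domu_decide(const struct domu_state *d,
                             bool dom0_has_handler)
{
    if (d->has_pv_mca_handler)          /* 3a, and first rule of 3b */
        return DOMU_NOTIFY_PV;

    if (d->is_hvm && d->mce_enabled)    /* 3b: guest asked for MCEs */
        return DOMU_INJECT_MCE;

    /* Guest cannot help itself: fall back to Dom0, else kill the guest. */
    return dom0_has_handler ? DOMU_DEFER_TO_DOM0 : DOMU_KILL;
}
```

An HVM guest with a pv MCA driver takes the same path as a PV guest, which
is what "behave as in 3a)" is meant to say.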
> >case II) - Xen receives Dom0 instructions via Hypercall
> >
> >There are different reasons why Xen should do something.
> >
> > - Dom0 has seen enough CEs that UEs are very likely to follow,
> > and wants to "circumvent" those UEs.
> > - Possible operations on a DomU
> > - save/restore DomU
> > - (live-)migrate DomU to a different physical machine
> > - etc.
>
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.
Yes, they are heavy-weight operations. Do you have any ideas about what
Dom0 can do?
The idea here is that Dom0's fault management helps guests to
survive as well as possible.
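
For the "enough CEs" trigger, Dom0's fault management could use something
as simple as a counter over a time window. The sketch below is an
assumption on my side, not a proposal for concrete values: ce_counter,
ce_record() and the threshold and window length are all invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Dom0-side bookkeeping, e.g. one counter per DIMM or page. */
struct ce_counter {
    uint64_t window_start; /* seconds, start of the current window */
    uint64_t window_len;   /* seconds */
    unsigned count;        /* CEs seen in the current window */
    unsigned threshold;    /* CEs per window considered alarming */
};

/*
 * Record one correctable error at time 'now' (seconds).
 * Returns true when the threshold is reached, i.e. when Dom0 should
 * consider asking Xen to act before a UE happens.
 */
bool ce_record(struct ce_counter *c, uint64_t now)
{
    if (now - c->window_start >= c->window_len) {
        /* Old window expired: start counting afresh. */
        c->window_start = now;
        c->count = 0;
    }
    c->count++;
    return c->count >= c->threshold;
}
```

What Dom0 then asks Xen to do is the open question; the counter only
decides when to ask.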
Christoph
--
AMD Saxony, Dresden Germany
Operating System Research Center