This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] RFC: MCA/MCE concept


On 05/30/07 10:10, Christoph Egger wrote:


2b) error == UE and UE impacts Xen or Dom0:
A very important aspect here is how you want to classify what impact an
uncorrectable error has - generally, I can see very few situations where you
could confine the impact to a sub-portion of the system (i.e. a single
domU, dom0, or Xen). The general rule in my opinion must be to halt the
system; the question is just how likely it is that you can get a
meaningful message out (to screen, serial, or logs) that can help
analyze the problem afterwards. If it is somewhat likely, then dom0
should be involved, otherwise Xen should just shut down the system.
Here you can best help out using HW features to handle errors.
AMD CPUs have featured online-spare RAM and Chipkill since K8 RevF.

CPUs such as the SPARC feature data poisoning. That would be the
handiest technique to use here.
But that assumes the error is recoverable (i.e. no other data got
corrupted). You still didn't clarify how you intend to determine the
impact an uncorrectable error had.

I know. I am lacking a sudden inspiration here.
That's why I am discussing this here before writing code that goes nowhere.
Anyone here with a flash of genius? :-)

For a first phase I'd suggest that treating an uncorrectable error as
terminal to the entire system (e.g., panic the hypervisor or set up a
hardware reset mechanism such as Sync Flood) is practical and safe, and
allows us to concentrate on getting some more basic elements in place.
As Christoph says, we need some form of data poisoning supported on the
platform to really be able to isolate the impact of an uncorrectable
error.  In the absence of such support I think some fancy heuristics could
work in some limited cases (e.g., an uncorrectable memory error on a page
that only a domU has a mapping to and which is not shared with any other
domain, not even via a front/backend driver), but the penalty for bugs in
those heuristics is silent data corruption, which is the ultimate crime.
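
The first-phase policy described above could be sketched roughly as follows.
This is only an illustration: the type names, the dom0_can_log flag and the
function phase1_action() are hypothetical and are not taken from Xen. It
assumes the only decision for an uncorrectable error is whether dom0 gets a
chance to log before the system halts.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; none of these names exist in Xen. */
enum mce_severity { MCE_CORRECTED, MCE_UNCORRECTED };

enum mce_action {
    MCE_LOG_ONLY,             /* corrected error: record it and continue   */
    MCE_HALT_AFTER_DOM0_LOG,  /* let dom0 emit a message, then halt        */
    MCE_HALT_NOW              /* Xen shuts the system down immediately     */
};

struct mce_event {
    enum mce_severity severity;
    bool dom0_can_log;  /* assumption: some estimate that dom0 is still
                           able to get a message to screen/serial/logs */
};

/* Phase 1: every uncorrectable error is terminal to the whole system.
 * No attempt is made to confine it to a single domain. */
enum mce_action phase1_action(const struct mce_event *ev)
{
    if (ev->severity == MCE_CORRECTED)
        return MCE_LOG_ONLY;

    return ev->dom0_can_log ? MCE_HALT_AFTER_DOM0_LOG : MCE_HALT_NOW;
}
```

Note that there is deliberately no "isolate to one domU" branch here; adding
one is exactly the heuristic whose bugs would mean silent data corruption.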

3a) DomU is a PV guest:
      if DomU installed an MCA event handler, it gets notified to perform
      if DomU did not install an MCA event handler, notify Dom0 to do
         some operations on DomU (case II)
      if neither DomU nor Dom0 installed an MCA event handler,
         then Xen kills DomU
3b) DomU is a HVM guest:
      if DomU features a PV driver then behave as in 3a)
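
The notification order proposed in 3a)/3b) can be written down as a small
routing function. This is a sketch under assumptions: struct guest,
route_domu_error() and the enum names are made up for illustration, and the
case of an HVM guest without a PV driver is returned as "unspecified" because
the proposal as quoted here does not say what happens then.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only. */
enum guest_type { GUEST_PV, GUEST_HVM };

enum mca_route {
    ROUTE_NOTIFY_DOMU,   /* DomU's own MCA event handler is invoked        */
    ROUTE_NOTIFY_DOM0,   /* Dom0 performs operations on DomU's behalf      */
    ROUTE_KILL_DOMU,     /* nobody can handle it: Xen kills the DomU       */
    ROUTE_UNSPECIFIED    /* not covered by the proposal as quoted          */
};

struct guest {
    enum guest_type type;
    bool has_pv_driver;    /* only meaningful for HVM guests (case 3b) */
    bool has_mca_handler;  /* DomU installed an MCA event handler      */
};

enum mca_route route_domu_error(const struct guest *g, bool dom0_has_handler)
{
    /* 3b) an HVM guest behaves like 3a) only if it features a PV driver. */
    if (g->type == GUEST_HVM && !g->has_pv_driver)
        return ROUTE_UNSPECIFIED;

    /* 3a) prefer DomU's handler, fall back to Dom0, else kill the DomU. */
    if (g->has_mca_handler)
        return ROUTE_NOTIFY_DOMU;
    if (dom0_has_handler)
        return ROUTE_NOTIFY_DOM0;
    return ROUTE_KILL_DOMU;
}
```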
What significance do pv drivers have here? Or do you mean a pv MCA

My feeling is that the hypervisor and dom0 own the hardware and as such
all hardware fault management should reside there.  So we should never
deliver any form of #MC to a domU, nor should a poll of MCA state from
a domU ever observe valid state (e.g., make the RDMSR return 0).
So all handling, logging and diagnosis as well as hardware response actions
(such as to deploy an online spare chip-select) are controlled
in the hypervisor/dom0 combination.  That seems a consistent model - e.g.,
if a domU is migrated to another system it should not carry the
diagnosis state of the original system across, since that belongs with
the one domain that cannot migrate.
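
To make the "RDMSR returns 0" idea concrete, here is a minimal sketch of a
filter for MSR reads coming from a domU. The MSR numbers are the
architectural x86 machine-check registers; everything else (the function
names, the MCA_MAX_BANKS limit, the way the filter is hooked in) is an
assumption for illustration and not existing Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 machine-check MSRs. */
#define MSR_IA32_MCG_CAP     0x179u
#define MSR_IA32_MCG_STATUS  0x17Au
#define MSR_IA32_MCG_CTL     0x17Bu
#define MSR_IA32_MC0_CTL     0x400u  /* bank i: CTL/STATUS/ADDR/MISC at 0x400 + 4*i .. +3 */

/* Assumption: upper bound on banks hidden from guests, chosen for this sketch. */
#define MCA_MAX_BANKS        32u

static bool is_mca_msr(uint32_t msr)
{
    if (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL)
        return true;
    return msr >= MSR_IA32_MC0_CTL &&
           msr <  MSR_IA32_MC0_CTL + 4u * MCA_MAX_BANKS;
}

/* Hypothetical hook for a domU RDMSR. Returns true if the read was
 * handled here; *val is then 0, so the guest sees no banks in MCG_CAP
 * and no valid bit in any MCi_STATUS. Returns false for other MSRs,
 * leaving *val untouched for the normal path. */
bool domu_rdmsr_filter(uint32_t msr, uint64_t *val)
{
    if (!is_mca_msr(msr))
        return false;
    *val = 0;
    return true;
}
```

A guest polling MC0_STATUS (0x401) through this filter would read 0 and so
never observe a valid error record.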

But that is not to say that (I think at a future phase) domU should not
participate in a higher-level fault management function, at the direction
of the hypervisor/dom0 combo.  For example if/when we can isolate an
uncorrectable error to a single domU we could forward such an event to
the affected domU if it has registered its ability/interest in such
events.  These won't be in the form of a faked #MC or anything;
instead they'd be some form of synchronous trap experienced when next
the affected domU context resumes on CPU.  The intelligent domU handler
can then decide whether the domU must panic, whether it could simply
kill the affected process etc.  Those details are clearly sketchy, but the
idea is to up-level the communication to a domU to be more like
"you're broken" rather than "here's a machine-level hardware error for
you to interpret and decide what to do with".
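
A rough sketch of what such an up-levelled notification might look like is
below. All of it is hypothetical: the event structure, the registration
field, and the names post_fault()/resume_domu() are invented for
illustration. The outcome for a domU that has not registered a handler is
also an assumption made for this sketch.

```c
#include <stddef.h>

/* Hypothetical high-level event: "this part of you is broken",
 * not raw machine-check bank contents. */
enum fault_kind { FAULT_NONE, FAULT_MEMORY_LOST };

struct fault_event {
    enum fault_kind kind;
    unsigned long   guest_pfn;  /* guest page frame that is no longer usable */
};

enum domu_response {
    DOMU_CONTINUE,        /* nothing pending, or guest absorbed the fault */
    DOMU_KILL_PROCESS,    /* guest kills only the affected process        */
    DOMU_PANIC,           /* guest decides it cannot continue             */
    DOMU_KILLED_BY_HOST   /* assumption: no handler registered            */
};

typedef enum domu_response (*fault_handler_t)(const struct fault_event *);

struct domu {
    fault_handler_t    handler;  /* NULL unless the domU registered interest */
    struct fault_event pending;  /* delivered when the domU next resumes     */
};

/* Hypervisor/dom0 side: record the event; nothing is injected as #MC. */
void post_fault(struct domu *d, enum fault_kind kind, unsigned long pfn)
{
    d->pending.kind = kind;
    d->pending.guest_pfn = pfn;
}

/* Called when the domU context is next scheduled: deliver synchronously. */
enum domu_response resume_domu(struct domu *d)
{
    if (d->pending.kind == FAULT_NONE)
        return DOMU_CONTINUE;

    struct fault_event ev = d->pending;
    d->pending.kind = FAULT_NONE;

    if (d->handler == NULL)
        return DOMU_KILLED_BY_HOST;
    return d->handler(&ev);
}

/* Example guest policy (made up): pfns at or above 0x1000 belong to
 * user processes in this toy guest, anything below is kernel memory. */
enum domu_response example_guest_handler(const struct fault_event *ev)
{
    return ev->guest_pfn >= 0x1000ul ? DOMU_KILL_PROCESS : DOMU_PANIC;
}
```

The point of the design is that the guest-side handler only ever sees
guest-level concepts (a lost page), so nothing tied to the physical machine
needs to follow the domU when it migrates.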

