

RE: [Xen-devel] RFC: MCA/MCE concept

To: "Gavin Maltby" <Gavin.Maltby@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] RFC: MCA/MCE concept
From: "Petersson, Mats" <Mats.Petersson@xxxxxxx>
Date: Wed, 30 May 2007 17:03:55 +0200
In-reply-to: <465D812D.9040907@xxxxxxx>
> My feeling is that the hypervisor and dom0 own the hardware and as such
> all hardware fault management should reside there.  So we should never
> deliver any form of #MC to a domU, nor should a poll of MCA state from
> a domU ever observe valid state (e.g., make the RDMSR return 0).
> So all handling, logging and diagnosis as well as hardware response
> actions (such as to deploy an online spare chip-select) are controlled
> in the hypervisor/dom0 combination.  That seems a consistent model -
> e.g., if a domU is migrated to another system it should not carry the
> diagnosis state of the original system across etc, since that belongs
> with the one domain that cannot migrate.

I agree entirely with this. 
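
To make the RDMSR point concrete, here is a rough sketch of how a read of
MCA state from a domU could be masked. The MSR numbers are the
architectural ones (MCG_CAP/STATUS/CTL at 0x179-0x17b, bank registers
from 0x400, four per bank). Everything else is an assumption made for
illustration: the names is_mca_msr() and domu_rdmsr_filter(), the
32-bank upper bound, and the is_dom0 flag are hypothetical and are not
existing Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 machine-check MSR numbers. */
#define MSR_IA32_MCG_CAP    0x179u
#define MSR_IA32_MCG_STATUS 0x17au
#define MSR_IA32_MCG_CTL    0x17bu
#define MSR_IA32_MC0_CTL    0x400u  /* each bank has CTL, STATUS, ADDR, MISC */

/* Assumption for this sketch: treat up to 32 banks as MCA state. */
#define MCA_MAX_BANKS       32u

/* True if 'msr' is one of the global or per-bank machine-check MSRs. */
static bool is_mca_msr(uint32_t msr)
{
    if (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL)
        return true;
    return msr >= MSR_IA32_MC0_CTL &&
           msr < MSR_IA32_MC0_CTL + 4u * MCA_MAX_BANKS;
}

/*
 * Hypothetical RDMSR intercept hook. Returns true if the read was
 * handled here (the guest sees 0), false if the caller should fall
 * through to its normal MSR handling.
 */
static bool domu_rdmsr_filter(bool is_dom0, uint32_t msr, uint64_t *val)
{
    if (is_dom0 || !is_mca_msr(msr))
        return false;
    *val = 0;   /* a domU never observes valid MCA state */
    return true;
}
```

Returning 0 for MCG_CAP also means the domU sees zero banks, so a
well-behaved guest would not go on to poll the bank registers at all.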

> But that is not to say that (I think at a future phase) domU should not
> participate in a higher-level fault management function, at the
> direction of the hypervisor/dom0 combo.  For example if/when we can
> isolate an uncorrectable error to a single domU we could forward such
> an event to the affected domU if it has registered its ability/interest
> in such events.  These won't be in the form of a faked #MC or anything,
> instead they'd be some form of synchronous trap experienced when next
> the affected domU context resumes on CPU.  The intelligent domU handler
> can then decide whether the domU must panic, whether it could simply
> kill the affected process etc.  Those details are clearly sketchy, but
> the idea is to up-level the communication to a domU to be more like
> "you're broken" rather than "here's a machine-level hardware error for
> you to interpret and decide what to do with".

Yes, this makes much more sense than forwarding #MC, as the guest would
have a hard time doing anything useful with it. As far as I know, most
uncorrectable errors are all but fatal in most commercial
non-Enterprise OSes anyway - e.g. in Windows XP or Server 2K3, it
always ends in a blue-screen - which is hardly any better than the
guest being "humanely euthanized" by Dom0.

I take it this would be some sort of hypercall (available through the
regular PV-driver interface for HVM guests) to say "Let me know if I'm
broken - trap on vector X". 
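
Something along these lines is what I have in mind. This is only a
sketch of the bookkeeping, not a proposed interface: the struct and the
three functions (register_fault_vector, notify_guest_broken,
vector_to_inject_on_resume) are hypothetical names, and the real thing
would hang off the per-domain state and the hypercall table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-domain record of the guest's "I'm broken" trap. */
struct guest_fault_reg {
    uint8_t vector;   /* 0 means the guest has not registered */
    bool    pending;  /* set when an error was isolated to this guest */
};

/*
 * Guest side of the hypothetical hypercall: "trap on vector X".
 * Vectors 0-31 are reserved for CPU exceptions on x86, so refuse them.
 * Returns 0 on success, -1 on an invalid vector.
 */
static int register_fault_vector(struct guest_fault_reg *g, uint8_t vector)
{
    if (vector < 32)
        return -1;
    g->vector = vector;
    return 0;
}

/*
 * Hypervisor/dom0 side: an uncorrectable error was isolated to this
 * guest. Returns true if the guest asked to be told; false means the
 * guest never registered and dom0 has to decide its fate itself.
 */
static bool notify_guest_broken(struct guest_fault_reg *g)
{
    if (g->vector == 0)
        return false;
    g->pending = true;
    return true;
}

/*
 * Called when the guest context is next scheduled. Returns the vector
 * to inject, or -1 if nothing is pending. Clears the pending flag so
 * the trap is delivered once.
 */
static int vector_to_inject_on_resume(struct guest_fault_reg *g)
{
    if (!g->pending)
        return -1;
    g->pending = false;
    return g->vector;
}
```

The point of deferring delivery to the next resume is that the guest
sees a synchronous "you're broken" trap rather than anything resembling
a raw #MC.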

> Gavin