RE: [Xen-devel] RFC: MCA/MCE concept

To:	"Gavin Maltby" <Gavin.Maltby@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx
From:	"Petersson, Mats" <Mats.Petersson@xxxxxxx>
Date:	Wed, 30 May 2007 17:03:55 +0200

[snip]
> My feeling is that the hypervisor and dom0 own the hardware and as such
> all hardware fault management should reside there.  So we should never
> deliver any form of #MC to a domU, nor should a poll of MCA state from
> a domU ever observe valid state (e.g, make the RDMSR return 0).
> So all handling, logging and diagnosis as well as hardware response
> actions (such as to deploy an online spare chip-select) are controlled
> in the hypervisor/dom0 combination.  That seems a consistent model -
> e.g., if a domU is migrated to another system it should not carry the
> diagnosis state of the original system across etc, since that belongs
> with the one domain that cannot migrate.

I agree entirely with this. 
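
To make the "RDMSR returns 0" idea concrete, here is a rough sketch of
what such a filter could look like. The MSR numbers are the
architectural x86 MCA registers; everything else (the function names,
the bank limit, the way the domain id is passed in) is made up for
illustration and is not taken from the Xen source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 machine-check MSR numbers. */
#define MSR_IA32_MCG_CAP    0x179u
#define MSR_IA32_MCG_STATUS 0x17Au
#define MSR_IA32_MCG_CTL    0x17Bu
#define MSR_IA32_MC0_CTL    0x400u  /* bank i starts at 0x400 + 4*i */

/* Assumption: how many banks we choose to hide from guests. */
#define MCA_MAX_BANKS       32u

/* The bank count field in MCG_CAP is 8 bits wide. */
static_assert(MCA_MAX_BANKS <= 255u, "bank count must fit MCG_CAP.Count");

/* True if the MSR belongs to the machine-check register set. */
static bool is_mca_msr(uint32_t msr)
{
    if (msr >= MSR_IA32_MCG_CAP && msr <= MSR_IA32_MCG_CTL)
        return true;
    return msr >= MSR_IA32_MC0_CTL &&
           msr < MSR_IA32_MC0_CTL + 4u * MCA_MAX_BANKS;
}

/*
 * Hypothetical RDMSR intercept: hw_value is what the hardware would
 * have returned. Any domain other than dom0 (domid 0) reads zero for
 * MCA registers, so a polling guest never sees valid error state.
 */
static uint64_t guest_rdmsr(unsigned int domid, uint32_t msr,
                            uint64_t hw_value)
{
    if (domid != 0 && is_mca_msr(msr))
        return 0;
    return hw_value;
}
```

With this, a domU polling MCG_STATUS or any MCi_STATUS sees no VALID
bit set, while dom0 still gets the real contents.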

> 
> But that is not to say that (I think at a future phase) domU should not
> participate in a higher-level fault management function, at the
> direction of the hypervisor/dom0 combo.  For example if/when we can
> isolate an uncorrectable error to a single domU we could forward such
> an event to the affected domU if it has registered its ability/interest
> in such events.  These won't be in the form of a faked #MC or anything,
> instead they'd be some form of synchronous trap experienced when next
> the affected domU context resumes on CPU.  The intelligent domU handler
> can then decide whether the domU must panic, whether it could simply
> kill the affected process etc.  Those details are clearly sketchy, but
> the idea is to up-level the communication to a domU to be more like
> "you're broken" rather than "here's a machine-level hardware error for
> you to interpret and decide what to do with".

Yes, this makes much more sense than forwarding #MC, as the guest would
have a hard time doing anything really useful with it. As far as I know,
most uncorrectable errors are close to entirely fatal in most commercial
non-Enterprise OSes anyway - e.g. in Windows XP or Server 2K3, it always
ends in a blue-screen - which is hardly any better than the guest being
"humanely euthanized" by Dom0.

I take it this would be some sort of hypercall (available through the
regular PV-driver interface for HVM guests) to say "Let me know if I'm
broken - trap on vector X". 
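
Something along these lines is what I have in mind. This is only a toy
model of the bookkeeping, not a proposal for the actual interface: the
names (hc_register_fault_vector, mark_domain_broken,
pending_fault_vector), the fixed-size table and the return conventions
are all invented here. The only hard fact used is that x86 vectors 0-31
are reserved for exceptions, so a guest should not be allowed to pick
one of those.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DOMAINS 8u  /* hypothetical table size for this sketch */

struct fault_reg {
    bool    registered;  /* guest asked to be told when it is broken */
    uint8_t vector;      /* vector the guest wants the trap on */
    bool    pending;     /* a "you're broken" event awaits delivery */
};

static struct fault_reg regs[MAX_DOMAINS];

/*
 * Hypothetical hypercall: "let me know if I'm broken - trap on vector X".
 * Returns 0 on success, -1 on a bad domain or a reserved vector.
 */
static int hc_register_fault_vector(unsigned int domid, uint8_t vector)
{
    if (domid == 0 || domid >= MAX_DOMAINS || vector < 32)
        return -1;
    regs[domid].registered = true;
    regs[domid].vector = vector;
    return 0;
}

/*
 * Called by the hypervisor/dom0 side once an uncorrectable error has
 * been isolated to one domU. Returns true if the guest will be
 * notified, false if it never registered (so dom0 has to kill it).
 */
static bool mark_domain_broken(unsigned int domid)
{
    if (domid >= MAX_DOMAINS || !regs[domid].registered)
        return false;
    regs[domid].pending = true;
    return true;
}

/*
 * Called when the domU context is about to resume on a CPU. Returns
 * the vector to inject, or -1 if nothing is pending. The event is
 * consumed, so it is delivered once.
 */
static int pending_fault_vector(unsigned int domid)
{
    if (domid >= MAX_DOMAINS || !regs[domid].pending)
        return -1;
    assert(regs[domid].registered);
    regs[domid].pending = false;
    return regs[domid].vector;
}
```

A guest that never registers simply falls back to being destroyed by
Dom0, which matches what happens today.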

--
Mats
> 
> Gavin


