

Re: Re: [Xen-devel] RFC: MCA/MCE concept

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: Re: [Xen-devel] RFC: MCA/MCE concept
From: Gavin Maltby <Gavin.Maltby@xxxxxxx>
Date: Wed, 30 May 2007 15:00:26 +0100
In-reply-to: <200705301310.18574.Christoph.Egger@xxxxxxx>
References: <200705301310.18574.Christoph.Egger@xxxxxxx>

Apologies for the screwy quoting below - I did not receive the first half of
the thread, so it was forwarded to me.

  - Dom0 got enough CEs so that UEs are very likely to happen in order
     to "circumvent" UEs.

The greatest rewards here are in syndrome/row/column/bank analysis of the
error stream.  Where something like a bad pin produces tonnes of CEs, they
are always on the same bit, and your chance of a UE is that of a random
radiation-type CE colliding within the set of ECC checkwords being undermined
by that pin - not very high.  On the other hand, if we're seeing repeated
distinct syndromes from the same chip-select (or chip-select in a pair)
then there is a good chance they could collide "soon" - our data shows that
this combination predicts a UE within hours to a few days.  If you have
row/column/bank decoding you can also perform further analysis of the
error source and assess the chances of a collision that would produce a UE.
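
To make the distinction above concrete, here is a rough sketch of how a
consumer of the CE stream might separate "many CEs, one syndrome" from
"several distinct syndromes on one chip-select".  Everything in it is an
assumption made for illustration: the names (ce_tracker_record and friends),
the 16-bit syndrome width, the table sizes and the threshold of two distinct
syndromes are all hypothetical and are not taken from Xen or from any real
MCA driver.  It also ignores chip-select pairing, time windows and
row/column/bank decoding, all of which a real predictor would want.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All sizes and thresholds below are illustrative assumptions. */
#define MAX_CHIP_SELECTS   8   /* chip-selects tracked */
#define MAX_SYNDROMES      4   /* distinct syndromes remembered per chip-select */
#define DISTINCT_THRESHOLD 2   /* distinct syndromes needed to flag UE risk */

enum ce_risk {
    CE_RISK_NONE,       /* no CEs seen on this chip-select */
    CE_RISK_LOW,        /* CEs seen, but all share one syndrome (e.g. bad pin) */
    CE_RISK_UE_LIKELY   /* several distinct syndromes: collision is plausible */
};

struct cs_history {
    uint16_t      syndromes[MAX_SYNDROMES]; /* distinct syndromes seen so far */
    unsigned int  nr_distinct;              /* valid entries in syndromes[] */
    unsigned long nr_ces;                   /* total CEs on this chip-select */
};

struct ce_tracker {
    struct cs_history cs[MAX_CHIP_SELECTS];
};

void ce_tracker_init(struct ce_tracker *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/* Record one correctable error reported for the given chip-select. */
void ce_tracker_record(struct ce_tracker *t, unsigned int chip_select,
                       uint16_t syndrome)
{
    struct cs_history *h;
    unsigned int i;

    assert(t != NULL && chip_select < MAX_CHIP_SELECTS);
    h = &t->cs[chip_select];
    h->nr_ces++;

    /* A syndrome we have already seen adds no new information. */
    for (i = 0; i < h->nr_distinct; i++)
        if (h->syndromes[i] == syndrome)
            return;

    /* Remember a new syndrome; once the table is full, extras are dropped. */
    if (h->nr_distinct < MAX_SYNDROMES)
        h->syndromes[h->nr_distinct++] = syndrome;
}

/* Classify a chip-select by how many distinct syndromes it has produced. */
enum ce_risk ce_tracker_assess(const struct ce_tracker *t,
                               unsigned int chip_select)
{
    const struct cs_history *h;

    assert(t != NULL && chip_select < MAX_CHIP_SELECTS);
    h = &t->cs[chip_select];

    if (h->nr_ces == 0)
        return CE_RISK_NONE;
    if (h->nr_distinct >= DISTINCT_THRESHOLD)
        return CE_RISK_UE_LIKELY;
    return CE_RISK_LOW;
}
```

With this sketch, a thousand CEs carrying the same syndrome on one
chip-select still assess as CE_RISK_LOW, while just two CEs with different
syndromes on another chip-select assess as CE_RISK_UE_LIKELY.  The raw CE
count deliberately plays no part in the classification.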

That example has DIMM memory in mind, but similar approaches apply to
cache memory where it is ECC-protected, and so on.

  - Possible operations on a DomU
       - save/restore DomU
       - (live-)migrate DomU to a different physical machine
       - etc.
Very heavy-weight operations, which I think are unlikely to succeed if
you already suspect the system's going to suffer a UE soon.

As above, some predictors can give you hours to a few days warning of a UE.

