xen-devel
Re: Re: [Xen-devel] RFC: MCA/MCE concept
Hi,
Apologies for the screwy quoting below - I did not receive the first half of
this thread so it's been forwarded to me.
> - Dom0 got enough CEs so that UEs are very likely to happen in order
> to "circumvent" UEs.
The greatest rewards here are in syndrome/row/column/bank analysis of the
error stream. Where something like a bad pin produces tonnes of CEs,
they are always on the same bit, and your chance of a UE is that of a random
radiation-type CE colliding within the set of ECC checkwords being undermined
by that pin - not very high. On the other hand, if we're seeing repeated
distinct syndromes from the same chip-select (or chip-select in a pair)
then there is a good chance they could collide "soon" - our data is that
this combination predicts a UE within hours to a few days. If you have
row/column/bank decoding you can also perform further analysis of the
error source and assess the chances of a collision that would produce a UE.
That example has DIMM memory in mind, but similar approaches apply to
cache memory where it is ECC protected, and so on.
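
To make the distinction above concrete, here is a minimal sketch of how an
error stream might be bucketed by chip-select and checked for distinct
syndromes. Everything in it is an assumption for illustration: the
CorrectableError record, its field names, the classify_chip_selects helper
and the threshold of two distinct syndromes are hypothetical and are not
taken from Xen or from any real MCA decoder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectableError:
    # Hypothetical record layout; real MCA telemetry differs per platform.
    chip_select: int  # chip-select that reported the CE
    syndrome: int     # ECC syndrome, identifying which bit was corrected


def classify_chip_selects(errors, distinct_threshold=2):
    """Label each chip-select by how many distinct syndromes it produced.

    "single-syndrome": every CE hit the same bit (e.g. a bad pin), so a UE
    needs an unrelated CE to land in the same checkword.
    "multi-syndrome": repeated distinct syndromes, the pattern described
    above as a UE predictor. The threshold is an assumed value.
    """
    syndromes = defaultdict(set)
    for err in errors:
        syndromes[err.chip_select].add(err.syndrome)

    return {
        cs: "multi-syndrome" if len(seen) >= distinct_threshold
        else "single-syndrome"
        for cs, seen in syndromes.items()
    }


if __name__ == "__main__":
    # Chip-select 0: a thousand CEs, all with the same syndrome.
    stream = [CorrectableError(0, 0x1A)] * 1000
    # Chip-select 1: only three CEs, but each with a different syndrome.
    stream += [CorrectableError(1, s) for s in (0x03, 0x2C, 0x51)]
    print(classify_chip_selects(stream))
```

Note that the raw CE count plays no part in the classification: the
chip-select with a thousand identical errors is the less worrying of the two.
A real policy would presumably also use a time window and the
row/column/bank information where it is available.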
> > - Possible operations on a DomU
> >   - save/restore DomU
> >   - (live-)migrate DomU to a different physical machine
> >   - etc.
> Very heavy-weight operations, which I think are unlikely to succeed if
> you already suspect the system's going to suffer a UE soon.
As above, some predictors can give you hours to a few days warning of a UE.
Gavin
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
Current Thread |
- Re: Re: [Xen-devel] RFC: MCA/MCE concept,
Gavin Maltby <=