xen-devel
Re: Re: [Xen-devel] RFC: MCA/MCE concept
 
Hi,
Apologies for the screwy quoting below - I did not receive the first half of
this thread so it's been forwarded to me.
 
  - Dom0 got enough CEs so that UEs are very likely to happen in order
     to "circumvent" UEs.
 
 
 
The greatest rewards here are in syndrome/row/column/bank analysis of the
error stream.  Where something like a bad pin produces tonnes of CEs,
they are always on the same bit, and your chance of a UE is that of a random
radiation-type CE colliding within the set of ECC checkwords being undermined
by that pin - not very high.  On the other hand, if we're seeing repeated
distinct syndromes from the same chip-select (or chip-select in a pair)
then there is a good chance they could collide "soon" - our data is that
this combination predicts a UE within hours to a few days.  If you have
row/column/bank decoding you can also perform further analysis of the
error source and assess the chances of a collision that would produce a UE.
That example has DIMM memory in mind, but similar approaches apply to
cache memory where it is ECC protected and so on.
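
To make the syndrome heuristic above concrete, here is a minimal sketch in
Python. It is only an illustration, not code from Xen or any existing fault
manager: the CorrectableError record, its field names, the labels and the
distinct_threshold value are all assumptions made for the example. It groups
CEs by chip-select and flags any chip-select reporting more than one distinct
syndrome. Chip-select pairs and row/column/bank decoding are left out.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CorrectableError(NamedTuple):
    """One CE report. Both fields are hypothetical names for this sketch."""
    chip_select: int  # chip-select the CE was reported against
    syndrome: int     # ECC syndrome recorded for the CE


def classify_chip_selects(errors: Iterable[CorrectableError],
                          distinct_threshold: int = 2) -> dict[int, str]:
    """Label each chip-select by how many distinct syndromes it has produced.

    distinct_threshold is an assumed tuning knob, not a measured value.
    """
    syndromes_seen = defaultdict(set)
    for err in errors:
        syndromes_seen[err.chip_select].add(err.syndrome)

    labels = {}
    for chip_select, seen in syndromes_seen.items():
        if len(seen) >= distinct_threshold:
            # Repeated distinct syndromes: CEs may collide and produce a UE.
            labels[chip_select] = "elevated-ue-risk"
        else:
            # Many CEs with a single syndrome, e.g. the bad-pin case.
            labels[chip_select] = "single-syndrome"
    return labels


# Example stream: chip-select 0 repeats one syndrome, chip-select 1 shows two.
example_stream = (
    [CorrectableError(chip_select=0, syndrome=0x1A)] * 50
    + [CorrectableError(chip_select=1, syndrome=0x03),
       CorrectableError(chip_select=1, syndrome=0x41)]
)
example_labels = classify_chip_selects(example_stream)
```

In this sketch the raw CE count plays no part in the label: fifty CEs with one
syndrome still come out as "single-syndrome", while two CEs with different
syndromes on the same chip-select are flagged.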
 
  - Possible operations on a DomU
       - save/restore DomU
       - (live-)migrate DomU to a different physical machine
       - etc.
 
Very heavy-weight operations, which I think are unlikely to succeed if
you already suspect the system's going to suffer a UE soon.
 
 
 
As above, some predictors can give you hours to a few days' warning of a UE.
Gavin
 
- Re: Re: [Xen-devel] RFC: MCA/MCE concept,
Gavin Maltby <=
 
  