Christoph Egger <mailto:Christoph.Egger@xxxxxxx> wrote:
> Ok, here is a different interpretation of what is correctable and
> uncorrectable. Uncorrectable in your interpretation means that neither
> hardware nor software can do anything.
> Uncorrectable in my interpretation means the hardware can't correct it,
> but software may have more information and can correct it.
Yes. Maybe "fatal" is a more appropriate name here.
>
>> The main reason we need these flags is that several steps are required
>> for MCA handling. For example, when multiple MCEs happen on multiple
>> CPUs, first each CPU checks its own severity; second, we need to find
>> the CPU with the highest severity and take action. For example, CPU A
>> may get unrecoverable while CPU B gets recoverable; they will check the
>> information and the result, and the final outcome will be unrecoverable.
>
> I brought up an example of a broken memory page for my argumentation,
> you bring up a broken CPU for your argumentation.
>
> We need to find a common denominator to compare.
>
> If a CPU is completely broken and you are on UP, then the game is over. Not
> even a reboot can help. On an SMP system, offline the CPU and inform Dom0.
Sorry, I didn't get the relationship between the flags and the comparison of
the two examples. :$
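To make the two-step severity check from my earlier paragraph concrete, here is a minimal sketch. It is not the code in the patch: the enum, its values, and the function name are made up for illustration. The only assumption is that severities are ordered so that a larger value is worse, which lets the second step reduce to taking the maximum over all CPUs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severity levels, ordered so that a larger value is worse. */
enum mce_severity {
    MCE_SEV_NONE = 0,        /* no error logged on this CPU */
    MCE_SEV_RECOVERABLE,     /* software may be able to recover */
    MCE_SEV_UNRECOVERABLE    /* "fatal": neither hardware nor software can help */
};

/*
 * Step 1 (not shown): each CPU fills in its own entry of per_cpu[].
 * Step 2: combine the per-CPU results; the worst one decides the outcome.
 */
static enum mce_severity mce_worst_severity(const enum mce_severity *per_cpu,
                                            size_t ncpus)
{
    enum mce_severity worst = MCE_SEV_NONE;

    assert(per_cpu != NULL || ncpus == 0);
    for (size_t i = 0; i < ncpus; i++)
        if (per_cpu[i] > worst)
            worst = per_cpu[i];
    return worst;
}
```

With CPU A reporting MCE_SEV_UNRECOVERABLE and CPU B reporting MCE_SEV_RECOVERABLE, the combined result is MCE_SEV_UNRECOVERABLE, which matches the example above.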
>> Currently we only plan to support these two types. Do you have plans for
>> other recovery actions? And would such an action be done better in Dom0
>> than in Xen?
>
> Yes!! Solaris maintains a list of broken pages which is even persistent
> across reboot when the serial number of the DIMM didn't change.
> For doing page offlining properly, SUN should design a hypercall allowing
> the Dom0 to give Xen this list as early as possible at boot time.
We have a patch to support page offlining (sent as an RFC to the mailing list),
and it already exports a hypercall for Dom0 to ask Xen to offline pages (this
is for proactive action on CE errors from Dom0). Also, as Frank suggested, we
will add a hypercall for Dom0 to get a page's offline status, so it should be
OK.
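As a rough illustration of the two operations (request offline, query status), here is a self-contained toy model. It is not the hypercall interface from the RFC patch: every name here (demo_offline_page, demo_query_page, the states, the fixed-size table) is hypothetical, and the "pending" state for pages that are still in use is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_PAGES 16    /* tiny table, for illustration only */

/* Hypothetical page states as seen by the caller (Dom0 in the real design). */
enum demo_page_state {
    DEMO_PAGE_ONLINE = 0,
    DEMO_PAGE_OFFLINE_PENDING,   /* assumed: page still in use, offline later */
    DEMO_PAGE_OFFLINED
};

struct demo_page {
    bool in_use;
    enum demo_page_state state;
};

static struct demo_page demo_pages[DEMO_NR_PAGES];

/* Request that a page be offlined. Returns 0 on success, -1 if pfn is invalid. */
static int demo_offline_page(size_t pfn)
{
    if (pfn >= DEMO_NR_PAGES)
        return -1;
    demo_pages[pfn].state = demo_pages[pfn].in_use ? DEMO_PAGE_OFFLINE_PENDING
                                                   : DEMO_PAGE_OFFLINED;
    return 0;
}

/* Query the offline status of a page. Returns 0 on success, -1 if pfn is invalid. */
static int demo_query_page(size_t pfn, enum demo_page_state *out)
{
    assert(out != NULL);
    if (pfn >= DEMO_NR_PAGES)
        return -1;
    *out = demo_pages[pfn].state;
    return 0;
}
```

A persistent list of broken pages, like the Solaris one you describe, could then be replayed at boot by calling the offline request once per recorded page.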
> Further, with our Shanghai CPU, we can disable certain parts of its L3
> cache. Instead of offlining that broken CPU completely, just disable the
> broken part of it. The registers for this are in PCI config space.
> Since Xen delegates PCI access to Dom0, Dom0 can do that.
Sorry, I am not familiar with Shanghai, but I'm a bit surprised that when an
error happens in the cache, we would transfer control to Dom0 and wait for
Dom0's MCA handler to take action to disable the cache; that is a really long
code path. Per my understanding, if there is an issue in the cache, we should
clear/disable the cache ASAP to avoid more severe results, so this is a strong
example of a case where Xen should handle the MCA. Or maybe I missed something
important in this feature?
BTW, I want to clarify that this patch is for #MC handling (i.e. the
"uncorrected" error in your terms). For hardware-correctable errors (i.e.
"correctable"), Xen will do nothing but pass them to Dom0 as a vIRQ, as our
previous patch
(http://lists.xensource.com/archives/html/xen-devel/2008-12/msg00970.html )
shows, because a CE will not impact the system. So if the "cache index disable"
is meant to disable part of the cache after too many CEs (Correctable Errors)
as a proactive action, I think we are on the same page.
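The split described above can be sketched as a small routing function. This is only an illustration, not code from either patch: the enum and function names are hypothetical. The two bit positions are the architectural VAL (bit 63) and UC (bit 61) bits of the MCi_STATUS register; everything else about real handling (overflow, PCC, the address and misc registers) is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural MCi_STATUS bits used by this sketch. */
#define MCI_STATUS_VAL (1ULL << 63)   /* bank holds a valid error */
#define MCI_STATUS_UC  (1ULL << 61)   /* error was not corrected by hardware */

/* Hypothetical routing decision for one MCA bank. */
enum demo_mc_route {
    DEMO_ROUTE_NONE = 0,          /* nothing logged in this bank */
    DEMO_ROUTE_XEN_MC_HANDLER,    /* uncorrected: handled in Xen (#MC path) */
    DEMO_ROUTE_DOM0_VIRQ          /* corrected: only forwarded to Dom0 */
};

static enum demo_mc_route demo_route_bank(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return DEMO_ROUTE_NONE;
    if (status & MCI_STATUS_UC)
        return DEMO_ROUTE_XEN_MC_HANDLER;
    return DEMO_ROUTE_DOM0_VIRQ;
}
```

A proactive action such as disabling part of the cache after too many CEs would sit entirely behind the DEMO_ROUTE_DOM0_VIRQ case, in Dom0.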
I attached two slides that are part of our Xen summit presentation. Page 1 is
mainly for #MC handling; page 2 is for CE handling (through CMCI or polling).
Page 1 is described clearly in the patch. Page 2 is what our previous patch
did.
Thanks
-- Yunhong Jiang
>
> Christoph
>
> --
> ---to satisfy European Law for business letters:
> Advanced Micro Devices GmbH
> Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
> Geschaeftsfuehrer: Jochen Polster, Thomas M. McCoy, Giuliano Meroni
> Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632
Attachment: MCA.pdf