xen-devel
Re: [Xen-devel] [PATCH] re-work MCA telemetry internals; use common code
Ke, Liping wrote:
Hi, Frank
We did some small test here, found the CMCI problem is caused by the
"mce_banks_owned" bitmap param passing.
When CMCI happened, we print the bitmap (in smp_cmci_interrupt) value, it's correct (For cpu0, 16c).
When passing into mcheck_mca_logout, it turned to be "0xFFFF~~FFF" which is wrong.
Only for your info -:)
Still, I suggest split the patch since this patch is realy big -:)
Thanks a lot for your help!
Criping
Ok, since the patch is already in (thanks Keir!), let's work out any
issues you may see with it.
When you say that the mce_banks_owned value is wrong in
mcheck_mca_logout, do you mean that the parameter passing somehow went
wrong? Is the value ok right before the call, but not in
mcheck_mca_logout itself? That would be strange.
You mentioned the polling code: it didn't actually change much, in fact,
the general polling code now is pretty much the same as the polling code
that was originally in mce_intel.c. That code is very similar. Some AMD
code contained AMD-specific things, and has not been touched (it still
has its own poll function).
In general, the change is:
* machine_check_poll (going over the banks to look for non-fatal errors)
and the loops inside the mcheck handlers that also walk the MC banks,
have been combined in one function, which is mcheck_mca_logout.
mcheck_mca_logout still takes a context argument, as machine_check_poll did.
* The mcheck handlers themselves mostly just call mcheck_cmn_handler,
which calls mcheck_mca_logout, looks at the context, and then decides
what to do.
* The mctelem structure as returned by mcheck_mca_logout, contains the
relevant telemetry for the current error. If it should be put in the
"database" for dom0 retrieval, use mctelem_commit to put it there.
Otherwise, the usual course of action is to print the values and uses
mctelem_dismiss to discard the information.
There is one thing that needs to be addressed: in your Intel-specific
code, you do not reset the per-bank status in the case of a fatal error,
so that the BIOS may look at the MSRs. mcheck_mca_logout does always
reset the values. The best way to fix this is probably to pass in a flag
to mcheck_mca_logout to tell it whether it should reset the status MSRs
or not. The caller can then decide if they should be reset later.
Let me know if you have any more issues, I'll be happy to help merge
your code and test it.
- Frank
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|
|
|