Jan Beulich wrote:
>>>> On 08.10.11 at 10:29, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>> wrote:
>
>>
>>> -----Original Message-----
>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>> Sent: Friday, September 30, 2011 3:25 PM
>>> To: Liu, Jinsong; Jiang, Yunhong
>>> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
>>>
>>>>>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>>> wrote:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>> This made me look at the current source, and there I see in
>>>>> mce_urgent_action()
>>>>>
>>>>> if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
>>>>> return -1;
>>>>>
>>>>> which I think should say ... _EIPV and use || instead. Thoughts?
>>>>
>>>> I think this code means, if the error happens in hypervisor mode
>>>> (i.e. !guest_mode()), and RIPV indicate the RIP in stack can't be
>>>> restarted, we have to panic.
>>>
>>> Then the guest_mode() check still lacks an extra check of EIPV, like
>>>
>>> if ( !(gstatus & MCG_STATUS_RIPV) &&
>>> (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))
>>> return -1;
>>>
>>
>> The RIPV is not related to the EIPV. RIPV means the context saved in
>> the stack can't be restarted anymore. According to the SDM, RIPV
>> means "execution can be restarted reliably at the instruction
>> pointed to by the instruction pointer pushed on the stack". It's not
>> about error happened synchronously or asynchronously. The point is,
>> if the program is running in hypervisor context, and RIPV tells us
>> that the program can't be restarted, we can't do anything but panic,
>> because we can't switch context while we are in xen. So this code
>> have nothing to do with EIPV.
>
> I continue to disagree (including the statement in your other
> response): RIPV only tells us whether we can resume, not in which
> context the error occurred. EIPV tells us whether, by looking at the
> saved registers, we can determine the context that the error occurred
> in. Since with !RIPV
> we have to determine in what context the error occurred in order to
> decide whether to panic or just kill a guest, we can't ignore EIPV
> (and if it's not set we have to assume the worst case, since even if
> the registers indicate guest mode the error may have occurred in
> hypervisor context or accessing hypervisor structures [consider e.g. a
> data load error during a GDT access]).
>
> Jan
Yes, I agree EIPV=0 may indicate async error, but I think your solution
*overkilled* most cases (i.e. the real guest instruction fetch error).
Our idea is,
* xen mce would flush prefetched instruction so we can delay handle it until
if real need;
* a h/w error will not disappear, but if it was not being *consumed*, it's OK
for system keep going (like SRAO error which do not need s/w handle
immediately);
Suppose an async instruction fetch error (RIVP=EIVP=0), triggered at guest
context but instruction prefetch hypervisor context. The scenario is,
* at xen mce, the prefetched instruction has been flushed. xen mce handler
needn't panic, instead it mark the page as broken page, then trigger vmce to
guest;
* guest may kill app, kernel thread, guest itself, or whatever;
The error is still an error, w/ 2 possibilities in the future:
1. it may not be consumed as an SRAR error, system keep going, h/w mechanism
may detect a SRAO error (i.e. memroy scrub) at some time point and handled then;
2. it may be consumed at some time point and a SRAR error triggered again. At
this time,
1). if srar occurred at hypervisor context, xen will panic. or,
2). if srar occurred at guest context, xen kill the guest as a malicious one
(as what the 2nd patch do), and move the page to broken page list;
Considering the rare possibility of the above case, I think it's acceptable to
handle it in this way.
Thoughts?
Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|