WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler

To: Jan Beulich <JBeulich@xxxxxxxx>, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
Date: Tue, 11 Oct 2011 16:15:30 +0800
Accept-language: en-US
Acceptlanguage: en-US
Cc: "keir.xen@xxxxxxxxx" <keir.xen@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Tue, 11 Oct 2011 01:16:28 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4E92C369020000780005A68B@xxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <BC00F5384FCFC9499AF06F92E8B78A9E263B557B77@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E84ADF70200007800058882@xxxxxxxxxxxxxxxxxxxx> <789F9655DD1B8F43B48D77C5D306597312D2366A9B@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E858AF30200007800058A64@xxxxxxxxxxxxxxxxxxxx> <789F9655DD1B8F43B48D77C5D306597312D23D3B7D@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E92C369020000780005A68B@xxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcyHI2PJ7bYhqi7WRyGL80xsVTeSvAAsi7NQ
Thread-topic: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
Jan Beulich wrote:
>>>> On 08.10.11 at 10:29, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>> wrote: 
> 
>> 
>>> -----Original Message-----
>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>> Sent: Friday, September 30, 2011 3:25 PM
>>> To: Liu, Jinsong; Jiang, Yunhong
>>> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
>>> 
>>>>>> On 30.09.11 at 04:51, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>>> wrote: 
>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>> This made me look at the current source, and there I see in
>>>>> mce_urgent_action() 
>>>>> 
>>>>>     if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))      
>>>>> return -1; 
>>>>> 
>>>>> which I think should say ... _EIPV and use || instead. Thoughts?
>>>> 
>>>> I think this code means, if the error happens in hypervisor mode
>>>> (i.e. !guest_mode()), and RIPV indicate the RIP in stack can't be
>>>> restarted, we have to panic.
>>> 
>>> Then the guest_mode() check still lacks an extra check of EIPV, like
>>> 
>>>      if ( !(gstatus & MCG_STATUS_RIPV) &&
>>>           (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))     
>>> return -1; 
>>> 
>> 
>> The RIPV is not related to the EIPV. RIPV means the context saved in
>> the stack can't be restarted anymore. According to the SDM, RIPV
>> means "execution can be restarted reliably at the instruction
>> pointed to by the instruction pointer pushed on the stack". It's not
>> about error happened synchronously or asynchronously. The point is,
>> if the program is running in hypervisor context, and RIPV tells us
>> that the program can't be restarted, we can't do anything but panic,
>> because we can't switch context while we are in xen. So this code
>> have nothing to do with EIPV. 
> 
> I continue to disagree (including the statement in your other
> response): RIPV only tells us whether we can resume, not in which
> context the error occurred. EIPV tells us whether, by looking at the
> saved registers, we can determine the context that the error occurred
> in. Since with !RIPV 
> we have to determine in what context the error occurred in order to
> decide whether to panic or just kill a guest, we can't ignore EIPV
> (and if it's not set we have to assume the worst case, since even if
> the registers indicate guest mode the error may have occurred in
> hypervisor context or accessing hypervisor structures [consider e.g. a
> data load error during a GDT access]).
> 
> Jan

Yes, I agree EIPV=0 may indicate async error, but I think your solution 
*overkilled* most cases (i.e. the real guest instruction fetch error).

Our idea is,
  * xen mce would flush prefetched instruction so we can delay handle it until 
if real need;
  * a h/w error will not disappear, but if it was not being *consumed*, it's OK 
for system keep going (like SRAO error which do not need s/w handle 
immediately);

Suppose an async instruction fetch error (RIVP=EIVP=0), triggered at guest 
context but instruction prefetch hypervisor context. The scenario is,
  * at xen mce, the prefetched instruction has been flushed. xen mce handler 
needn't panic, instead it mark the page as broken page, then trigger vmce to 
guest;
  * guest may kill app, kernel thread, guest itself, or whatever;

The error is still an error, w/ 2 possibilities in the future:
  1. it may not be consumed as an SRAR error, system keep going, h/w mechanism 
may detect a SRAO error (i.e. memroy scrub) at some time point and handled then;
  2. it may be consumed at some time point and a SRAR error triggered again. At 
this time,
   1). if srar occurred at hypervisor context, xen will panic. or, 
   2). if srar occurred at guest context, xen kill the guest as a malicious one 
(as what the 2nd patch do), and move the page to broken page list;

Considering the rare possibility of the above case, I think it's acceptable to 
handle it in this way.
Thoughts?

Thanks,
Jinsong
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel