WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 
Xen 
 
Home Products Support Community News
 
   
 

xen-devel

RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler

To: Jan Beulich <JBeulich@xxxxxxxx>, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
From: "Liu, Jinsong" <jinsong.liu@xxxxxxxxx>
Date: Tue, 11 Oct 2011 17:51:56 +0800
Accept-language: en-US
Acceptlanguage: en-US
Cc: "keir.xen@xxxxxxxxx" <keir.xen@xxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Delivery-date: Tue, 11 Oct 2011 02:52:55 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
In-reply-to: <4E941DA6020000780005AA9B@xxxxxxxxxxxxxxxxxxxx>
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: <BC00F5384FCFC9499AF06F92E8B78A9E263B557B77@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E84ADF70200007800058882@xxxxxxxxxxxxxxxxxxxx> <789F9655DD1B8F43B48D77C5D306597312D2366A9B@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E858AF30200007800058A64@xxxxxxxxxxxxxxxxxxxx> <789F9655DD1B8F43B48D77C5D306597312D23D3B7D@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E92C369020000780005A68B@xxxxxxxxxxxxxxxxxxxx> <BC00F5384FCFC9499AF06F92E8B78A9E2693B8F9FB@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> <4E941DA6020000780005AA9B@xxxxxxxxxxxxxxxxxxxx>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcyH8cc1MZXHp7EFQ+atzw3eBDiTWQABnt4Q
Thread-topic: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
Jan Beulich wrote:
>>>> On 11.10.11 at 10:15, "Liu, Jinsong" <jinsong.liu@xxxxxxxxx> wrote:
>> Jan Beulich wrote:
>>>>>> On 08.10.11 at 10:29, "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
>>>>>> wrote:
>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>> Sent: Friday, September 30, 2011 3:25 PM
>>>>> To: Liu, Jinsong; Jiang, Yunhong
>>>>> Cc: keir.xen@xxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>>>>> Subject: RE: [Xen-devel] [PATCH] X86 MCE: Add SRAR handler
>>>>> 
>>>>>>>> On 30.09.11 at 04:51, "Jiang, Yunhong"
>>>>>>>> <yunhong.jiang@xxxxxxxxx> wrote:
>>>>> 
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
>>>>>>> This made me look at the current source, and there I see in
>>>>>>> mce_urgent_action() 
>>>>>>> 
>>>>>>>     if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
>>>>>>> return -1; 
>>>>>>> 
>>>>>>> which I think should say ... _EIPV and use || instead. Thoughts?
>>>>>> 
>>>>>> I think this code means, if the error happens in hypervisor mode
>>>>>> (i.e. !guest_mode()), and RIPV indicate the RIP in stack can't be
>>>>>> restarted, we have to panic.
>>>>> 
>>>>> Then the guest_mode() check still lacks an extra check of EIPV,
>>>>> like 
>>>>> 
>>>>>      if ( !(gstatus & MCG_STATUS_RIPV) &&
>>>>>           (!(gstatus & MCG_STATUS_EIPV) || !guest_mode(regs)))
>>>>> return -1; 
>>>>> 
>>>> 
>>>> The RIPV is not related to the EIPV. RIPV means the context saved
>>>> in the stack can't be restarted anymore. According to the SDM, RIPV
>>>> means "execution can be restarted reliably at the instruction
>>>> pointed to by the instruction pointer pushed on the stack". It's
>>>> not about error happened synchronously or asynchronously. The
>>>> point is, if the program is running in hypervisor context, and
>>>> RIPV tells us that the program can't be restarted, we can't do
>>>> anything but panic, because we can't switch context while we are
>>>> in xen. So this code have nothing to do with EIPV.
>>> 
>>> I continue to disagree (including the statement in your other
>>> response): RIPV only tells us whether we can resume, not in which
>>> context the error occurred. EIPV tells us whether, by looking at the
>>> saved registers, we can determine the context that the error
>>> occurred in. Since with !RIPV we have to determine in what context
>>> the error occurred in order to decide whether to panic or just kill
>>> a guest, we can't ignore EIPV (and if it's not set we have to
>>> assume the worst case, since even if the registers indicate guest
>>> mode the error may have occurred in hypervisor context or accessing
>>> hypervisor structures [consider e.g. a data load error during a GDT
>>> access]). 
>>> 
>>> Jan
>> 
>> Yes, I agree EIPV=0 may indicate async error, but I think your
>> solution *overkilled* most cases (i.e. the real guest instruction
>> fetch error). 
>> 
>> Our idea is,
>>   * xen mce would flush prefetched instruction so we can delay
>> handle it until if real need; 
>>   * a h/w error will not disappear, but if it was not being
>> *consumed*, it's OK for system keep going (like SRAO error which do
>> not need s/w handle immediately); 
>> 
>> Suppose an async instruction fetch error (RIVP=EIVP=0), triggered at
>> guest context but instruction prefetch hypervisor context. The
>>   scenario is, * at xen mce, the prefetched instruction has been
>> flushed. xen mce handler needn't panic, instead it mark the page as
>> broken page, then trigger vmce to guest;
> 
> If the prefetch was from Xen space (only in guest context),
> delivering a vMCE to the guest is pointless (and perhaps confusing to
> the guest). 
> 

Yes, exactly. how about delay handle it as:
* at mce isr
        if ( !(gstatus & MCG_STATUS_RIPV) && !guest_mode(regs))
                xen panic;
* at mce softirq
        if ( (srar error) && (EIPV ==0) && (broken page owned by hypervisor) )
                xen panic;

>>   * guest may kill app, kernel thread, guest itself, or whatever;
>> 
>> The error is still an error, w/ 2 possibilities in the future:
>>   1. it may not be consumed as an SRAR error, system keep going, h/w
>> mechanism may detect a SRAO error (i.e. memroy scrub) at some time
>> point and handled then; 
>>   2. it may be consumed at some time point and a SRAR error
>>    triggered again. At this time, 1). if srar occurred at hypervisor
>>    context, xen will panic. or, 2). if srar occurred at guest
>> context, xen kill the guest as a malicious one (as what the 2nd
>> patch do), and move the page to broken page list; 
>> 
>> Considering the rare possibility of the above case, I think it's
>> acceptable to handle it in this way. Thoughts?
> 
> You're only discussing instruction fetches (which can be discarded),
> but you're not covering the other example I gave (GDT access from
> guest context - just like this is a ring-0 operations from the paging
> unit's pov, this ought to be an out-of-context operation from MCE's
> perspective). 
> 
> Jan

That would be data load error (EIPV=1), a sync error.

Thanks,
Jinsong


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel