To: xen-bugs@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-bugs] [Bug 1562] New: On NHM-EX ER, two SRAO errors will cause CPU shutdown.
From: bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
Date: Thu, 14 Jan 2010 08:29:03 -0800
Delivery-date: Thu, 14 Jan 2010 08:29:07 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-bugs-request@lists.xensource.com?subject=help>
List-id: Xen Bugzilla <xen-bugs.lists.xensource.com>
List-post: <mailto:xen-bugs@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-bugs>, <mailto:xen-bugs-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-bugs>, <mailto:xen-bugs-request@lists.xensource.com?subject=unsubscribe>
Reply-to: bugs@xxxxxxxxxxxxxxxxxx
Sender: xen-bugs-bounces@xxxxxxxxxxxxxxxxxxx
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1562

           Summary: On NHM-EX ER, two SRAO errors will cause CPU shutdown.
           Product: Xen
           Version: unstable
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Hypervisor
        AssignedTo: xen-bugs@xxxxxxxxxxxxxxxxxxx
        ReportedBy: jiajun.xu@xxxxxxxxx


xen-changeset:   20122:8faef78ea759

pvops git: 
commit 16529fc075a95a84901842f7353ac906cd912bba
Merge: 5d78a20... 3186c67...
Author: Jeremy Fitzhardinge <jeremy.fitzhardinge@xxxxxxxxxx>

ioemu git: 
commit a83d119cfcc20bc7edb427992d6e31b3e99430be
Author: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
Date:   Mon Aug 10 18:02:56 2009 +0100

---

The cause is that the MCE handler does not clear the MCIP bit in MCG_STATUS.

When I inject an SRAO error, I get the following log:
(XEN) MCE: clear_bank map 100
(XEN) CPU42 enter softirq
(XEN) CPU10 enter softirq
(XEN) CPU54 enter softirq
(XEN) CPU22 enter softirq
(XEN) CPU6 enter softirq
(XEN) CPU38 enter softirq
(XEN) CPU46 enter softirq
(XEN) CPU14 enter softirq
(XEN) CPU30 enter softirq
(XEN) CPU62 enter softirq
(XEN) CPU58 enter softirq
(XEN) CPU26 enter softirq
(XEN) CPU34 enter softirq
(XEN) CPU2 enter softirq
(XEN) CPU18 enter softirq
(XEN) CPU50 enter softirq
(XEN) CPU41 enter softirq
(XEN) CPU9 enter softirq
(XEN) CPU33 enter softirq
(XEN) CPU1 enter softirq
(XEN) CPU49 enter softirq
(XEN) CPU17 enter softirq
(XEN) CPU61 enter softirq
(XEN) CPU29 enter softirq
(XEN) CPU45 enter softirq
(XEN) CPU13 enter softirq
(XEN) CPU5 enter softirq
(XEN) CPU37 enter softirq
(XEN) CPU57 enter softirq
(XEN) CPU25 enter softirq
(XEN) CPU21 enter softirq
(XEN) CPU53 enter softirq
(XEN) CPU15 enter softirq
(XEN) CPU47 enter softirq
(XEN) CPU23 enter softirq
(XEN) CPU55 enter softirq
(XEN) CPU7 enter softirq
(XEN) CPU39 enter softirq
(XEN) CPU31 enter softirq
(XEN) CPU63 enter softirq
(XEN) CPU27 enter softirq
(XEN) CPU59 enter softirq
(XEN) CPU19 enter softirq
(XEN) CPU51 enter softirq
(XEN) CPU43 enter softirq
(XEN) CPU11 enter softirq
(XEN) CPU0: Machine Check Exception:                5
(XEN) CPU3 enter softirq
(XEN) Bank 8: bd000000004000cf[              89] at        85cb4f040
(XEN) CPU32 enter softirq
(XEN) CPU0 enter softirq
(XEN) CPU4 enter softirq
(XEN) CPU36 enter softirq
(XEN) CPU16 enter softirq
(XEN) CPU48 enter softirq
(XEN) CPU28 enter softirq
(XEN) CPU60 enter softirq
(XEN) CPU12 enter softirq
(XEN) CPU44 enter softirq
(XEN) CPU56 enter softirq
(XEN) CPU24 enter softirq
(XEN) CPU40 enter softirq
(XEN) CPU8 enter softirq
(XEN) CPU52 enter softirq
(XEN) CPU20 enter softirq
(XEN) CPU35 enter softirq
(XEN) CPU26 handling errors
(XEN) MCE: send MCE# to DOM0 through virq
(XEN) mce.c:694:d0 MCE: rdmsr MCG_CAP 0x1000816
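
As a cross-check that the Bank 8 status value in the log above really describes an SRAO error, the status word can be decoded by hand. The sketch below is not Xen code; it assumes the IA32_MCi_STATUS bit layout from the Intel SDM (VAL=63, UC=61, PCC=57, S=56, AR=55), and the function names are made up for illustration. It ignores the EN bit and the MCA error code for brevity.

```python
# Bit positions in IA32_MCi_STATUS (assumed from the Intel SDM, not from Xen).
VAL, UC, PCC, S, AR = 63, 61, 57, 56, 55


def bit(value, pos):
    """Return bit `pos` of `value` as 0 or 1."""
    return (value >> pos) & 1


def classify(status):
    """Rough classification of a bank status word (hypothetical helper)."""
    if not bit(status, VAL):
        return "invalid"
    if not bit(status, UC):
        return "corrected"
    if bit(status, PCC):
        return "uncorrected, processor context corrupt"
    if not bit(status, S):
        return "UCNA"  # uncorrected, no action required
    # S=1: software recoverable; AR tells "action required" from "optional".
    return "SRAR" if bit(status, AR) else "SRAO"


if __name__ == "__main__":
    # Status value taken from the "Bank 8" line in the log.
    print(classify(0xBD000000004000CF))
```

For the logged value, UC and S are set while PCC and AR are clear, which is the SRAO combination.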

Then I inject a CMCI error and get the following log:
(XEN) CMCI: send CMCI to DOM0 through virq
(XEN) mce.c:694:d0 MCE: rdmsr MCG_CAP 0x100081
[root@lkp-nex03 einj]# mcelog 
mcelog: warning: record length longer than expected. Consider update.
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 3 BANK 8 TSC 8d40ecdc30 
STATUS d00000800800009f MCGSTATUS 5

MCGSTATUS=5 shows that MCIP is still set, i.e. it was never cleared.
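
To make the reading of MCGSTATUS explicit, here is a small decoding sketch. It assumes the MCG_STATUS layout from the Intel SDM (bit 0 RIPV, bit 1 EIPV, bit 2 MCIP); the helper name is hypothetical and this is not code from Xen or mcelog.

```python
# MCG_STATUS flag masks (layout assumed from the Intel SDM).
MCG_STATUS_RIPV = 1 << 0  # restart IP valid
MCG_STATUS_EIPV = 1 << 1  # error IP valid
MCG_STATUS_MCIP = 1 << 2  # machine check in progress


def decode_mcg_status(value):
    """Return the names of the flags set in an MCG_STATUS value."""
    flags = [
        ("RIPV", MCG_STATUS_RIPV),
        ("EIPV", MCG_STATUS_EIPV),
        ("MCIP", MCG_STATUS_MCIP),
    ]
    return [name for name, mask in flags if value & mask]


if __name__ == "__main__":
    # Value reported by mcelog after the CMCI injection.
    print(decode_mcg_status(5))
```

The value 5 is binary 101, so RIPV and MCIP are set and EIPV is clear.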

From the code logic, it seems mce_action is not called, i.e. the UCR handler
code is not executed.
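
The link between a stale MCIP bit and the CPU shutdown in the summary can be illustrated with a toy model. This is only a sketch under stated assumptions: per the Intel SDM, a machine check that arrives while MCIP is already set puts the processor into shutdown. The class and handler names below are invented for illustration and do not correspond to Xen's mce.c.

```python
MCG_STATUS_MCIP = 1 << 2  # machine check in progress (Intel SDM layout)


class ToyCpu:
    """Minimal model of MCIP handling; not a model of real hardware timing."""

    def __init__(self):
        self.mcg_status = 0
        self.shutdown = False

    def raise_mce(self, handler):
        if self.shutdown:
            return
        if self.mcg_status & MCG_STATUS_MCIP:
            # A second machine check while MCIP is set causes shutdown.
            self.shutdown = True
            return
        self.mcg_status |= MCG_STATUS_MCIP  # hardware sets MCIP on entry
        handler(self)


def buggy_handler(cpu):
    """Handles the error but leaves MCIP set (the behaviour reported here)."""


def fixed_handler(cpu):
    """Clears MCIP before returning, as a handler is expected to do."""
    cpu.mcg_status &= ~MCG_STATUS_MCIP


if __name__ == "__main__":
    for handler in (buggy_handler, fixed_handler):
        cpu = ToyCpu()
        cpu.raise_mce(handler)  # first SRAO
        cpu.raise_mce(handler)  # second SRAO
        print(handler.__name__, "shutdown =", cpu.shutdown)
```

In this model the second injected error shuts the CPU down only when the handler leaves MCIP set, which matches the symptom in the bug title.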

