WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/

xen-devel

Re: [Xen-devel] [PATCH 0/2] MCA support with page offlining

Hi,

> But I'm a bit confused by some changes in this patch.

Sorry, I attached the wrong patch; this is the correct one.

Thanks,
KAZ

Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>


From: "Jiang, Yunhong" <yunhong.jiang@xxxxxxxxx>
Subject: RE: [Xen-devel] [PATCH 0/2] MCA support with page offlining
Date: Fri, 19 Dec 2008 18:48:11 +0800

> SUZUKI Kazuhiro <mailto:kaz@xxxxxxxxxxxxxx> wrote:
> > Hi Yunhong and Shane,
> > 
> > Thank you for your comments and reviews.
> > 
> >> For page offlining, it may be better to put the page into a list
> >> rather than an array.
> > 
> > I modified the code to register offlined pages in a list instead of an array.
> > 
> >> I guess the code targets offlining domain pages only, right? How about
> >> free pages and xen pages?
> >> If so, there is no need for the check below when allocating free pages,
> >> since the offlined pages will not be freed into the heap.
> >> If not, the following may have a bug.
> > 
> > Yes, I assumed that page offlining was needed only for domain pages. If
> > xen pages are impacted, it is enough to crash the hypervisor in the
> > current implementation.
> 
> We have an internal patch for a similar purpose, covering page offlining
> caused by #MC as well as other causes. We can base our work on your patch
> if needed.
> 
> But I'm a bit confused by some changes in this patch.
> In your previous version, if a page was marked as PGC_reserved, it would
> not be allocated anymore. However, in this version, when a page is marked
> as PGC_reserved, it is just not freed, so does that mean the page will not
> be removed from the current list? That seems a bit hacky to me.
> 
> What I think page offline should do is:
> a) Mark the page as PGC_reserved (or another name like PGC_broken).
> b) If it is free, go to step c; otherwise, wait until it is freed by the
> owner and then go to step c.
> c) Remove the page from the buddy system, move it to a special, separate
> list (i.e. not in the heap[][][] anymore), and return the other pages to
> the buddy allocator.
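[Editorial note: the a)/b)/c) flow proposed above can be sketched in a few lines of C. This is a minimal stand-in model, not Xen code: `fake_page`, `offline_page()`, `page_freed()` and the flag values are all hypothetical, chosen only to show the mark-then-defer-then-retire state machine.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flags; not Xen's real PGC_* values. */
#define PGC_broken   (1u << 0)
#define PG_FREE      (1u << 1)

struct fake_page {
    unsigned int count_info;
    struct fake_page *next;       /* singly linked broken list */
};

/* Step c's destination: pages pulled out of the buddy system for good. */
static struct fake_page *broken_list;

/* Step c: move the page onto the separate broken list. */
static void retire_page(struct fake_page *pg)
{
    pg->next = broken_list;
    broken_list = pg;
}

/* Steps a and b: mark the page; retire it now if free, otherwise defer. */
static void offline_page(struct fake_page *pg)
{
    pg->count_info |= PGC_broken;
    if (pg->count_info & PG_FREE)
        retire_page(pg);
    /* else: wait until the owner frees it (see page_freed below) */
}

/* Called when the owner frees a page: the deferred b -> c transition. */
static void page_freed(struct fake_page *pg)
{
    pg->count_info |= PG_FREE;
    if (pg->count_info & PGC_broken)
        retire_page(pg);          /* never returned to the allocator */
}
```

The point of the sketch is that once marked, a page has exactly one exit path (the broken list), whether it was free at marking time or freed later by its owner.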
> 
> One open question is in step b: if the page is owned by a guest, we could
> replace it with a new page through the p2m table and would not need to
> wait until it is freed. We don't do that currently because it is a bit
> complex to achieve.
> 
> What is your opinion on this?
> 
> Thanks
> Yunhong Jiang
> 
> > 
> > I attach an updated patch for the xen part, which also includes some bug fixes.
> > 
> > Thanks,
> > KAZ
> > 
> > Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>
> > 
> > 
> > From: "Wang, Shane" <shane.wang@xxxxxxxxx>
> > Subject: RE: [Xen-devel] [PATCH 0/2] MCA support with page offlining
> > Date: Tue, 16 Dec 2008 19:10:00 +0800
> > 
> >> For page offlining, it may be better to put the page into a list
> >> rather than an array.
> >> 
> >> +  pg->count_info |= PGC_reserved;
> >> +  page_offlining[num_page_offlining++] = pg;
> >> 
> >> I guess the code targets offlining domain pages only, right? How about
> >> free pages and xen pages?
> >> If so, there is no need for the check below when allocating free pages,
> >> since the offlined pages will not be freed into the heap.
> >> If not, the following may have a bug.
> >> 
> >> +                if ( !list_empty(&heap(node, zone, j)) ) {
> >> +                    pg = list_entry(heap(node, zone, j).next,
> >> +                                    struct page_info, list);
> >> +                    if (!(pg->count_info & PGC_reserved))
> >> +                        goto found;
> >> +                    else
> >> +                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
> >> +                               pg, page_to_maddr(pg));
> >> +                }
> >> 
> >> If one free page (not pg) within pg and pg+(1U<<j) is offlined, the
> >> range pg~pg+(1U<<j) runs the risk of being allocated along with that
> >> page.
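[Editorial note: Shane's point above is that the quoted check inspects only the head page of an order-j free chunk, while an offlined page may sit anywhere in the pg..pg+(1U<<j)-1 range. The sketch below contrasts the head-only check with a whole-chunk scan. The flat-array page model and flag value are illustrative, not Xen's real allocator.]

```c
#include <assert.h>
#include <stdbool.h>

#define PGC_reserved (1u << 0)    /* illustrative flag value */

struct fake_page { unsigned int count_info; };

/* Head-only check, as in the quoted patch fragment. */
static bool head_is_clean(const struct fake_page *pg)
{
    return !(pg->count_info & PGC_reserved);
}

/* Whole-chunk check: every page in the 1U<<order range must be clean
 * before the chunk can safely be handed out. */
static bool chunk_is_clean(const struct fake_page *pg, unsigned int order)
{
    for (unsigned long i = 0; i < (1UL << order); i++)
        if (pg[i].count_info & PGC_reserved)
            return false;
    return true;
}
```

With a reserved page buried inside the chunk, the head-only check passes while the whole-chunk check correctly rejects it, which is exactly the risk described above.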
> >> 
> >> Shane
> >> 
> >> Jiang, Yunhong wrote:
> >>> xen-devel-bounces@xxxxxxxxxxxxxxxxxxx <> wrote:
> >>>> Hi all,
> >>>> 
> >>>> I had posted about MCA support for Intel64 before. It had only a
> >>>> function to log the MCA error data received from the hypervisor.
> >>>> 
> >>>> http://lists.xensource.com/archives/html/xen-devel/2008-09/msg00876.html
> >>>> 
> >>>> I attach patches that support not only error logging but also a page
> >>>> offlining function. The page where an MCA occurs will be offlined and
> >>>> not reused. A new flag 'PGC_reserved' was added to the page count_info
> >>>> to mark the impacted page.
> >>>> 
> >>>> I know that it would be better to implement page offlining for
> >>>> general use, but in this first step I implemented it specifically for MCA.
> >>> 
> >>> Maybe MCA page offline is a bit different from the normal page offline
> >>> requirement, so taking it as a first step may be a good choice :)
> >>> 
> >>> As for your current page_offlining, I'm not sure why the PGC_reserved
> >>> page should not be freed. Also, with the following code, won't
> >>> heap(node, zone, j) become unallocatable? Maybe we can create a
> >>> special list to hold all those pages and remove them from the heap
> >>> list?
> >>> 
> >>> +                if ( !list_empty(&heap(node, zone, j)) ) {
> >>> +                    pg = list_entry(heap(node, zone, j).next,
> >>> +                                    struct page_info, list);
> >>> +                    if (!(pg->count_info & PGC_reserved))
> >>> +                        goto found;
> >>> +                    else
> >>> +                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
> >>> +                               pg, page_to_maddr(pg));
> >>> +
> >>> 
> >>> 
> >>>> 
> >>>> And I also implemented the MCA handler for Dom0, which supports
> >>>> shutting down a remote domain where an MCA occurred. If the MCA
> >>>> occurred on a DomU, Dom0 notifies the DomU. If the notification
> >>>> fails, Dom0 calls the SCHEDOP_remote_shutdown hypercall.
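[Editorial note: the notify-then-fallback behaviour described above can be sketched as follows. This is a toy model: `notify_domu()`, `remote_shutdown()` and `handle_domu_mca()` are hypothetical stand-ins, not the real hypercall interface; only the decision structure (notify first, shut down remotely on failure) mirrors the description.]

```c
#include <assert.h>
#include <stdbool.h>

typedef int domid_t;

/* Records which domain (if any) was remotely shut down. */
static domid_t shutdown_called_for = -1;

/* Stand-in for delivering a virtual #MC notification to the DomU;
 * 'deliverable' simulates whether delivery can succeed. */
static bool notify_domu(domid_t d, bool deliverable)
{
    (void)d;
    return deliverable;
}

/* Stand-in for the SCHEDOP_remote_shutdown hypercall. */
static void remote_shutdown(domid_t d)
{
    shutdown_called_for = d;
}

/* Dom0's handling of an MCA that hit a DomU: try to notify the DomU,
 * and fall back to a remote shutdown if the notification fails. */
static void handle_domu_mca(domid_t d, bool deliverable)
{
    if (!notify_domu(d, deliverable))
        remote_shutdown(d);
}
```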
> >>>> 
> >>>> [1/2] xen part: mca-support-with-page-offlining-xen.patch
> >>> 
> >>> We are not sure we really need to pass all #MC information to dom0
> >>> first and let dom0 notify the domU. Xen should know about everything,
> >>> so it may have the knowledge to decide whether to inject a virtual
> >>> #MC into the guest or not. Of course, this does not impact your patch.
> >>> 
> >>>> [2/2] linux/x86_64 part: mca-support-with-page-offlining-linux.patch
> >>> 
> >>> As for how to inject a virtual #MC into a guest (including dom0), I
> >>> think we need to consider the following points:
> >>> 
> >>> a) Benefit from reusing the guest's #MC handler. #MC handlers are
> >>> well known to be difficult to test, and the native guest handler may
> >>> have been tested more widely. Also, #MC handlers improve as time goes
> >>> on; reusing the guest's MCA handler shares those improvements with us.
> >>> b) Maintaining the PV handler for different OS versions may not be so
> >>> easy, especially as hardware improves and kernels gain better support
> >>> for error handling/containment.
> >>> c) A #MC handler may need some model-specific information to decide
> >>> on an action, while a guest (not dom0) has virtualized CPUID
> >>> information.
> >>> d) The guest's MCA handler may require physical information about the
> >>> #MC, like the CPU number on which the #MC happened.
> >>> e) For an HVM domain, a PV handler will be difficult (considering
> >>> Windows guests).
> >>> 
> >>> And we have several options for supporting virtual #MC in guests:
> >>> 
> >>> Option 1 is what is currently implemented. A PV #MC handler is
> >>> implemented in the guest. This PV handler gets MCA information from
> >>> the Xen HV through a hypercall, including MCA MSR values and some
> >>> additional information, like which physical CPU the MCA happened on.
> >>> Option 1 helps with issue d), but we need to maintain a PV handler
> >>> and can't benefit from the native handler. It also does not resolve
> >>> issue c) very well.
> >>> 
> >>> Option 2: Xen provides MCA MSR virtualization so that the guest's
> >>> native #MC handler can run without changes. It can benefit from the
> >>> guest's #MC handler, but it will be difficult to get model-specific
> >>> information, and there is no physical information.
> >>> 
> >>> Option 3 uses a PV #MC handler in the guest as in option 1, but the
> >>> interface between Xen and the guest consists of abstract events, like
> >>> offlining the offending page, terminating the current execution
> >>> context, etc. This should be straightforward for Linux, but may be
> >>> difficult for Windows and other OSes.
> >>> 
> >>> Currently we are considering option 2, providing MCA MSR
> >>> virtualization to guests, and dom0 can also benefit from such support
> >>> (if a guest has a different CPUID than native, we will either keep
> >>> the guest running or kill it based on the error code). Of course, the
> >>> current mechanism of passing MCA information from xen to dom0 will
> >>> still be useful, but it will be used for logging purposes or for
> >>> Correctable Errors. What do you think about this?
> >>> 
> >>> Thanks
> >>> Yunhong Jiang
> >>> 
> >>>> Signed-off-by: Kazuhiro Suzuki <kaz@xxxxxxxxxxxxxx>
> >>>> 
> >>>> Thanks,
> >>>> KAZ
> >>>> 
> >>>> _______________________________________________
> >>>> Xen-devel mailing list
> >>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>> http://lists.xensource.com/xen-devel
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/amd_f10.c
--- a/xen/arch/x86/cpu/mcheck/amd_f10.c Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/amd_f10.c Mon Dec 20 09:35:50 2008 +0900
@@ -82,8 +82,6 @@
 }
 
 
-extern void k8_machine_check(struct cpu_user_regs *regs, long error_code);
-
 /* AMD Family10 machine check */
 void amd_f10_mcheck_init(struct cpuinfo_x86 *c) 
 { 
@@ -91,7 +89,7 @@
        uint32_t i;
        int cpu_nr;
 
-       machine_check_vector = k8_machine_check;
+       machine_check_vector = x86_machine_check;
        mc_callback_bank_extended = amd_f10_handler;
        cpu_nr = smp_processor_id();
        wmb();
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/amd_k8.c
--- a/xen/arch/x86/cpu/mcheck/amd_k8.c  Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/amd_k8.c  Mon Dec 20 09:35:50 2008 +0900
@@ -69,220 +69,8 @@
 #include "mce.h"
 #include "x86_mca.h"
 
+extern int mce_bootlog;
 
-/* Machine Check Handler for AMD K8 family series */
-void k8_machine_check(struct cpu_user_regs *regs, long error_code)
-{
-       struct vcpu *vcpu = current;
-       struct domain *curdom;
-       struct mc_info *mc_data;
-       struct mcinfo_global mc_global;
-       struct mcinfo_bank mc_info;
-       uint64_t status, addrv, miscv, uc;
-       uint32_t i;
-       unsigned int cpu_nr;
-       uint32_t xen_impacted = 0;
-#define DOM_NORMAL     0
-#define DOM0_TRAP      1
-#define DOMU_TRAP      2
-#define DOMU_KILLED    4
-       uint32_t dom_state = DOM_NORMAL;
-
-       /* This handler runs as interrupt gate. So IPIs from the
-        * polling service routine are defered until we finished.
-        */
-
-        /* Disable interrupts for the _vcpu_. It may not re-scheduled to
-        * an other physical CPU or the impacted process in the guest
-        * continues running with corrupted data, otherwise. */
-        vcpu_schedule_lock_irq(vcpu);
-
-       mc_data = x86_mcinfo_getptr();
-       cpu_nr = smp_processor_id();
-       curdom = vcpu->domain;
-
-       memset(&mc_global, 0, sizeof(mc_global));
-       mc_global.common.type = MC_TYPE_GLOBAL;
-       mc_global.common.size = sizeof(mc_global);
-
-       mc_global.mc_domid = curdom->domain_id; /* impacted domain */
-       mc_global.mc_coreid = vcpu->processor; /* impacted physical cpu */
-       BUG_ON(cpu_nr != vcpu->processor);
-       mc_global.mc_core_threadid = 0;
-       mc_global.mc_vcpuid = vcpu->vcpu_id; /* impacted vcpu */
-#if 0 /* TODO: on which socket is this physical core?
-         It's not clear to me how to figure this out. */
-       mc_global.mc_socketid = ???;
-#endif
-       mc_global.mc_flags |= MC_FLAG_UNCORRECTABLE;
-       rdmsrl(MSR_IA32_MCG_STATUS, mc_global.mc_gstatus);
-
-       /* Quick check, who is impacted */
-       xen_impacted = is_idle_domain(curdom);
-
-       /* Dom0 */
-       x86_mcinfo_clear(mc_data);
-       x86_mcinfo_add(mc_data, &mc_global);
-
-       for (i = 0; i < nr_mce_banks; i++) {
-               struct domain *d;
-
-               rdmsrl(MSR_IA32_MC0_STATUS + 4 * i, status);
-
-               if (!(status & MCi_STATUS_VAL))
-                       continue;
-
-               /* An error happened in this bank.
-                * This is expected to be an uncorrectable error,
-                * since correctable errors get polled.
-                */
-               uc = status & MCi_STATUS_UC;
-
-               memset(&mc_info, 0, sizeof(mc_info));
-               mc_info.common.type = MC_TYPE_BANK;
-               mc_info.common.size = sizeof(mc_info);
-               mc_info.mc_bank = i;
-               mc_info.mc_status = status;
-
-               addrv = 0;
-               if (status & MCi_STATUS_ADDRV) {
-                       rdmsrl(MSR_IA32_MC0_ADDR + 4 * i, addrv);
-                       
-                       d = maddr_get_owner(addrv);
-                       if (d != NULL)
-                               mc_info.mc_domid = d->domain_id;
-               }
-
-               miscv = 0;
-               if (status & MCi_STATUS_MISCV)
-                       rdmsrl(MSR_IA32_MC0_MISC + 4 * i, miscv);
-
-               mc_info.mc_addr = addrv;
-               mc_info.mc_misc = miscv;
-
-               x86_mcinfo_add(mc_data, &mc_info); /* Dom0 */
-
-               if (mc_callback_bank_extended)
-                       mc_callback_bank_extended(mc_data, i, status);
-
-               /* clear status */
-               wrmsrl(MSR_IA32_MC0_STATUS + 4 * i, 0x0ULL);
-               wmb();
-               add_taint(TAINT_MACHINE_CHECK);
-       }
-
-       status = mc_global.mc_gstatus;
-
-       /* clear MCIP or cpu enters shutdown state
-        * in case another MCE occurs. */
-       status &= ~MCG_STATUS_MCIP;
-       wrmsrl(MSR_IA32_MCG_STATUS, status);
-       wmb();
-
-       /* For the details see the discussion "MCE/MCA concept" on xen-devel.
-        * The thread started here:
-        * http://lists.xensource.com/archives/html/xen-devel/2007-05/msg01015.html
-        */
-
-       /* MCG_STATUS_RIPV: 
-        * When this bit is not set, then the instruction pointer onto the stack
-        * to resume at is not valid. If xen is interrupted, then we panic anyway
-        * right below. Otherwise it is up to the guest to figure out if 
-        * guest kernel or guest userland is affected and should kill either
-        * itself or the affected process.
-        */
-
-       /* MCG_STATUS_EIPV:
-        * Evaluation of EIPV is the job of the guest.
-        */
-
-       if (xen_impacted) {
-               /* Now we are going to panic anyway. Allow interrupts, so that
-                * printk on serial console can work. */
-               vcpu_schedule_unlock_irq(vcpu);
-
-               /* Uh, that means, machine check exception
-                * inside Xen occured. */
-               printk("Machine check exception occured in Xen.\n");
-
-               /* if MCG_STATUS_EIPV indicates, the IP on the stack is related
-                * to the error then it makes sense to print a stack trace.
-                * That can be useful for more detailed error analysis and/or
-                * error case studies to figure out, if we can clear
-                * xen_impacted and kill a DomU instead
-                * (i.e. if a guest only control structure is affected, but then
-                * we must ensure the bad pages are not re-used again).
-                */
-               if (status & MCG_STATUS_EIPV) {
-                       printk("MCE: Instruction Pointer is related to the error. "
-                               "Therefore, print the execution state.\n");
-                       show_execution_state(regs);
-               }
-               x86_mcinfo_dump(mc_data);
-               panic("End of MCE. Use mcelog to decode above error codes.\n");
-       }
-
-       /* If Dom0 registered a machine check handler, which is only possible
-        * with a PV MCA driver, then ... */
-       if ( guest_has_trap_callback(dom0, 0, TRAP_machine_check) ) {
-               dom_state = DOM0_TRAP;
-
-               /* ... deliver machine check trap to Dom0. */
-               send_guest_trap(dom0, 0, TRAP_machine_check);
-
-               /* Xen may tell Dom0 now to notify the DomU.
-                * But this will happen through a hypercall. */
-       } else
-               /* Dom0 did not register a machine check handler, but if DomU
-                * did so, then... */
-                if ( guest_has_trap_callback(curdom, vcpu->vcpu_id, TRAP_machine_check) ) {
-                       dom_state = DOMU_TRAP;
-
-                       /* ... deliver machine check trap to DomU */
-                       send_guest_trap(curdom, vcpu->vcpu_id, TRAP_machine_check);
-       } else {
-               /* hmm... noone feels responsible to handle the error.
-                * So, do a quick check if a DomU is impacted or not.
-                */
-               if (curdom == dom0) {
-                       /* Dom0 is impacted. Since noone can't handle
-                        * this error, panic! */
-                       x86_mcinfo_dump(mc_data);
-                       panic("MCE occured in Dom0, which it can't handle\n");
-
-                       /* UNREACHED */
-               } else {
-                       dom_state = DOMU_KILLED;
-
-                       /* Enable interrupts. This basically results in
-                        * calling sti on the *physical* cpu. But after
-                        * domain_crash() the vcpu pointer is invalid.
-                        * Therefore, we must unlock the irqs before killing
-                        * it. */
-                       vcpu_schedule_unlock_irq(vcpu);
-
-                       /* DomU is impacted. Kill it and continue. */
-                       domain_crash(curdom);
-               }
-       }
-
-
-       switch (dom_state) {
-       case DOM0_TRAP:
-       case DOMU_TRAP:
-               /* Enable interrupts. */
-               vcpu_schedule_unlock_irq(vcpu);
-
-               /* guest softirqs and event callbacks are scheduled
-                * immediately after this handler exits. */
-               break;
-       case DOMU_KILLED:
-               /* Nothing to do here. */
-               break;
-       default:
-               BUG();
-       }
-}
 
 
 /* AMD K8 machine check */
@@ -292,7 +80,7 @@
        uint32_t i;
        int cpu_nr;
 
-       machine_check_vector = k8_machine_check;
+       machine_check_vector = x86_machine_check;
        cpu_nr = smp_processor_id();
        wmb();
 
@@ -300,6 +88,17 @@
        if (value & MCG_CTL_P)  /* Control register present ? */
                wrmsrl (MSR_IA32_MCG_CTL, 0xffffffffffffffffULL);
        nr_mce_banks = value & MCG_CAP_COUNT;
+
+       /* Log the machine checks left over from the previous reset.
+          This also clears all registers */
+       for (i=0; i<nr_mce_banks; i++) {
+               u64 status;
+               rdmsrl(MSR_IA32_MC0_STATUS + i*4, status);
+               if (status & MCi_STATUS_VAL) {
+                       x86_machine_check(NULL, mce_bootlog ? -1 : -2);
+                       break;
+               }
+       }
 
        for (i = 0; i < nr_mce_banks; i++) {
                switch (i) {
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/amd_nonfatal.c
--- a/xen/arch/x86/cpu/mcheck/amd_nonfatal.c    Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/amd_nonfatal.c    Mon Dec 20 09:35:50 2008 +0900
@@ -65,117 +65,12 @@
 #include "mce.h"
 #include "x86_mca.h"
 
-static struct timer mce_timer;
+static int hw_threshold = 0;
 
-#define MCE_PERIOD MILLISECS(15000)
-#define MCE_MIN    MILLISECS(2000)
-#define MCE_MAX    MILLISECS(30000)
+extern struct timer mce_timer;
 
-static s_time_t period = MCE_PERIOD;
-static int hw_threshold = 0;
-static int adjust = 0;
-
-/* The polling service routine:
- * Collects information of correctable errors and notifies
- * Dom0 via an event.
- */
-void mce_amd_checkregs(void *info)
-{
-       struct vcpu *vcpu = current;
-       struct mc_info *mc_data;
-       struct mcinfo_global mc_global;
-       struct mcinfo_bank mc_info;
-       uint64_t status, addrv, miscv;
-       unsigned int i;
-       unsigned int event_enabled;
-       unsigned int cpu_nr;
-       int error_found;
-
-       /* We don't need a slot yet. Only allocate one on error. */
-       mc_data = NULL;
-
-       cpu_nr = smp_processor_id();
-       event_enabled = guest_enabled_event(dom0->vcpu[0], VIRQ_MCA);
-       error_found = 0;
-
-       memset(&mc_global, 0, sizeof(mc_global));
-       mc_global.common.type = MC_TYPE_GLOBAL;
-       mc_global.common.size = sizeof(mc_global);
-
-       mc_global.mc_domid = vcpu->domain->domain_id; /* impacted domain */
-       mc_global.mc_coreid = vcpu->processor; /* impacted physical cpu */
-       BUG_ON(cpu_nr != vcpu->processor);
-       mc_global.mc_core_threadid = 0;
-       mc_global.mc_vcpuid = vcpu->vcpu_id; /* impacted vcpu */
-#if 0 /* TODO: on which socket is this physical core?
-         It's not clear to me how to figure this out. */
-       mc_global.mc_socketid = ???;
-#endif
-       mc_global.mc_flags |= MC_FLAG_CORRECTABLE;
-       rdmsrl(MSR_IA32_MCG_STATUS, mc_global.mc_gstatus);
-
-       for (i = 0; i < nr_mce_banks; i++) {
-               struct domain *d;
-
-               rdmsrl(MSR_IA32_MC0_STATUS + i * 4, status);
-
-               if (!(status & MCi_STATUS_VAL))
-                       continue;
-
-               if (mc_data == NULL) {
-                       /* Now we need a slot to fill in error telemetry. */
-                       mc_data = x86_mcinfo_getptr();
-                       BUG_ON(mc_data == NULL);
-                       x86_mcinfo_clear(mc_data);
-                       x86_mcinfo_add(mc_data, &mc_global);
-               }
-
-               memset(&mc_info, 0, sizeof(mc_info));
-               mc_info.common.type = MC_TYPE_BANK;
-               mc_info.common.size = sizeof(mc_info);
-               mc_info.mc_bank = i;
-               mc_info.mc_status = status;
-
-               /* Increase polling frequency */
-               error_found = 1;
-
-               addrv = 0;
-               if (status & MCi_STATUS_ADDRV) {
-                       rdmsrl(MSR_IA32_MC0_ADDR + i * 4, addrv);
-
-                       d = maddr_get_owner(addrv);
-                       if (d != NULL)
-                               mc_info.mc_domid = d->domain_id;
-               }
-
-               miscv = 0;
-               if (status & MCi_STATUS_MISCV)
-                       rdmsrl(MSR_IA32_MC0_MISC + i * 4, miscv);
-
-               mc_info.mc_addr = addrv;
-               mc_info.mc_misc = miscv;
-               x86_mcinfo_add(mc_data, &mc_info);
-
-               if (mc_callback_bank_extended)
-                       mc_callback_bank_extended(mc_data, i, status);
-
-               /* clear status */
-               wrmsrl(MSR_IA32_MC0_STATUS + i * 4, 0x0ULL);
-               wmb();
-       }
-
-       if (error_found > 0) {
-               /* If Dom0 enabled the VIRQ_MCA event, then ... */
-               if (event_enabled)
-                       /* ... notify it. */
-                       send_guest_global_virq(dom0, VIRQ_MCA);
-               else
-                       /* ... or dump it */
-                       x86_mcinfo_dump(mc_data);
-       }
-
-       adjust += error_found;
-}
+extern s_time_t period;
+extern int adjust;
 
 /* polling service routine invoker:
  * Adjust poll frequency at runtime. No error means slow polling frequency,
@@ -186,7 +81,7 @@
  */
 static void mce_amd_work_fn(void *data)
 {
-       on_each_cpu(mce_amd_checkregs, data, 1, 1);
+       on_each_cpu(x86_mce_checkregs, data, 1, 1);
 
        if (adjust > 0) {
                if ( !guest_enabled_event(dom0->vcpu[0], VIRQ_MCA) ) {
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c     Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/mce.c     Mon Dec 20 09:35:50 2008 +0900
@@ -7,6 +7,9 @@
 #include <xen/types.h>
 #include <xen/kernel.h>
 #include <xen/config.h>
+#include <xen/sched.h>
+#include <xen/sched-if.h>
+#include <xen/paging.h>
 #include <xen/smp.h>
 #include <xen/errno.h>
 
@@ -18,6 +21,14 @@
 
 int mce_disabled = 0;
 unsigned int nr_mce_banks;
+int mce_bootlog = 1;
+
+typedef struct x86_page_offlining {
+       struct page_info *page;
+       struct list_head list;
+} x86_page_offlining_t;
+
+LIST_HEAD(page_offlining);
 
 EXPORT_SYMBOL_GPL(nr_mce_banks);       /* non-fatal.o */
 
@@ -136,6 +147,9 @@
                        intel_p5_mcheck_init(c);
                if (c->x86==6)
                        intel_p6_mcheck_init(c);
+#else
+               if (c->x86==6)
+                       intel_p4_mcheck_init(c);
 #endif
                if (c->x86==15)
                        intel_p4_mcheck_init(c);
@@ -159,9 +173,19 @@
        mce_disabled = 1;
 }
 
+/* mce=off disables machine check.
+   mce=bootlog Log MCEs from before booting. Disabled by default on AMD.
+   mce=nobootlog Don't log MCEs from before booting. */
 static void __init mcheck_enable(char *str)
 {
-       mce_disabled = -1;
+       if (*str == '=')
+               str++;
+       if (!strcmp(str, "off"))
+               mce_disabled = 1;
+       else if (!strcmp(str, "bootlog") || !strcmp(str,"nobootlog"))
+               mce_bootlog = str[0] == 'b';
+       else
+               printk("mce= argument %s ignored.", str); 
 }
 
 custom_param("nomce", mcheck_disable);
@@ -221,6 +245,12 @@
        /* This function is called from the fetch hypercall with
         * the mc_lock spinlock held. Thus, no need for locking here.
         */
+
+       /* Return NULL if no data is available. */
+       if (mc_data.fetch_idx == mc_data.error_idx) {
+               *fetch_idx = mc_data.fetch_idx;
+               return NULL;
+       }
        mi = &(x86_mcinfo_mcdata(mc_data.fetch_idx));
        if ((d != dom0) && !x86_mcinfo_matches_guest(mi, d, v)) {
                /* Bogus domU command detected. */
@@ -412,7 +442,7 @@
                if (mic == NULL)
                        return;
                if (mic->type != MC_TYPE_BANK)
-                       continue;
+                       goto next;
 
                mc_bank = (struct mcinfo_bank *)mic;
        
@@ -425,12 +455,283 @@
                        printk(" at %16"PRIx64, mc_bank->mc_addr);
 
                printk("\n");
+       next:
                mic = x86_mcinfo_next(mic); /* next entry */
                if ((mic == NULL) || (mic->size == 0))
                        break;
        } while (1);
 }
 
+static int x86_page_offlining(unsigned long maddr, struct domain *d)
+{
+       struct page_info *pg, *page;
+       x86_page_offlining_t *e;
+
+       if (!mfn_valid(maddr >> PAGE_SHIFT)) {
+               printk(XENLOG_ERR "Page offlining: ( %lx ) invalid.\n", maddr);
+               return -1;
+       }
+
+       /* convert physical address to physical page number */
+       pg = maddr_to_page(maddr);
+
+       if (pg == NULL) {
+               printk(XENLOG_ERR "Page offlining: ( %lx ) not found.\n",
+                      maddr);
+               return -1;
+       }
+
+       /* check whether a page number have been already registered or not */
+       list_for_each_entry(page, &page_offlining, list) {
+               if (page == pg)
+                       goto out;
+       }
+
+       /* check whether already having attribute 'reserved' */
+       if (pg->count_info & PGC_reserved) {
+               printk(XENLOG_DEBUG "Page offlining: ( %lx ) failure.\n",
+                      maddr);
+               return 1;
+       }
+
+       /* add attribute 'reserved' and register the page */
+       get_page(pg, d);
+       pg->count_info |= PGC_reserved;
+
+       e = xmalloc(x86_page_offlining_t);
+       BUG_ON(e == NULL);
+       e->page = pg;
+       list_add(&e->list, &page_offlining);
+
+ out:
+       printk(XENLOG_DEBUG "Page offlining: ( %lx ) success.\n", maddr);
+       return 0;
+}
+
+
+/* Machine Check Handler for AMD K8 family series and Intel P4/Xeon family */
+void x86_machine_check(struct cpu_user_regs *regs, long error_code)
+{
+       struct vcpu *vcpu = current;
+       struct domain *curdom;
+       struct mc_info *mc_data;
+       struct mcinfo_global mc_global;
+       struct mcinfo_bank mc_info;
+       uint64_t status, addrv, miscv, uc;
+       uint32_t i;
+       unsigned int cpu_nr;
+       uint32_t xen_impacted = 0;
+#define DOM_NORMAL     0
+#define DOM0_TRAP      1
+#define DOMU_TRAP      2
+#define DOMU_KILLED    4
+       uint32_t dom_state = DOM_NORMAL;
+
+       /* This handler runs as interrupt gate. So IPIs from the
+        * polling service routine are defered until we finished.
+        */
+
+        /* Disable interrupts for the _vcpu_. It may not re-scheduled to
+        * an other physical CPU or the impacted process in the guest
+        * continues running with corrupted data, otherwise. */
+        vcpu_schedule_lock_irq(vcpu);
+
+       mc_data = x86_mcinfo_getptr();
+       cpu_nr = smp_processor_id();
+       curdom = vcpu->domain;
+
+       memset(&mc_global, 0, sizeof(mc_global));
+       mc_global.common.type = MC_TYPE_GLOBAL;
+       mc_global.common.size = sizeof(mc_global);
+
+       mc_global.mc_domid = curdom->domain_id; /* impacted domain */
+       mc_global.mc_coreid = vcpu->processor; /* impacted physical cpu */
+       BUG_ON(cpu_nr != vcpu->processor);
+       mc_global.mc_core_threadid = 0;
+       mc_global.mc_vcpuid = vcpu->vcpu_id; /* impacted vcpu */
+#if 0 /* TODO: on which socket is this physical core?
+         It's not clear to me how to figure this out. */
+       mc_global.mc_socketid = ???;
+#endif
+       mc_global.mc_flags |= MC_FLAG_UNCORRECTABLE;
+       rdmsrl(MSR_IA32_MCG_STATUS, mc_global.mc_gstatus);
+
+       /* Quick check, who is impacted */
+       xen_impacted = is_idle_domain(curdom);
+
+       /* Dom0 */
+       x86_mcinfo_clear(mc_data);
+       x86_mcinfo_add(mc_data, &mc_global);
+
+       for (i = 0; i < nr_mce_banks; i++) {
+               struct domain *d;
+
+               rdmsrl(MSR_IA32_MC0_STATUS + 4 * i, status);
+
+               if (!(status & MCi_STATUS_VAL))
+                       continue;
+
+               /* An error happened in this bank.
+                * This is expected to be an uncorrectable error,
+                * since correctable errors get polled.
+                */
+               uc = status & MCi_STATUS_UC;
+
+               memset(&mc_info, 0, sizeof(mc_info));
+               mc_info.common.type = MC_TYPE_BANK;
+               mc_info.common.size = sizeof(mc_info);
+               mc_info.mc_bank = i;
+               mc_info.mc_status = status;
+
+               addrv = 0;
+               if (status & MCi_STATUS_ADDRV) {
+                       rdmsrl(MSR_IA32_MC0_ADDR + 4 * i, addrv);
+                       
+                       d = maddr_get_owner(addrv);
+                       if (d != NULL) {
+                               mc_info.mc_domid = d->domain_id;
+
+                               /* Page offlining */
+                               x86_page_offlining(addrv, d);
+                       }
+               }
+
+               miscv = 0;
+               if (status & MCi_STATUS_MISCV)
+                       rdmsrl(MSR_IA32_MC0_MISC + 4 * i, miscv);
+
+               mc_info.mc_addr = addrv;
+               mc_info.mc_misc = miscv;
+
+               x86_mcinfo_add(mc_data, &mc_info); /* Dom0 */
+
+               if (mc_callback_bank_extended)
+                       mc_callback_bank_extended(mc_data, i, status);
+
+               /* clear status */
+               wrmsrl(MSR_IA32_MC0_STATUS + 4 * i, 0x0ULL);
+               wmb();
+               add_taint(TAINT_MACHINE_CHECK);
+       }
+
+       /* Never take any final action for errors logged over from the
+        * previous reset (i.e. when called without register state). */
+       if (!regs) {
+               vcpu_schedule_unlock_irq(vcpu);
+               return;
+       }
+
+       status = mc_global.mc_gstatus;
+
+       /* clear MCIP or cpu enters shutdown state
+        * in case another MCE occurs. */
+       status &= ~MCG_STATUS_MCIP;
+       wrmsrl(MSR_IA32_MCG_STATUS, status);
+       wmb();
+
+       /* For the details see the discussion "MCE/MCA concept" on xen-devel.
+        * The thread started here:
+        * http://lists.xensource.com/archives/html/xen-devel/2007-05/msg01015.html
+        */
+
+       /* MCG_STATUS_RIPV:
+        * When this bit is not set, the instruction pointer pushed onto
+        * the stack is not a valid address to resume at. If Xen itself is
+        * interrupted, we panic anyway right below. Otherwise it is up to
+        * the guest to figure out whether guest kernel or guest userland
+        * is affected and to kill either itself or the affected process.
+        */
+
+       /* MCG_STATUS_EIPV:
+        * Evaluation of EIPV is the job of the guest.
+        */
+
+       if (xen_impacted) {
+               /* Now we are going to panic anyway. Allow interrupts, so that
+                * printk on serial console can work. */
+               vcpu_schedule_unlock_irq(vcpu);
+
+               /* Uh, that means a machine check exception
+                * occurred inside Xen. */
+               printk("Machine check exception occurred in Xen.\n");
+
+               /* If MCG_STATUS_EIPV indicates that the IP on the stack is
+                * related to the error, then it makes sense to print a stack
+                * trace. That can be useful for more detailed error analysis
+                * and/or error case studies to figure out whether we can clear
+                * xen_impacted and kill a DomU instead (i.e. if only a guest
+                * control structure is affected; but then we must ensure the
+                * bad pages are not re-used again).
+                */
+               if (status & MCG_STATUS_EIPV) {
+                       printk("MCE: Instruction Pointer is related to the error. "
+                               "Therefore, print the execution state.\n");
+                       show_execution_state(regs);
+               }
+               x86_mcinfo_dump(mc_data);
+               panic("End of MCE. Use mcelog to decode above error codes.\n");
+       }
+
+       /* If Dom0 registered a machine check handler, which is only possible
+        * with a PV MCA driver, then ... */
+       if ( guest_has_trap_callback(dom0, 0, TRAP_machine_check) ) {
+               dom_state = DOM0_TRAP;
+
+               /* ... deliver machine check trap to Dom0. */
+               send_guest_trap(dom0, 0, TRAP_machine_check);
+
+               /* Xen may tell Dom0 now to notify the DomU.
+                * But this will happen through a hypercall. */
+       } else
+               /* Dom0 did not register a machine check handler, but if DomU
+                * did so, then... */
+               if ( guest_has_trap_callback(curdom, vcpu->vcpu_id, TRAP_machine_check) ) {
+                       dom_state = DOMU_TRAP;
+
+                       /* ... deliver machine check trap to DomU */
+                       send_guest_trap(curdom, vcpu->vcpu_id, TRAP_machine_check);
+       } else {
+               /* Hmm... no one feels responsible for handling the error.
+                * So, do a quick check whether a DomU is impacted or not.
+                */
+               if (curdom == dom0) {
+                       /* Dom0 is impacted. Since no one can handle
+                        * this error, panic! */
+                       x86_mcinfo_dump(mc_data);
+                       panic("MCE occurred in Dom0, which cannot handle it\n");
+
+                       /* UNREACHED */
+               } else {
+                       dom_state = DOMU_KILLED;
+
+                       /* Enable interrupts. This basically results in
+                        * calling sti on the *physical* cpu. But after
+                        * domain_crash() the vcpu pointer is invalid.
+                        * Therefore, we must unlock the irqs before killing
+                        * it. */
+                       vcpu_schedule_unlock_irq(vcpu);
+
+                       /* DomU is impacted. Kill it and continue. */
+                       domain_crash(curdom);
+               }
+       }
+
+
+       switch (dom_state) {
+       case DOM0_TRAP:
+       case DOMU_TRAP:
+               /* Enable interrupts. */
+               vcpu_schedule_unlock_irq(vcpu);
+
+               /* guest softirqs and event callbacks are scheduled
+                * immediately after this handler exits. */
+               break;
+       case DOMU_KILLED:
+               /* Nothing to do here. */
+               break;
+       default:
+               BUG();
+       }
+}
 
 
 /* Machine Check Architecture Hypercall */
@@ -564,7 +865,7 @@
                if ( copy_to_guest(u_xen_mc, op, 1) )
                        ret = -EFAULT;
 
-               if (ret == 0) {
+               if (ret == 0 && mc_notifydomain->flags == XEN_MC_OK) {
                        x86_mcinfo_marknotified(mc_notifydomain);
                }
 
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/non-fatal.c
--- a/xen/arch/x86/cpu/mcheck/non-fatal.c       Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/non-fatal.c       Mon Dec 20 09:35:50 2008 +0900
@@ -14,16 +14,158 @@
 #include <xen/smp.h>
 #include <xen/timer.h>
 #include <xen/errno.h>
+#include <xen/event.h>
 #include <asm/processor.h> 
 #include <asm/system.h>
 #include <asm/msr.h>
 
 #include "mce.h"
+#include "x86_mca.h"
 
 static int firstbank;
-static struct timer mce_timer;
 
-#define MCE_PERIOD MILLISECS(15000)
+struct timer mce_timer;
+
+s_time_t period = MCE_PERIOD;
+int adjust = 0;
+
+/* The polling service routine:
+ * Collects information of correctable errors and notifies
+ * Dom0 via an event.
+ */
+void x86_mce_checkregs(void *info)
+{
+       struct vcpu *vcpu = current;
+       struct mc_info *mc_data;
+       struct mcinfo_global mc_global;
+       struct mcinfo_bank mc_info;
+       uint64_t status, addrv, miscv;
+       unsigned int i;
+       unsigned int event_enabled;
+       unsigned int cpu_nr;
+       int error_found;
+
+       /* We don't need a slot yet. Only allocate one on error. */
+       mc_data = NULL;
+
+       cpu_nr = smp_processor_id();
+       event_enabled = guest_enabled_event(dom0->vcpu[0], VIRQ_MCA);
+       error_found = 0;
+
+       memset(&mc_global, 0, sizeof(mc_global));
+       mc_global.common.type = MC_TYPE_GLOBAL;
+       mc_global.common.size = sizeof(mc_global);
+
+       mc_global.mc_domid = vcpu->domain->domain_id; /* impacted domain */
+       mc_global.mc_coreid = vcpu->processor; /* impacted physical cpu */
+       BUG_ON(cpu_nr != vcpu->processor);
+       mc_global.mc_core_threadid = 0;
+       mc_global.mc_vcpuid = vcpu->vcpu_id; /* impacted vcpu */
+#if 0 /* TODO: on which socket is this physical core?
+         It's not clear to me how to figure this out. */
+       mc_global.mc_socketid = ???;
+#endif
+       mc_global.mc_flags |= MC_FLAG_CORRECTABLE;
+       rdmsrl(MSR_IA32_MCG_STATUS, mc_global.mc_gstatus);
+
+       for (i = 0; i < nr_mce_banks; i++) {
+               struct domain *d;
+
+               rdmsrl(MSR_IA32_MC0_STATUS + i * 4, status);
+
+               if (!(status & MCi_STATUS_VAL))
+                       continue;
+
+               if (mc_data == NULL) {
+                       /* Now we need a slot to fill in error telemetry. */
+                       mc_data = x86_mcinfo_getptr();
+                       BUG_ON(mc_data == NULL);
+                       x86_mcinfo_clear(mc_data);
+                       x86_mcinfo_add(mc_data, &mc_global);
+               }
+
+               memset(&mc_info, 0, sizeof(mc_info));
+               mc_info.common.type = MC_TYPE_BANK;
+               mc_info.common.size = sizeof(mc_info);
+               mc_info.mc_bank = i;
+               mc_info.mc_status = status;
+
+               /* Increase polling frequency */
+               error_found = 1;
+
+               addrv = 0;
+               if (status & MCi_STATUS_ADDRV) {
+                       rdmsrl(MSR_IA32_MC0_ADDR + i * 4, addrv);
+
+                       d = maddr_get_owner(addrv);
+                       if (d != NULL)
+                               mc_info.mc_domid = d->domain_id;
+               }
+
+               miscv = 0;
+               if (status & MCi_STATUS_MISCV)
+                       rdmsrl(MSR_IA32_MC0_MISC + i * 4, miscv);
+
+               mc_info.mc_addr = addrv;
+               mc_info.mc_misc = miscv;
+               x86_mcinfo_add(mc_data, &mc_info);
+
+               if (mc_callback_bank_extended)
+                       mc_callback_bank_extended(mc_data, i, status);
+
+               /* clear status */
+               wrmsrl(MSR_IA32_MC0_STATUS + i * 4, 0x0ULL);
+               wmb();
+       }
+
+       if (error_found > 0) {
+               /* If Dom0 enabled the VIRQ_MCA event, then ... */
+               if (event_enabled)
+                       /* ... notify it. */
+                       send_guest_global_virq(dom0, VIRQ_MCA);
+               else
+                       /* ... or dump it */
+                       x86_mcinfo_dump(mc_data);
+       }
+
+       adjust += error_found;
+}
+
+static void p4_mce_work_fn(void *data)
+{ 
+       on_each_cpu(x86_mce_checkregs, NULL, 1, 1);
+
+       if (adjust > 0) {
+               if ( !guest_enabled_event(dom0->vcpu[0], VIRQ_MCA) ) {
+                       /* Dom0 did not enable VIRQ_MCA, so Xen is reporting. */
+                       printk("MCE: polling routine found correctable error. "
+                               "Use mcelog to parse above error output.\n");
+               }
+       }
+
+       if (adjust > 0) {
+               /* Increase polling frequency */
+               adjust++; /* adjust == 1 must have an effect */
+               period /= adjust;
+       } else {
+               /* Decrease polling frequency */
+               period *= 2;
+       }
+       if (period > MCE_MAX) {
+               /* limit: Poll at least every 30s */
+               period = MCE_MAX;
+       }
+       if (period < MCE_MIN) {
+               /* limit: Poll every 2s.
+                * When this is reached an uncorrectable error
+                * is expected to happen, if Dom0 does nothing.
+                */
+               period = MCE_MIN;
+       }
+
+       set_timer(&mce_timer, NOW() + period);
+       adjust = 0;
+}
 
 static void mce_checkregs (void *info)
 {
@@ -85,6 +227,15 @@
                break;
 
        case X86_VENDOR_INTEL:
+               if (c->x86 == 15        /* P4/Xeon */
+#ifdef CONFIG_X86_64
+                   || c->x86 == 6
+#endif
+                   ) {
+                       init_timer(&mce_timer, p4_mce_work_fn, NULL, 0);
+                       set_timer(&mce_timer, NOW() + period);
+                       break;
+               }
                init_timer(&mce_timer, mce_work_fn, NULL, 0);
                set_timer(&mce_timer, NOW() + MCE_PERIOD);
                break;
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/p4.c
--- a/xen/arch/x86/cpu/mcheck/p4.c      Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/p4.c      Mon Dec 20 09:35:50 2008 +0900
@@ -15,6 +15,7 @@
 #include <asm/apic.h>
 
 #include "mce.h"
+#include "x86_mca.h"
 
 /* as supported by the P4/Xeon family */
 struct intel_mce_extended_msrs {
@@ -32,6 +33,7 @@
 };
 
 static int mce_num_extended_msrs = 0;
+extern int mce_bootlog;
 
 
 #ifdef CONFIG_X86_MCE_P4THERMAL
@@ -158,85 +160,13 @@
        return mce_num_extended_msrs;
 }
 
-static fastcall void intel_machine_check(struct cpu_user_regs * regs, long error_code)
-{
-       int recover=1;
-       u32 alow, ahigh, high, low;
-       u32 mcgstl, mcgsth;
-       int i;
-       struct intel_mce_extended_msrs dbg;
-
-       rdmsr (MSR_IA32_MCG_STATUS, mcgstl, mcgsth);
-       if (mcgstl & (1<<0))    /* Recoverable ? */
-               recover=0;
-
-       printk (KERN_EMERG "CPU %d: Machine Check Exception: %08x%08x\n",
-               smp_processor_id(), mcgsth, mcgstl);
-
-       if (intel_get_extended_msrs(&dbg)) {
-               printk (KERN_DEBUG "CPU %d: EIP: %08x EFLAGS: %08x\n",
-                       smp_processor_id(), dbg.eip, dbg.eflags);
-               printk (KERN_DEBUG "\teax: %08x ebx: %08x ecx: %08x edx: %08x\n",
-                       dbg.eax, dbg.ebx, dbg.ecx, dbg.edx);
-               printk (KERN_DEBUG "\tesi: %08x edi: %08x ebp: %08x esp: %08x\n",
-                       dbg.esi, dbg.edi, dbg.ebp, dbg.esp);
-       }
-
-       for (i=0; i<nr_mce_banks; i++) {
-               rdmsr (MSR_IA32_MC0_STATUS+i*4,low, high);
-               if (high & (1<<31)) {
-                       if (high & (1<<29))
-                               recover |= 1;
-                       if (high & (1<<25))
-                               recover |= 2;
-                       printk (KERN_EMERG "Bank %d: %08x%08x", i, high, low);
-                       high &= ~(1<<31);
-                       if (high & (1<<27)) {
-                               rdmsr (MSR_IA32_MC0_MISC+i*4, alow, ahigh);
-                               printk ("[%08x%08x]", ahigh, alow);
-                       }
-                       if (high & (1<<26)) {
-                               rdmsr (MSR_IA32_MC0_ADDR+i*4, alow, ahigh);
-                               printk (" at %08x%08x", ahigh, alow);
-                       }
-                       printk ("\n");
-               }
-       }
-
-       if (recover & 2)
-               panic ("CPU context corrupt");
-       if (recover & 1)
-               panic ("Unable to continue");
-
-       printk(KERN_EMERG "Attempting to continue.\n");
-       /* 
-        * Do not clear the MSR_IA32_MCi_STATUS if the error is not 
-        * recoverable/continuable.This will allow BIOS to look at the MSRs
-        * for errors if the OS could not log the error.
-        */
-       for (i=0; i<nr_mce_banks; i++) {
-               u32 msr;
-               msr = MSR_IA32_MC0_STATUS+i*4;
-               rdmsr (msr, low, high);
-               if (high&(1<<31)) {
-                       /* Clear it */
-                       wrmsr(msr, 0UL, 0UL);
-                       /* Serialize */
-                       wmb();
-                       add_taint(TAINT_MACHINE_CHECK);
-               }
-       }
-       mcgstl &= ~(1<<2);
-       wrmsr (MSR_IA32_MCG_STATUS,mcgstl, mcgsth);
-}
-
 
 void intel_p4_mcheck_init(struct cpuinfo_x86 *c)
 {
        u32 l, h;
        int i;
        
-       machine_check_vector = intel_machine_check;
+       machine_check_vector = x86_machine_check;
        wmb();
 
        printk (KERN_INFO "Intel machine check architecture supported.\n");
@@ -244,6 +174,17 @@
        if (l & (1<<8)) /* Control register present ? */
                wrmsr (MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
        nr_mce_banks = l & 0xff;
+
+       /* Log the machine checks left over from the previous reset.
+          This also clears all registers */
+       for (i=0; i<nr_mce_banks; i++) {
+               u64 status;
+               rdmsrl(MSR_IA32_MC0_STATUS + i*4, status);
+               if (status & MCi_STATUS_VAL) {
+                       x86_machine_check(NULL, mce_bootlog ? -1 : -2);
+                       break;
+               }
+       }
 
        for (i=0; i<nr_mce_banks; i++) {
                wrmsr (MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
diff -r 6595393a3d28 xen/arch/x86/cpu/mcheck/x86_mca.h
--- a/xen/arch/x86/cpu/mcheck/x86_mca.h Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/cpu/mcheck/x86_mca.h Mon Dec 20 09:35:50 2008 +0900
@@ -70,3 +70,11 @@
 /* reserved bits */
 #define MCi_STATUS_OTHER_RESERVED2      0x0180000000000000ULL
 
+/* Polling period */
+#define MCE_PERIOD MILLISECS(15000)
+#define MCE_MIN    MILLISECS(2000)
+#define MCE_MAX    MILLISECS(30000)
+
+/* Common routines */
+void x86_machine_check(struct cpu_user_regs *regs, long error_code);
+void x86_mce_checkregs(void *info);
diff -r 6595393a3d28 xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c      Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/arch/x86/traps.c      Mon Dec 20 09:35:50 2008 +0900
@@ -726,8 +726,10 @@
         if ( !opt_allow_hugepage )
             __clear_bit(X86_FEATURE_PSE, &d);
         __clear_bit(X86_FEATURE_PGE, &d);
+#ifndef __x86_64__
         __clear_bit(X86_FEATURE_MCE, &d);
         __clear_bit(X86_FEATURE_MCA, &d);
+#endif
         __clear_bit(X86_FEATURE_PSE36, &d);
     }
     switch ( (uint32_t)regs->eax )
diff -r 6595393a3d28 xen/common/page_alloc.c
--- a/xen/common/page_alloc.c   Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/common/page_alloc.c   Mon Dec 20 09:35:50 2008 +0900
@@ -338,8 +338,14 @@
 
             /* Find smallest order which can satisfy the request. */
             for ( j = order; j <= MAX_ORDER; j++ )
-                if ( !list_empty(&heap(node, zone, j)) )
-                    goto found;
+                if ( !list_empty(&heap(node, zone, j)) ) {
+                    pg = list_entry(heap(node, zone, j).next, struct page_info, list);
+                    if (!(pg->count_info & PGC_reserved))
+                        goto found;
+                    else
+                        printk(XENLOG_DEBUG "Page %p(%lx) is not to be allocated.\n",
+                               pg, page_to_maddr(pg));
+                }
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
         /* Pick next node, wrapping around if needed. */
@@ -402,11 +408,22 @@
     unsigned long mask;
     unsigned int i, node = phys_to_nid(page_to_maddr(pg));
     struct domain *d;
+    int reserved = 0;
 
     ASSERT(zone < NR_ZONES);
     ASSERT(order <= MAX_ORDER);
     ASSERT(node >= 0);
     ASSERT(node < num_online_nodes());
+
+    for ( i = 0; i < (1 << order); i++) {
+        reserved += !!(pg[i].count_info & PGC_reserved);
+        if (!!(pg[i].count_info & PGC_reserved))
+            printk(XENLOG_DEBUG "Page %p(%lx) is not to be freed\n",
+                   &pg[i], page_to_maddr(&pg[i]));
+    }
+
+    if (reserved)
+        return;
 
     for ( i = 0; i < (1 << order); i++ )
     {
diff -r 6595393a3d28 xen/include/asm-x86/mm.h
--- a/xen/include/asm-x86/mm.h  Tue Dec 09 16:28:02 2008 +0000
+++ b/xen/include/asm-x86/mm.h  Mon Dec 20 09:35:50 2008 +0900
@@ -142,8 +142,11 @@
  /* 3-bit PAT/PCD/PWT cache-attribute hint. */
 #define PGC_cacheattr_base  26
 #define PGC_cacheattr_mask  (7U<<PGC_cacheattr_base)
- /* 26-bit count of references to this frame. */
-#define PGC_count_mask      ((1U<<26)-1)
+ /* Set for special pages, which can never be used */
+#define _PGC_reserved      25
+#define PGC_reserved       (1U<<_PGC_reserved)
+ /* 25-bit count of references to this frame. */
+#define PGC_count_mask      ((1U<<25)-1)
 
 #define is_xen_heap_page(page) is_xen_heap_mfn(page_to_mfn(page))
 #define is_xen_heap_mfn(mfn) ({                         \
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel