This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/


Re: [Xen-devel] Re: [PATCH 1/4] CPU online/offline support in Xen

To: Christoph Egger <Christoph.Egger@xxxxxxx>
Subject: Re: [Xen-devel] Re: [PATCH 1/4] CPU online/offline support in Xen
From: Gavin Maltby <Gavin.Maltby@xxxxxxx>
Date: Wed, 17 Sep 2008 14:17:02 +1000
Cc: Haitao Shan <maillists.shan@xxxxxxxxx>, "Tian, Kevin" <kevin.tian@xxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxx, "Shan, Haitao" <haitao.shan@xxxxxxxxx>, Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
In-reply-to: <200809111623.11316.Christoph.Egger@xxxxxxx>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
References: <C4EEE682.2707B%keir.fraser@xxxxxxxxxxxxx> <200809111623.11316.Christoph.Egger@xxxxxxx>
Christoph Egger wrote:
> On Thursday 11 September 2008 16:15:14 Keir Fraser wrote:
>> I applied the patch with the following changes:
>>  * I rewrote your changes to fixup_irqs(). We should force lazy EOIs
>> *after* we have serviced any straggling interrupts. Also we should actually
>> clear the EOI stack so it is empty next time the CPU comes online.
>>  * I simplified your changes to schedule.c in light of the fact we run in
>> stop_machine context. Hence we can be quite relaxed about locking, for
>>  * I removed your change to __csched_vcpu_is_migrateable() and instead put
>> a similar check in csched_load_balance(). I think this is clearer and also
>>
>> I note that the VCPU currently running on the offlined CPU continues to run
>> there even after __cpu_disable(), and until that CPU does a final run
>> through the scheduler soon after. I hope it does not matter there is one
>> vcpu with v->processor == offlined_cpu for a short while
>
> This is not acceptable with regard to machine check. When Dom0 offlines a
> defective CPU, nothing may continue to run on it, or silent data corruption
> may occur.

I don't see this as a problem for machine check correctness.

If dom0 asks to offline a cpu (because it believes the cpu is busted and
a threat to uptime), that decision is fundamentally asynchronous
to the actual error handling that occurred at machine check exception

 - running in whatever context
 - MCE occurs
 - trap to hypervisor MCE handler
        . this decides on hypervisor panic, or other appropriate
          immediate (in handler) response
        . telemetry forwarded to dom0 for logging and analysis
 - assume no hypervisor panic
 - eons pass during which any unconstrained bad data remaining
   after initial handling may go anywhere
 - dom0 gets telemetry and let's say diagnoses a fault and
   decides to call back into the hypervisor to offline the
   offending cpu
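
The sequence above can be modelled as a toy simulation. This is only a sketch to illustrate the timing argument, not Xen code: the class and function names (Cpu, Hypervisor, mce_handler, offline_cpu, simulate) and the instruction counts are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Response(Enum):
    PANIC = auto()      # handler decides the hypervisor must panic
    CONTINUE = auto()   # handler contains the error and carries on


@dataclass
class Cpu:
    cpu_id: int
    online: bool = True
    instructions_since_mce: int = 0

    def run(self, count: int) -> None:
        # An offlined CPU executes nothing further.
        if self.online:
            self.instructions_since_mce += count


@dataclass
class Hypervisor:
    # Telemetry queued for dom0 (hypothetical: just the CPU id).
    telemetry: list = field(default_factory=list)

    def mce_handler(self, cpu: Cpu, fatal: bool) -> Response:
        # Immediate, in-handler response: panic or continue.
        if fatal:
            return Response.PANIC
        cpu.instructions_since_mce = 0
        self.telemetry.append(cpu.cpu_id)
        return Response.CONTINUE

    def offline_cpu(self, cpu: Cpu, offline_delay: int) -> None:
        # The CPU keeps running briefly until its final scheduler pass.
        cpu.run(offline_delay)
        cpu.online = False


def simulate(async_gap: int, offline_delay: int):
    """Return (instructions run before the offline request, total run)."""
    cpu = Cpu(cpu_id=1)
    hv = Hypervisor()

    # MCE occurs; assume no hypervisor panic.
    if hv.mce_handler(cpu, fatal=False) is Response.PANIC:
        raise SystemExit("hypervisor panic")

    # "Eons pass" while dom0 receives and analyses the telemetry.
    cpu.run(async_gap)
    before_request = cpu.instructions_since_mce

    # dom0 diagnoses a fault and calls back to offline the CPU.
    hv.offline_cpu(cpu, offline_delay)

    # Nothing more runs once the CPU is offline.
    cpu.run(1000)
    return before_request, cpu.instructions_since_mce
```

With an assumed gap of a million instructions and an offline delay of a hundred, simulate(1_000_000, 100) shows the delay adding only a small fraction to what has already run on the suspect CPU since the exception.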

Note the "eons pass" bit;  tonnes of instructions may run on the
bad cpu in this time, and a few more for some offline delay won't

