On Wednesday 06 June 2007 12:35:15 Gavin Maltby wrote:
> Hi,
>
> On 06/06/07 10:28, Christoph Egger wrote:
> > On Monday 04 June 2007 18:16:56 Gavin Maltby wrote:
> >> Hi,
> >>
> >> On 05/30/07 10:10, Christoph Egger wrote:
> >>> On Wednesday 30 May 2007 10:49:40 Jan Beulich wrote:
> >>>>>>> "Christoph Egger" <Christoph.Egger@xxxxxxx> 30.05.07 09:45 >>>
> >>>>>
> >>>>> On Wednesday 30 May 2007 09:19:12 Jan Beulich wrote:
> >>>>>>> case I) - Xen receives an MCE from the CPU
> >>>>>>>
> >>>>>>> 1) Xen MCE handler figures out whether the error is a correctable
> >>>>>>> error (CE) or an uncorrectable error (UE)
> >>>>>>> 2a) error == CE:
> >>>>>>> Xen notifies Dom0 if Dom0 has installed an MCA event handler,
> >>>>>>> for statistical purposes
> >>
> >> [rest cut]
> >>
> >> For the hypervisor to dom0 communication that 2a) above refers to I
> >> think we need to agree on two aspects: what form the notification event
> >> will take, and what error telemetry data and additional information will
> >> be provided by the hypervisor for dom0 to chew on for statistical and
> >> diagnosis purposes.
> >
> > Additionally, the hypervisor must be able to notify a domU that has
> > a PV MCA driver.
>
> Yes, forgot that; although I guess I view that most likely as a future
> phase.
Yes, but ignoring it now can lead to a design that is bad for DomU and,
in the worst case, requires a redesign.
> >> For the first I've assumed so far that an event channel notification
> >> of the MCA event will suffice; as long as the hypervisor only polls
> >> for correctable MCA errors at a low-frequency rate (currently 15s
> >> interval) there is no danger of spamming that single notification.
> >
> > Why polling?
>
> Polling for correctable errors, but #MC as usual for others. Setting
> MCi_CTL bits for correctable errors does not produce a machine check,
> so polling is the only approach unless one sets additional (and
> undocumented, certainly for AMD chips) config bits. What I was getting
> at here is that polling at largish intervals for correctables is
> the correct approach - trapping for them or polling at a high frequency
> is bad because in cases where you have some form of solid correctable
> error (say a single bad pin in a dimm socket affecting one or two ranks
> of that dimm but never able to produce a UE) the trap handling and
> diagnosis software consume the machine and things make little useful
> forward progress.
I still don't see why #MC for all kinds of errors is bad.
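Just so we are talking about the same thing, this is roughly how I picture the
low-frequency correctable-error poll (only a sketch: mca_rdmsr()/mca_wrmsr()
and mca_queue_telemetry() are placeholders, not existing Xen code; only the
MSR layout and status bits are the architectural ones):

#define MSR_MCG_CAP      0x179
#define MSR_MC0_STATUS   0x401           /* bank i: MCi_STATUS at 0x401 + 4*i */
#define MCi_STATUS_VAL   (1ULL << 63)    /* a valid error is logged */
#define MCi_STATUS_UC    (1ULL << 61)    /* error is uncorrected */

static void mce_poll_correctable(void)
{
    unsigned int i, nr_banks = mca_rdmsr(MSR_MCG_CAP) & 0xff;

    for ( i = 0; i < nr_banks; i++ )
    {
        uint64_t status = mca_rdmsr(MSR_MC0_STATUS + 4 * i);

        if ( !(status & MCi_STATUS_VAL) )
            continue;                    /* nothing logged in this bank */
        if ( status & MCi_STATUS_UC )
            continue;                    /* UEs are taken via #MC, not here */

        /* Collect MCi_STATUS/ADDR/MISC plus bank/chip/core for dom0. */
        mca_queue_telemetry(i, status,
                            mca_rdmsr(MSR_MC0_STATUS + 4 * i + 1),   /* ADDR */
                            mca_rdmsr(MSR_MC0_STATUS + 4 * i + 2));  /* MISC */

        mca_wrmsr(MSR_MC0_STATUS + 4 * i, 0);   /* rearm the bank */
    }

    /* Notify dom0 (and later a domU with a PV MCA driver) here. */
}

Run something like this from the 15s timer and the trap path is only needed
for UEs - is that the idea?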
> >> On receipt of the notification the event handler will need to suck
> >> some event data out of somewhere - uncertain which somewhere would
> >> be best?
> >>
> >> We should standardize both the format and the content of this event
> >> data. The following is just to get the conversation started in this
> >> area.
> >>
> >> Content first. Obviously we need the raw MCA register content -
> >> MCi_STATUS, MCi_ADDR, MCi_MISC. We also need to know which
> >> MCA detector bank made the observation, so we need to include
> >> some indication of which chip (where I use "chip" to coincide
> >> with "socket"), core on that chip, and MCA bank number
> >> the telemetry came from. I think I am correct in saying that
> >> hyperthreaded CPUs do not have any MCA banks per-thread, but we
> >> may want to allow for that future possibility (I know, for instance,
> >> that some SPARC cpus have error state for each hardware thread).
> >
> > And we need the domain and the domain's vcpu to identify
> > who is impacted.
>
> Yes, the domain ID. I'm not sure we need the vcpu id if we instead
> present some physical identifiers such as chip, core number etc
> (and have the namespaces well-defined). If we don't present those,
> we'd need the vcpu in the payload and some external method to resolve
> it to physical components. Since errors correlate to physical components
> it would, I think, be nicer to report detector info in some physical sense.
The vcpu is more interesting for the domU than for dom0.
See below.
> As regards a vcpu to physical translation, I didn't think there was any
> fixed mapping (or certainly not any mapping that a dom0 should interpret
> and rely on). For example, if we have two physical cores but choose
> to present 32 vcpus to a domain, I don't believe there is anything to
> say that vcpus 0-15 always run on physical core 0.
>
> >> We should also allow for additional model-specific error telemetry
> >> that may be available and relevant - I know that will be necessary
> >> for some upcoming x86 cpu models. We should probably avoid adding
> >> "cooked" content to this error event payload - such cooking of the
> >> raw data is much more easily performed in dom0 (the example I'm
> >> thinking of here is physical address to memory location translation).
> >>
> >> In terms of the form of the error event data, the simplest but also
> >> the dumbest would be a binary structure passed from hypervisor
> >> to dom0:
> >
> > struct mca_error_data_ver1 {
> >     uint8_t  version;    /* structure version */
> >     uint64_t mc_status;  /* raw MCi_STATUS */
> >     uint64_t mc_addr;    /* raw MCi_ADDR */
> >     uint64_t mc_misc;    /* raw MCi_MISC */
> >     uint16_t mc_chip;    /* detecting socket */
> >     uint16_t mc_core;    /* detecting core on that socket */
> >     uint16_t mc_bank;    /* MCA bank number */
> >     uint16_t domid;      /* impacted domain */
> >     uint16_t vcpu_id;    /* impacted vcpu within that domain */
> >     ...
> > };
> >
> >> That is easily passed around and can be extended by versioning.
> >> A more self-describing and naturally extensible approach would be
> >> to parcel the error data in some form of name-type-value list.
> >> That's what we do in the corresponding kernel->userland error
> >> code in Solaris; the downside is that the supporting libnvpair
> >> library is not tiny and likely not the sort of footprint to
> >> include in a hypervisor. Perhaps some cut-down form would do.
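A cut-down form would not have to be much more than a flat array of tagged
values; something like this (purely illustrative, all of the names below are
made up):

/* Minimal self-describing record that dom0 can walk without anything
 * like libnvpair.  Illustrative only. */
enum mca_tag {
    MCA_TAG_STATUS, MCA_TAG_ADDR, MCA_TAG_MISC,
    MCA_TAG_BANK, MCA_TAG_CHIP, MCA_TAG_CORE,
    MCA_TAG_DOMID, MCA_TAG_VCPU
    /* model-specific tags can be appended without breaking old consumers */
};

struct mca_nv_entry {
    uint16_t tag;        /* enum mca_tag */
    uint16_t type;       /* e.g. 1 = uint64, 2 = uint16, ... */
    uint64_t value;      /* widened to the largest type we carry */
};

struct mca_event {
    uint8_t  version;
    uint8_t  nr_entries;
    struct mca_nv_entry entries[];   /* nr_entries records follow */
};

Unknown tags would simply be skipped by older dom0 code, which gives most of
the extensibility of an nvlist at a fraction of the footprint.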
> >
> > The public xen.h header defines a VIRQ_DOM_EXC, which seems to be
> > appropriate for an NMI event.
> > There are two functions to send VIRQs: send_guest_vcpu_virq() and
> > send_guest_global_virq().
> >
> > However, VIRQ_DOM_EXC is not implemented the way we need it:
> > all virtual interrupts are maskable. We definitely need an event
> > that is guaranteed to interrupt the guest immediately, no matter
> > whether it is Dom0 or DomU and whatever it is doing.
> >
> > And VIRQ_DOM_EXC is explicitly reserved for Dom0. Maybe
> > we should introduce a VIRQ_MCA as a special NMI event for both Dom0 and
> > DomU?
>
> Sounds like it may be necessary. I don't know this mechanism very well
> so I'll go and do some reading (after a big long unrelated code review).
After some code reading I found nmi_pending, nmi_masked and nmi_addr in
struct vcpu in xen/include/xen/sched.h. xen/include/xen/nmi.h is also of
interest. The implementation is in xen/common/kernel.c.
Only one callback per vcpu is allowed, and only Dom0 can register an NMI
callback. So the guest's NMI handler must multiplex several NMI handlers -
at least for Dom0 (MCA + watchdog timer). It's fine with me to allow DomUs
to register only the MCA NMI.
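The multiplexing itself is not much code on the guest side; roughly this
(the sub-handler names are placeholders, not existing code):

/* Sketch: one NMI entry point per guest, dispatching to whichever
 * sub-handler claims the event. */
extern int mca_nmi_handler(void);        /* PV MCA driver */
extern int watchdog_nmi_handler(void);   /* Dom0 only */

static int (*nmi_subhandlers[])(void) = {
    mca_nmi_handler,
    watchdog_nmi_handler,
};

void guest_nmi_entry(void)
{
    unsigned int i;

    for ( i = 0; i < ARRAY_SIZE(nmi_subhandlers); i++ )
        if ( nmi_subhandlers[i] && nmi_subhandlers[i]() )
            return;              /* handled */

    /* Unclaimed NMI - log it and carry on. */
}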
To inform a domU (one that has a PV MCA driver), it must be able to register
an NMI callback as well. To allow this, struct vcpu_info in the public xen.h
header also needs nmi_pending and nmi_addr.
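Concretely, I am thinking of something along these lines (just a sketch to
have something to discuss; field types and placement are open):

/* Proposed addition to struct vcpu_info in the public xen.h -
 * sketch only, mirroring the nmi_* fields in struct vcpu. */
struct vcpu_info {
    /* ... existing fields (evtchn_upcall_pending etc.) ... */
    uint8_t       nmi_pending;   /* an NMI is pending for this vcpu */
    unsigned long nmi_addr;      /* guest NMI callback entry point */
};

A domU with a PV MCA driver would register its handler by filling in nmi_addr
(or via a small hypercall), and the hypervisor would set nmi_pending before
injecting the event.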
Keir: How do you feel about all this? Is this the right way or do you see
things that should be done in a different way?
Christoph
--
AMD Saxony, Dresden, Germany
Operating System Research Center