> -----Original Message-----
> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> Christoph Egger
> Sent: 01 June 2007 09:12
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Cc: Gavin Maltby
> Subject: Re: [Xen-devel] RFC: MCA/MCE concept
>
> On Wednesday 30 May 2007 17:03:55 Petersson, Mats wrote:
> > [snip]
> >
> > > My feeling is that the hypervisor and dom0 own the hardware and as
> > > such all hardware fault management should reside there. So we should
> > > never deliver any form of #MC to a domU, nor should a poll of MCA
> > > state from a domU ever observe valid state (e.g., make the RDMSR
> > > return 0). So all handling, logging and diagnosis as well as
> > > hardware response actions (such as to deploy an online spare
> > > chip-select) are controlled in the hypervisor/dom0 combination. That
> > > seems a consistent model - e.g., if a domU is migrated to another
> > > system it should not carry the diagnosis state of the original
> > > system across etc., since that belongs with the one domain that
> > > cannot migrate.
> >
> > I agree entirely with this.
> >
> > > But that is not to say that (I think at a future phase) domU should
> > > not participate in a higher-level fault management function, at the
> > > direction of the hypervisor/dom0 combo. For example, if/when we can
> > > isolate an uncorrectable error to a single domU, we could forward
> > > such an event to the affected domU if it has registered its
> > > ability/interest in such events. These won't be in the form of a
> > > faked #MC or anything; instead they'd be some form of synchronous
> > > trap experienced when the affected domU context next resumes on a
> > > CPU. The intelligent domU handler can then decide whether the domU
> > > must panic, whether it could simply kill the affected process, etc.
> > > Those details are clearly sketchy, but the idea is to up-level the
> > > communication to a domU to be more like "you're broken" rather than
> > > "here's a machine-level hardware error for you to interpret and
> > > decide what to do with".
> >
> > Yes, this makes much more sense than forwarding #MC, as the guest
> > would have a hard time actually doing anything really useful with
> > this. As far as I know, most uncorrectable errors are near enough
> > entirely fatal in most commercial non-Enterprise OS's anyway - e.g.
> > in Windows XP or Server 2K3 it always ends in a blue screen - which
> > is hardly any better than the guest being "humanely euthanized" by
> > Dom0.
> >
> > I take it this would be some sort of hypercall (available through
> > the regular PV-driver interface for HVM guests) to say "Let me know
> > if I'm broken - trap on vector X".
>
> In short, guests with a PV MCA driver will see a certain event
> (assuming the event mechanism is used for the notification),
> and guests w/o a PV MCA driver will see a "General Protection Fault".
> Is that right?
Not sure if a GP fault is the right thing for non-"MCA PV driver" domains.
I think "just killing" the domain is the right thing to do.
We can't guarantee that a GP fault is actually going to kill the guest.
Let's assume the code running in the guest was something along the lines
of:
int some_function(...)
{
        ...
        __try {
                ...
                /* Some code that does quite a lot of "random"
                   processing that may cause, for example, a GP fault */
                ...
        }
        __except (EXCEPTION_EXECUTE_HANDLER) {
                ...
                /* Handles the GP fault within the kernel code */
                ...
        }
}
Note that Windows kernel drivers are allowed to use the kernel's
structured exception handling, and ARE allowed to "swallow" GP faults if
they wish to do so. [Don't ask me why MS allows this, but that's the
case, so we have to live with it.] I'm not sure whether Linux, Solaris,
*BSD, OS/2 or other OS's allow "catching" a kernel GP fault in a
non-precise fashion (I know Linux has exception handling for EXACT
positions in the code - see the sketch below). But since at least one
kernel DOES allow this, we can't be sure that a GPF will destroy the
guest.
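To illustrate what I mean by "exact positions": Linux records, for each
individual instruction that is expected to fault, a fixup address to
resume at. A simplified, from-memory sketch of the i386 __get_user
mechanism (not the literal kernel source):

/* Each potentially-faulting instruction registers an entry in
 * __ex_table; on a fault, the kernel searches the table and, on a
 * hit, resumes at the fixup code instead of oopsing. */
#define __get_user_sketch(x, ptr, err)                              \
        __asm__ __volatile__(                                       \
                "1:     movl %2, %1\n"        /* may fault */       \
                "2:\n"                                              \
                ".section .fixup, \"ax\"\n"                         \
                "3:     movl %3, %0\n"        /* err = -EFAULT */   \
                "       xorl %1, %1\n"                              \
                "       jmp 2b\n"                                   \
                ".previous\n"                                       \
                ".section __ex_table, \"a\"\n"                      \
                "       .align 4\n"                                 \
                "       .long 1b, 3b\n"       /* fault -> fixup */  \
                ".previous"                                         \
                : "=r" (err), "=r" (x)                              \
                : "m" (*(ptr)), "i" (-14 /* -EFAULT */), "0" (err))

The point is that only the single instruction at label 1 is covered -
there's nothing like Windows' "catch anything in this region" semantics.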
The second point to note is, of course, that if the guest is in user
mode when the GPF happens, almost all OS's will just kill the
application - and there's absolutely no reason to believe that the
running application is where the actual memory problem is; it may have
been found by memory scrubbing, for example.
Whatever we do to the guest, it should be a "certain death", unless the
kernel has told us "I can handle MCEs".
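Something along these lines is what I have in mind for the "I can handle
MCEs" registration. To be clear, the hypercall, structure and capability
bits below are pure invention for illustration - no such interface
exists in Xen today:

#include <stdint.h>

/* Hypothetical sketch only: HYPERVISOR_mca_register and struct
 * mca_register are invented names, not an existing Xen interface. */

#define MCA_CAP_UNCORRECTABLE  (1u << 0) /* guest can survive UC errors */

struct mca_register {
    uint32_t capabilities;  /* MCA_CAP_* bits the guest supports */
    uint32_t event_port;    /* event channel to signal on an error */
};

/* Invented prototype - stands in for whatever hypercall wrapper the
 * regular PV-driver interface would expose. */
extern int HYPERVISOR_mca_register(struct mca_register *reg);

/* Guest boot code: declare "I can handle MCEs". A guest that never
 * makes this call gets the "certain death" treatment instead of a
 * forwarded error event. */
static int register_mca_handler(uint32_t port)
{
    struct mca_register reg = {
        .capabilities = MCA_CAP_UNCORRECTABLE,
        .event_port   = port,
    };
    return HYPERVISOR_mca_register(&reg);
}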
--
Mats
>
> > --
> > Mats
> >
> > > Gavin
> > >
>
> --
> AMD Saxony, Dresden, Germany
> Operating System Research Center
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel