Folks, there's been some discussion about the VMI interface proposal
between myself and Linus/Andrew. I've appended my latest reply.
As regards the VMI proposal itself, I don't think I can forward it, so
if you don't have it you'd better ask Pratap Subrahmanyam
[pratap@xxxxxxxxxx] for it directly.
From: Ian Pratt [mailto:m+Ian.Pratt@xxxxxxxxxxxx]
Sent: 08 August 2005 20:59
To: Andrew Morton; torvalds@xxxxxxxx
Subject: RE: vmware's virtual machine interface
> Ian, the vmware guys were sounding a little anxious that they hadn't
> heard anything back on the VMI proposal and spec?
The first few of their patches are fine -- just cleanups to existing
arch code that we have similar patches for in our own tree. However, our
views on the actual VMI interface haven't changed since the discussion
at OLS, and we have serious reservations about the proposal.
I believe being able to override bits of kernel code and divert
execution through a "ROM" image supplied by the hypervisor is going to
lead to a maintenance nightmare.
People making changes to the kernel won't be able to see what the ROM
code is doing, and hence won't know how their changes affect it.
There'll be pressure to freeze internal APIs, otherwise it will be a
struggle to keep the 'ROM' up to date. I suspect we'll also end up with
a proliferation of hook points, with no-one knowing which of them are
actually used (there are currently 86). There'll also be pressure
to allocate opaque VMI private data areas in various structures such as
struct mm and struct page.
Looking at the VMI hooks themselves, I don't think they've really
thought through the design, at least not for a high-performance
implementation. For example, they have an API for doing batched updates
to PTEs. The problem with this approach is that it's difficult to avoid
read-after-write hazards on queued PTE updates -- you need to sprinkle
flushes liberally throughout arch independent code. Working out where to
put the flushes is tough: Xen 1.0 used this approach and we were never
quite sure we had flushes in all the necessary places in Linux 2.4 --
that's why we abandoned the approach with Xen 2.0 and provided a new
interface that avoids the problem entirely (and is also required for
doing fast atomic updates which are essential to make SMP guests get
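To make the read-after-write hazard on queued PTE updates concrete, here is a toy model in Python. It is only a sketch: the class and method names (BatchedPageTable, queue_update, flush, read) are invented for illustration and are not the VMI or Xen interface. The point is that a read issued between queueing an update and flushing it sees the old value.

```python
# Toy model only; all names here are hypothetical, not the VMI or Xen API.
class BatchedPageTable:
    def __init__(self):
        self.ptes = {}    # virtual page number -> live PTE value
        self.queue = []   # updates issued by the guest but not yet flushed

    def queue_update(self, vpn, value):
        """Record an update without applying it (the batching step)."""
        self.queue.append((vpn, value))

    def flush(self):
        """Apply all queued updates in order; return how many were applied."""
        applied = len(self.queue)
        for vpn, value in self.queue:
            self.ptes[vpn] = value
        self.queue.clear()
        return applied

    def read(self, vpn):
        """Read the live PTE; queued updates are invisible here."""
        return self.ptes.get(vpn, 0)


def demo_hazard():
    pt = BatchedPageTable()
    pt.queue_update(0x10, 0xABC)
    stale = pt.read(0x10)   # hazard: the queued write is not visible yet
    pt.flush()
    fresh = pt.read(0x10)   # after the flush the new value is visible
    return stale, fresh
```

Every code path that reads a PTE after queueing a write needs a flush() in between, which is why the flush placement has to be worked out across arch independent code.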
The current VMI design is mostly looking at things at an instruction
level, providing hooks for all the privileged instructions plus some for
PTE handling. Xen's ABI is a bit different. We discovered that it wasn't
worth creating hooks for many of the privileged instructions since
they're so infrequently executed that you might as well take the trap
and decode and emulate the instruction. The only ones that matter are on
critical paths (such as the context switch path, demand fault, IPI,
interrupt, fork, exec, exit etc), and we've concentrated our efforts on
making these paths go fast, driven by performance data.
As it stands, the VMI design wouldn't support several of the
optimizations that we've found to be very important for getting
near-native performance. The VMI design assumes you're using shadow page
tables, but a substantial part of Xen's performance comes from avoiding
their use. There's also no mention of SMP. This has been one of the
trickiest bits to get right on Xen -- it's essential to be able to
support SMP guests with very low overhead, and this required a few small
but carefully placed changes to the way IPIs and memory management are
handled (some of which have benefits on native too). The API doesn't
address IO virtualization at all.
We tend to think of the hypervisor API like a hardware architecture.
It's fairly fixed but can be extended from time to time in a backward
compatible fashion (after considerable thought and examination of
benchmark data, just as happens for h/w CPUs). The core parts of the Xen
CPU API have been fixed for quite a while (there have been some changes
to the para-virtualized IO driver APIs, but these are not addressed by
VMI at all).
One attractive aspect of the VMI approach is that it's possible to have
one kernel that works on native (at reduced performance) or on
potentially multiple hypervisors. However, the real cost to Linux
distros and ISVs of having multiple Linux kernels is the fact that they
need to do all the s/w qualification on each one. The VMI approach
doesn't change this at all: they will still have to do qualification
tests on native, Xen, VMware etc just as they do today[*]. Although it
would be nice to be able to move a running kernel between different
hypervisors at run time I really can't see how VMI would make this
feasible. There's far too much hidden state in the ROM and hypervisor
At an implementation level their design could be improved. Using
function pointers to provide hook points causes unnecessary overhead --
it's better to insert 5 byte NOPs that can be easily patched.
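As a rough illustration of the NOP-patching idea, the Python sketch below models a code region as a byte buffer and overwrites a 5-byte x86 NOP with a CALL rel32 to a hook. The helper names (patch_call, decode_call) and the offsets are made up for the example; a real kernel patcher would also have to deal with concurrent execution, which this ignores.

```python
import struct

NOP5 = bytes([0x0F, 0x1F, 0x44, 0x00, 0x00])  # a 5-byte x86 NOP encoding
CALL_REL32 = 0xE8                              # opcode of CALL rel32


def patch_call(text, site, target):
    """Overwrite the 5-byte NOP at offset `site` with a CALL to `target`."""
    if bytes(text[site:site + 5]) != NOP5:
        raise ValueError("patch site does not hold a 5-byte NOP")
    # The displacement is relative to the address of the next instruction.
    rel = target - (site + 5)
    text[site:site + 5] = struct.pack("<Bi", CALL_REL32, rel)


def decode_call(text, site):
    """Return the target of the CALL at `site`, or None if it is not a CALL."""
    if text[site] != CALL_REL32:
        return None
    (rel,) = struct.unpack_from("<i", text, site + 1)
    return site + 5 + rel


def demo_patch():
    # 8 bytes of 1-byte NOPs, the patch site at offset 8, then a RET.
    text = bytearray(b"\x90" * 8 + NOP5 + b"\xc3")
    patch_call(text, 8, 0x100)   # hypothetical hook at offset 0x100
    return decode_call(text, 8)
```

On native the site stays a NOP and falls straight through; only a kernel running on a hypervisor gets the direct call patched in, with no pointer load on either path.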
In summary: the cleanup part of their patch is useful, but I think the VMI
"ROM" approach is going to be messy and very troublesome to get right.
Chris Wright, Martin Bligh et al are currently making good progress
refactoring the xen patch to get it into a form that should be more
[See http://lists.xensource.com/archives/html/xen-merge/ ] It wouldn't
be a big deal to add VMI-like hooks to the Xen sub arch if VMware want
to go down that route (though we'd prefer to do it with NOP padding
rather than by adding an unnecessary indirection).
[*]Having a single kernel image that works native and on a hypervisor is
quite convenient from a user POV. We've looked into addressing this
problem in a different way, by building multiple kernels and then using
a tool that does a function-by-function 'union' operation, merging the
duplicates and creating a re-write table that can be used to patch the
kernel from native to Xen. This approach has no run time overhead, and
is entirely 'mechanical' rather than having to do it at source
level, which can be both tricky and messy.
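The function-by-function 'union' can be sketched as follows. This is a guess at the general shape, not the actual tool: each build is modelled as a mapping from function name to compiled bytes, and the names build_union, apply_rewrites and the ".xen" suffix are assumptions made for the example.

```python
def build_union(native, xen):
    """Merge two builds function by function.

    `native` and `xen` map function name -> compiled bytes (toy stand-ins).
    Returns (image, rewrite_table): `image` stores each distinct body once,
    `rewrite_table` names the functions whose Xen body differs from native.
    """
    image = {}
    rewrite_table = {}
    for name in sorted(set(native) | set(xen)):
        n, x = native.get(name), xen.get(name)
        if n == x:
            image[name] = n                  # duplicate: keep a single copy
            continue
        if n is not None:
            image[name] = n
        if x is not None:
            image[name + ".xen"] = x         # hypothetical naming scheme
            rewrite_table[name] = name + ".xen"
    return image, rewrite_table


def apply_rewrites(image, rewrite_table):
    """Return the function set after patching from native to Xen."""
    patched = {k: v for k, v in image.items() if not k.endswith(".xen")}
    for name, replacement in rewrite_table.items():
        patched[name] = image[replacement]
    return patched
```

With two builds that differ in a single function, the image holds one extra body and the rewrite table has one entry; applying the table once at boot yields the Xen function set.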
Xen-merge mailing list