[Xen-devel] Essay on an important Xen decision (long)

A fundamental architectural decision has to be made for
Xen regarding handling of physical/machine memory; at a high
level, the question is:

        Should Xen drivers be made more flexible to accommodate
        different approaches to managing physical memory, or
        should other architectures be required to conform to
        the Xen/x86 model?

A more detailed description of the specific decision is below.
The Xen/ia64 community would like to make this decision soon --
possibly at the Xen summit -- as next steps of Xen/ia64
functionality are significantly affected.  Since either choice
has an impact on common code and on future Xen architecture,
this decision must involve core Xen developers and the broader
Xen community rather than just Xen/ia64 developers.

While this may seem to be a trivial matter, such fundamental
choices often have a way of pre-selecting future design and
implementation directions that can have major negative or positive
impacts -- possibly unexpected -- on different parties.  For example,
a decision might make a Xen developer's life easier but create
headaches for a distro or a Linux maintainer.  If nothing else,
discussing fundamental decision points often helps to
bring out and codify/document hidden assumptions about
the future.

This is a lengthy document but I hope to touch on most of
the various issues and tradeoffs.  Understanding -- or, at
a minimum, reading -- this document should probably be
a prerequisite for involvement in discussions to resolve this.
I would encourage all readers to give the issues and tradeoffs
some thought as the "obvious x86" answer may not be the best
answer for the future of Xen.

First a little terminology and background:

In a virtualized environment, the resources of the physical
machine must be subdivided and/or shared between multiple virtual
machines.  Just as an OS manages memory for its applications, one of
the primary roles of a hypervisor is to provide the illusion to
each guest OS that it owns some amount of "RAM" in the system.
Thus there are two kinds of physical memory addresses: the
addresses that a guest believes to be physical addresses and
the addresses that actually refer to RAM (e.g. bus addresses).
The literature (and Xen) confusingly labels these as "physical"
addresses and "machine" addresses.  In a virtualized environment,
there must be some way of maintaining the relationship -- or
"mapping" -- between physical addresses and machine addresses.

In Xen (across all architectures), there are currently three
different approaches for mapping physical addresses to machine
addresses:

1) P==M: The guest is given a subset of machine memory that it
   can access "directly".  Accesses to machine memory addresses
   outside of this range must somehow be restricted (but not
   necessarily disallowed) by Xen.

2) guest-aware p!=m (P2M): The guest is given max_pages of
   contiguous physical memory starting at zero and the knowledge
   that physical addresses are different than machine addresses.
   The guest must understand the difference between a physical
   address and a machine address and utilize the correct one in
   different situations.

3) virtual physical (VP): The guest is given max_pages of
   contiguous physical memory starting at zero.  Xen provides
   the illusion to the guest that this is machine memory;
   any physical-to-machine translation required for functional
   correctness is handled invisibly by Xen.  VP cannot be used
   by guests that directly program DMA-based I/O devices
   because a DMA device requires a machine address and, by
   definition, the guest knows only about physical addresses.
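
As a rough illustration of the three models above, the following
Python sketch treats each one as a way of turning a guest physical
frame number (pfn) into the frame a device would be given.  All class
and method names are hypothetical and exist only for this
illustration; none of this is Xen code.

```python
class PEqualsM:
    """P==M: the guest addresses a slice of machine memory directly."""

    def __init__(self, first_mfn, nr_pages):
        self.own_range = range(first_mfn, first_mfn + nr_pages)

    def in_own_range(self, mfn):
        # Frames outside this range are where the hypervisor must
        # restrict (though not necessarily forbid) access.
        return mfn in self.own_range

    def frame_for_dma(self, pfn):
        # No translation: the guest's physical frame is the machine frame.
        return pfn


class GuestAwareP2M:
    """P2M: the guest sees pfns 0..max_pages-1 and holds the table itself."""

    def __init__(self, mfns):
        self.p2m = list(mfns)  # index is the pfn, value is the mfn

    def frame_for_dma(self, pfn):
        # The guest knows the difference and picks the machine frame.
        return self.p2m[pfn]


class VirtualPhysical:
    """VP: the same kind of table exists, but only the hypervisor sees it."""

    def __init__(self, mfns):
        self._hidden_p2m = list(mfns)

    def frame_for_dma(self, pfn):
        # The guest knows only physical frames, so this is all it can
        # hand to a device.  It is generally not the frame that backs
        # pfn, which is why VP guests cannot program DMA devices directly.
        return pfn

    def hypervisor_translate(self, pfn):
        # Performed invisibly on the guest's behalf.
        return self._hidden_p2m[pfn]
```

With backing frames [0x900, 0x17, 0x300], the P2M guest hands the
device frame 0x17 for pfn 1, while the VP guest can only offer 1,
which is not the frame that actually holds its data.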

Xen/x86 and Xen/x86_64 use P2M, but switch to VP (aka "shadow
mode") for an unprivileged guest when a migration is underway.
Xen/ia64 currently uses P==M for domain0 and VP for unprivileged
guests.  Xen/ppc intends to use VP only.

There is an architectural proposal to change Xen/ia64 so that
domain0 uses P2M instead of P==M.  We will call this choice P2M
and the choice to stay on the current path P==M.

Here's what I think are the key issues/tradeoffs:

XEN CODE IMPACT

Some Xen drivers, such as the blkif driver, have been "converted"
to accommodate P==M. Others have not.  For example, the balloon driver
currently assumes domain0 is P2M and thus does not currently work
on Xen/ia64 or Xen/ppc.  The word "converted" is quoted because
nobody is particularly satisfied with the current state of the
converted drivers.  Many apparently significant function calls are
define'd out of existence by macros.  Other code does radically
different things depending on the architecture or on whether it
is being executed by dom0 or an unprivileged domain.  And a few
ifdef's are sprinkled about.  In short, what's done works but is
an ugly hack.  Some believe that the best way to solve this mess
is for other architectures to do things more like Xen/x86.  Others
believe there is an advantage to defining clear abstractions and
making the drivers truly more architecture-independent.
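
To make the "clear abstractions" position concrete, here is a minimal
Python sketch in which the driver calls one translation interface and
each memory model supplies its own implementation, so the driver body
contains no per-architecture or per-domain branches.  The interface
and every name in it are assumptions made for illustration; this is
not how any existing Xen driver is written.

```python
from abc import ABC, abstractmethod


class AddressTranslator(ABC):
    """Hypothetical interface a driver would use instead of ifdefs."""

    @abstractmethod
    def to_machine(self, pfn):
        """Return the machine frame number for a guest physical frame."""


class IdentityTranslator(AddressTranslator):
    """For a P==M domain: physical and machine frames coincide."""

    def to_machine(self, pfn):
        return pfn


class TableTranslator(AddressTranslator):
    """For a P2M domain: look the frame up in a guest-held table."""

    def __init__(self, p2m):
        self.p2m = dict(p2m)

    def to_machine(self, pfn):
        return self.p2m[pfn]


def build_dma_request(translator, pfns):
    # Driver-side code: identical regardless of the memory model.
    return [translator.to_machine(pfn) for pfn in pfns]
```

The cost of this approach is the up-front work of agreeing on the
interface; the benefit is that a new memory model becomes a new
implementation rather than a new set of conditionals in each driver.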

P2M will require some rewriting of existing Xen/ia64 core code and the
addition of significant changes to Xenlinux/ia64 code but will allow
much easier porting of Xen's balloon/networking/migration drivers
and also enable some simplifying changes in the Xen block driver.
It is fair to guess that it will take at least several weeks/months
to rewrite and debug the core and Xenlinux code to get Xen/ia64 back
to where it is today, but future driver work will be much faster.
Fewer differences from Xen/x86 means less maintenance work for Xen
core and Xen/ia64 developers.  I'd imagine also that more code will
be shared between Xen/VT-i and Xen/VT-x.

P==M will require Xen's balloon/networking/migration drivers to
evolve to incorporate non-P2M models.  This can be done, but is most
likely to end up (at least in the short term) as a collection of
unpalatable hacks, as with the Xen block driver.  However, making
Xen drivers more tolerant of different approaches may be a good
thing in the long run for Xen.

XENLINUX IMPACT

Today's operating systems are not implemented with an understanding
that a physical address and a machine address might be different.
Building this awareness into an OS requires non-trivial source
code change.  For example, Xenlinux/x86 maintains a "p2m" mapping
table for quick translation and provides a "m2p" hypercall to keep
Xen in sync.  OS code that manipulates physical addresses must be
modified to access/manage this table and make hypercalls when
appropriate.  Macros can hide much of the complexity, but much OS/driver
code exists that does not use standard macros.  There is some
disagreement on how extensive the required source code changes are,
and how difficult it will be to maintain these changes across future
versions of guest OSes.  One illustrative example, however:  In
paravirtualizing Xenlinux/ia64, seven header files are changed;
it is closer to 40 for Xenlinux/x86.
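
The bookkeeping burden described above can be sketched as follows.
This is a toy Python model, not Xenlinux code: FakeHypervisor,
m2p_update and the INVALID marker are hypothetical stand-ins for the
hypervisor's machine-to-physical table and the call a guest would
make to keep it in sync.

```python
INVALID = -1  # hypothetical marker for "no mapping"


class FakeHypervisor:
    """Stand-in for the hypervisor's machine-to-physical (m2p) view."""

    def __init__(self):
        self.m2p = {}

    def m2p_update(self, mfn, pfn):
        # Stand-in for the hypercall that keeps the hypervisor in sync.
        self.m2p[mfn] = pfn


class Guest:
    """A guest that keeps its own p2m table for quick translation."""

    def __init__(self, hv, mfns):
        self.hv = hv
        self.p2m = [INVALID] * len(mfns)
        for pfn, mfn in enumerate(mfns):
            self.set_mapping(pfn, mfn)

    def set_mapping(self, pfn, mfn):
        old = self.p2m[pfn]
        if old != INVALID:
            # The old machine frame no longer backs this pfn.
            self.hv.m2p_update(old, INVALID)
        self.p2m[pfn] = mfn
        self.hv.m2p_update(mfn, pfn)

    def pfn_to_mfn(self, pfn):
        return self.p2m[pfn]
```

Every place in the OS that changes which machine frame backs a
physical frame has to go through something like set_mapping; code
that bypasses it leaves the two tables inconsistent, which is the
maintenance concern raised above.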

Relatedly, some would assert that pushing a small number of changes into
Linux (or any OS, open source or not) is far easier than pushing a
large number of changes into Linux.  Until all the Xen/x86 changes are
in, it remains to be seen whether this is true or not.  There is
a reasonable concern that the broad review required for such
an extensive set of changes will involve a large number of people
with a large number of agendas and force a number of Xen design
issues to be revisited -- at least clearly justified if not changed.
This is especially true if Xen's foes have any influence in the
process.

Transparent paravirtualization (also called "shared binary") is the
ability for the same binary to be used both as a Xen guest and
natively on real hardware.  Xenlinux/ia64 currently supports this;
indeed, ignoring a couple of existing bugs, the same Xenlinux/ia64
binary can be used natively, and as domain0 and as an unprivileged
domain. There have been proposals to do the same for Xenlinux/x86,
but the amount of code changed is much higher.  There is debate
about the cost/benefit of transparent paravirtualization, but the
primary beneficiaries -- distros and end customers -- are not very
well represented here.
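
One common way to get a shared binary is to choose a table of
operations once at startup, so the rest of the kernel never needs to
know which environment it is in.  The Python sketch below shows only
that pattern.  It is an assumption for illustration and does not
describe how Xenlinux/ia64 actually implements the feature; how the
kernel detects that it is running on Xen is deliberately left out and
passed in as a flag.

```python
def native_disable_interrupts(log):
    # On real hardware the kernel would use a privileged instruction.
    log.append("privileged instruction")


def xen_disable_interrupts(log):
    # As a guest, the same operation becomes a request to the hypervisor.
    log.append("hypervisor request")


def select_ops(running_on_xen):
    """Pick the operation table once, at boot, within a single binary."""
    if running_on_xen:
        return {"disable_interrupts": xen_disable_interrupts}
    return {"disable_interrupts": native_disable_interrupts}
```

The fewer operations that differ between the two tables, the easier
this is to maintain, which is one reason the number of changed files
matters for transparent paravirtualization.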

With P2M, it is unlikely that Xenlinux/ia64 will ever again be
transparently paravirtualizable.  As with Xenlinux/x86, the changes
will probably be pushed into a subarch (mach-xen).  Since Linux/ia64
has a more diverse set of subarches, there may be additional work
to ensure that Xen is orthogonal to (and thus works with) all the
subarches.

P==M would continue to allow transparent paravirtualization.
This plus the reduced number of changes should make it easier to
get Xen/ia64 support into Linux/ia64 (assuming Xen/x86 support
gets included in Linux/x86).

DRIVER DOMAINS

Driver domains are "coming soon" and support of driver domains is a
"must"; however, support for hybrid driver domains (i.e. domains that
utilize both backend and frontend drivers) is open to debate.  It can
be assumed, however, that all driver domains will require DMA access.

P2M should make driver domains easier to implement (once the initial
Xenlinux/ia64 work is completed) and able to support a broader range
of functionality.  P==M may disallow hybrid driver domains and
create other restrictions, though some creative person may be able
to solve these.

FUTURE XEN FEATURE SUPPORT

None of the approaches have been "design-tested" significantly for
support or compatibility with future Xen functionality such as
oversubscription or machine-memory hot-plug, nor for exotic
machine memory topologies such as NUMA or discontig (sparsely
populated).  Such functionalities and topologies are much more
likely to be encountered in high-end server architectures than
in widely available PCs and low-end servers.  There is some
debate as to whether the existing Xen memory architecture will easily
evolve to accommodate these future changes or if more fundamental
changes will be required.  Architectural decisions and restrictions
should be made with these uncertainties in mind.

Some believe that discovery and policy for machine memory will
eventually need to move out of Xen into domain0, leaving only
enforcement mechanism in Xen.  For example, oversubscription, NUMA
or hot-plug memory support are likely to be fairly complicated
and a commonly stated goal is to move unnecessary complexity out
of Xen.  And the plethora of recent changes in Linux/ia64
involving machine memory models indicates there are still many
unknowns.  P==M more easily supports a model where domain0
owns ALL of machine memory *except* a small amount reserved for
and protected by Xen itself.  If this is all true, Xen/x86 may
eventually need to move to a dom0 P==M model, in which case it
would be silly for Xen/ia64 to move to P2M and then back to P==M.

Others think these features will be easy to implement in Xen and,
with minor changes, entirely compatible with P2M, and that
P2M is the once and future model for domain0.

SUMMARY

I'm sure there are more issues and tradeoffs that will come up
in discussion, but let me summarize these:

Move domain0 to P2M:
+ Fewer differences in Xen drivers between Xen/x86 and Xen/ia64
+ Fewer differences in Xen drivers between Xen/VT-x and Xen/VT-i
+ Easier to implement remaining Xen drivers for Xen/ia64
- Major changes may require months for Xen/ia64 to regain stability
- Many more changes to Xenlinux/ia64; more difficulty pushing upstream
- No attempt to make Xen more resilient for future architectures

Leave domain0 as P==M:
+ Fewer changes in Xenlinux; easier to push upstream
+ Making Xen more flexible is a good thing
? May provide better foundation for future features (oversubscr, NUMA)
- More restrictions on driver domains
- More hacks required for some Xen drivers, or
- More work to better abstract and define a portable driver
  architecture

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel