[Xen-devel] [RFC] New shadow paging code

To: xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-devel] [RFC] New shadow paging code
From: Tim Deegan <Tim.Deegan@xxxxxxxxxxxxx>
Date: Fri, 14 Jul 2006 16:39:07 +0100
Delivery-date: Fri, 14 Jul 2006 09:18:34 -0700
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mutt/1.5.11+cvs20060403
We (Michael Fetterman, George Dunlap and I) have been working over the
last while on a full replacement for Xen's shadow pagetable support. 

This mail contains some design notes, below; a patch against
xen-unstable, giving a snapshot of the current state of the new shadow
code, is at http://www.cl.cam.ac.uk/~tjd21/shadow2.patch

Comments on both are welcome, although the code is not finished -- in
particular there are both some optimizations and some tidying-up that
need to be done.

Cheers,

Tim.

----

The new shadow code (dubbed 'shadow2') is designed as a replacement for
the current shadow code.  It's been designed from the ground up to support
the following capabilities:
 * Work for both paravirtualized and HVM guests.  Our focus is on Windows
under HVM, since Linux guests can use paravirtual mechanisms for faster
memory management.
 * Xen may be running in 2-, 3-, or 4-level paging mode.  While booting,
guests may be in direct-access mode (no paging), or any paging level less
than or equal to Xen's current paging level.  This means that we must
support 2-on-2, 2-on-3, 2-on-4, 3-on-3, 3-on-4, and 4-on-4 paging modes.
 * While bringing up secondary vcpus in an SMP system, the vcpus may all be
in different paging modes.  We must support these simultaneously.
 * Logdirty mode for live migration.
 * We must work with paravirtualized drivers for HVM domains.
 * We must support guest superpages.

With this in mind, we have made several design choices:
* Do away with the "out-of-sync" mechanism to begin with.  After a page is
promoted, emulate all writes to it until it is demoted again.  This makes the
logic a lot simpler, and also reduces the overhead of demand paging, which
is one of the most common Windows modes.  (See below for more information
on demand paging.)
* In the case of a size mismatch between guest pagetable entries and host
pagetable entries (i.e., 2-on-3 or 2-on-4, where guest pagetable entries
are 32 bits and host pagetable entries are 64 bits), a single guest page
may need to be shadowed by multiple shadow pages.  In this case, we always
shadow the entire guest pagetable, rather than shadowing only part at a
time.  We also keep the multiple backing shadow pagetables physically
contiguous in memory using a "buddy" allocator.  This allows us to use only
one mfn value to designate the entire group of mfns.  (A sketch of the
resulting address arithmetic is given just after this list.)
* We allocate a fixed amount of shadow memory at domain creation. This is
shared by all vcpus.  When we need more shadow pages, we begin to unshadow
pages to free up more memory in approximately an LRU fashion.
* We keep the p2m maps for HVM domains in a pagetable format, so that we
can use them as the pagetables for HVM guests in paging-disabled mode.
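
To make the size-mismatch arithmetic concrete, here is a minimal sketch in
C of how a single base mfn locates every shadow of a 32-bit guest L2 page
in 2-on-3 mode.  The helper names and the flat mfn type are illustrative
assumptions, not the interfaces used in the patch.

    #include <stdint.h>

    typedef uint64_t mfn_t;   /* assumption: treat an mfn as a plain integer */

    /*
     * In 2-on-3 mode a guest L2 page holds 1024 4-byte entries covering 4GB,
     * while each PAE shadow L2 page holds 512 8-byte entries covering 1GB.
     * Four physically contiguous shadow pages therefore back one guest page,
     * and the base mfn of the group is enough to locate any of them.
     */
    static inline mfn_t sh2_l2_shadow_mfn(mfn_t group_base_mfn, uint32_t guest_va)
    {
        return group_base_mfn + (guest_va >> 30);     /* which 1GB quarter */
    }

    static inline unsigned int sh2_l2_shadow_index(uint32_t guest_va)
    {
        return (guest_va >> 21) & 0x1ff;              /* entry within that page */
    }
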

So far, we have had several successes.  Demand-paging accesses have been
sped up by doing emulated writes rather than using the out-of-sync
mechanism.  The out-of-sync mechanism requires three page faults, two of
which entail relatively expensive shadow operations: marking a page out of
sync, and bringing it back into sync.  In the case of HVM guests, the
faults also cause three expensive vmexit/vmenter cycles.  Our emulated
writes require only two page faults, and each fault is less expensive.

Also, the overhead of many individual shadow operations is less in the newer
code than in the old code.

We have a number of potential optimizations in mind for the near future:

* Removing writable mappings.  As with the old code, when a
guest pfn is promoted to be a pagetable, we need to find and remove all
writable mappings to it, so that we can detect changes.  Following the
"start simple, then optimize" principle, our current code does a
brute-force search through the shadows.  Our tests indicate that when a
page is promoted to a pagetable, it generally has exactly one writable
mapping outstanding. This is true both for Windows and for Linux.  We plan
to use this fact to keep a back-pointer to the last writable shadow pte of
a page in the page_info struct of a page.  The few exceptions to the rule
can still be handled using brute-force search.
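
As a sketch of the back-pointer idea (the structure, field and helper names
below are hypothetical, not the actual page_info layout):

    #include <stdint.h>

    typedef uint64_t paddr_t;

    /* Hypothetical per-page metadata: remember where the single outstanding
     * writable shadow PTE was last seen (0 if unknown). */
    struct sh2_page_info {
        paddr_t last_writable_spte;
    };

    /* Assumed to exist elsewhere: clear the write bit in one shadow PTE
     * (returns 1 on success), count remaining writable references, and the
     * existing brute-force walk over all shadows. */
    extern int sh2_clear_write_bit(paddr_t spte_addr);
    extern unsigned int sh2_writable_refcount(const struct sh2_page_info *pg);
    extern int sh2_remove_writable_mappings_bruteforce(struct sh2_page_info *pg);

    int sh2_remove_writable_mappings(struct sh2_page_info *pg)
    {
        /* Fast path: usually there is exactly one writable mapping and the
         * back-pointer tells us where it is. */
        if ( pg->last_writable_spte &&
             sh2_clear_write_bit(pg->last_writable_spte) &&
             sh2_writable_refcount(pg) == 0 )
            return 0;

        /* Exceptions (stale pointer, more than one mapping) fall back to the
         * brute-force search through the shadows. */
        return sh2_remove_writable_mappings_bruteforce(pg);
    }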

* Fast-pathing some faults.  By storing the guest present / writable flags
in some of the spare bits of the shadow pagetable entries, we can fast-path
certain operations, such as propagating a fault to the guest or updating guest
dirty and accessed bits, without needing to map the guest pagetables.  This
should speed up some common faults, as well as reduce cache footprint.
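
For example (a sketch only; the bit positions are the x86 software-available
PTE bits, and the names are assumptions rather than the encoding used in the
patch):

    #include <stdint.h>

    /* x86 PTE bits 9-11 are ignored by hardware and free for software use. */
    #define SH2_GUEST_PRESENT    (1ULL << 9)
    #define SH2_GUEST_WRITABLE   (1ULL << 10)

    /* When propagating a guest PTE into the shadow, cache the guest's
     * present and read/write flags in the spare bits of the shadow entry. */
    static inline uint64_t sh2_cache_guest_flags(uint64_t spte, uint64_t gpte)
    {
        spte &= ~(SH2_GUEST_PRESENT | SH2_GUEST_WRITABLE);
        if ( gpte & (1ULL << 0) )            /* guest _PAGE_PRESENT */
            spte |= SH2_GUEST_PRESENT;
        if ( gpte & (1ULL << 1) )            /* guest _PAGE_RW */
            spte |= SH2_GUEST_WRITABLE;
        return spte;
    }

    /* On a fault, decide from the shadow entry alone whether the fault should
     * simply be propagated to the guest, without mapping its pagetables. */
    static inline int sh2_fault_belongs_to_guest(uint64_t spte, int is_write)
    {
        if ( !(spte & SH2_GUEST_PRESENT) )
            return 1;
        return is_write && !(spte & SH2_GUEST_WRITABLE);
    }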

* Batch updates.  There are times when guests do batch updates to
pagetables.  At these times, it makes sense to give the guest write access
to the pagetables.  At first this can be done simply by unshadowing the
page entirely. In the future, we can explore whether a "mark out of sync"
mechanism would speed things up.  We may be able to have a more extreme
optimization for Linux fork(): when we detect Linux doing a fork(), we can
unshadow the entire user portion of the guest address space, to save having
to detect a "batch update" and unshadow each guest pagetable individually.

* Full emulation of shadow page accesses.  Currently, we allow read-only
access to guest pagetables.  This requires us to emulate the dirty and
accessed bits of the guest pagetables, in turn requiring us to take page
faults.  But how many of these dirty/accessed bits are actually read?  It
may be more efficient, in certain circumstances, to emulate reads to guest
page tables as well as writes, taking the dirty and accessed bits from the
shadow pagetables.

* Teardown heuristics.  If we can determine when a guest is destroying a
process, we can unshadow the whole address space at once.  Failure to
detect when a process is being torn down will cause unnecessary overhead:
if the guest pagetables of the destroyed process are recycled as data
pages, all writes to the pages will be emulated (in a rather expensive
manner) until the page is unshadowed.  Even if the guest pagetables are
re-used for new process pagetables, constructing the address space will be
faster if unshadowed.

**************
Code Structure
**************

Our code must deal differently with all the different combinations of
shadow modes.  However, we expect that once a guest reaches its target
paging mode, it will stay in that mode for a long time; and the host will
never change its paging mode.  Rather than having a whole string of ifs in
the code based on the current guest and host paging modes, we compile
different code to deal with each pair of modes (2-on-2, 2-on-3, 2-on-4,
3-on-3, 3-on-4, 4-on-4).  (Direct mode is implemented as a special case of
m-on-m, where m is the host's current paging level.)  While this increases
the overall size of the hypervisor, it should greatly reduce both the cache
footprint of the shadow code and the number of pipeline flushes from
mispredicted branches.

To keep from having to maintain duplicate logic across 6 different bits of
code, we use a single source code file, and compiler directives to specify
mode-specific code.  This file is shadow2.c; it is built once for each
combination, with GUEST_PAGING_LEVELS and SHADOW_PAGING_LEVELS set to the
appropriate values.  The preprocessor then renames the functions from
  sh2_[function_name]()
to
  sh2_[function_name]__shadow_[m]_guest_[n]
for n-on-m mode.
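
A sketch of how the renaming can be arranged with the preprocessor (macro
names here are illustrative; the point is only that shadow2.c is compiled
once per mode pair with the two LEVELS symbols defined on the command line):

    /* Two helper layers so that the *values* of the LEVELS macros, not
     * their names, end up pasted into the identifier. */
    #define SH2_NAME__(name, s, g)  sh2_ ## name ## __shadow_ ## s ## _guest_ ## g
    #define SH2_NAME_(name, s, g)   SH2_NAME__(name, s, g)
    #define SH2_NAME(name)          SH2_NAME_(name, SHADOW_PAGING_LEVELS, GUEST_PAGING_LEVELS)

    /* Built with -DGUEST_PAGING_LEVELS=2 -DSHADOW_PAGING_LEVELS=3, the line
     * below declares sh2_page_fault__shadow_3_guest_2(). */
    struct vcpu;
    int SH2_NAME(page_fault)(struct vcpu *v, unsigned long va);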

At the end of shadow2.c is a structure containing function pointers for
each of the mode-specific functions; this is called shadow2_entry (and is
expanded by preprocessor directives using the __shadow_[m]_guest_[n] naming
convention).  When a guest vcpu is put into a particular shadow mode, an
element of the vcpu struct is pointed to the appropriate shadow2_entry
struct.  To call the appropriate function, one generally calls
shadow2_[function_name](v, [args]), which is implemented according to the
following template:

[rettype] shadow2_[function_name](v, [args]) {
        return v->arch.shadow2->[function_name](v, [args]);
}
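
To make that concrete, here is a compilable sketch with assumed names (the
actual struct and field names in the patch may differ):

    struct vcpu;

    /* One table of entry points per (shadow, guest) paging-level pair. */
    struct shadow2_entry_points {
        int  (*page_fault)(struct vcpu *v, unsigned long va);
        void (*invlpg)(struct vcpu *v, unsigned long va);
        /* ... one pointer per mode-specific operation ... */
    };

    struct arch_vcpu {
        struct shadow2_entry_points *shadow2;  /* set when the vcpu enters a mode */
    };

    struct vcpu {
        struct arch_vcpu arch;
    };

    /* Defined at the end of each build of shadow2.c, e.g. for 2-on-3: */
    extern struct shadow2_entry_points shadow2_entry__shadow_3_guest_2;

    /* Mode-independent wrapper following the template above. */
    static inline int shadow2_page_fault(struct vcpu *v, unsigned long va)
    {
        return v->arch.shadow2->page_fault(v, va);
    }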


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
