Keir, please hg add the attached doc as docs/misc/tmem-internals.html
The info applies for 4.0 as well, so the doc can also go into 4.0-testing.
Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
Transcendent Memory Internals in Xen
by Dan Magenheimer, Oracle Corp.
Draft 0.1 -- Updated: 20100324
This document focuses on the internal implementation of
Transcendent Memory (tmem) on Xen. It assumes
that the reader has a basic knowledge of the terminology, objectives, and
functionality of tmem and also has access to the Xen source code.
It corresponds to the Xen 4.0 release, with
a patch added to support page deduplication (V2).
The primary responsibilities of the tmem implementation are to:
- manage a potentially huge and extremely dynamic
number of memory pages from a potentially large number of clients (domains)
with low memory overhead and proper isolation
- provide quick and efficient access to these
pages with as much concurrency as possible
- enable efficient reclamation and eviction of pages (e.g. when
memory is fully utilized)
- optionally, increase page density through compression and/or deduplication
- where necessary, properly assign and account for
memory belonging to guests to avoid malicious and/or accidental unfairness
- record utilization statistics and make them available to management tools
Source Code Organization
The source code in Xen that provides the tmem functionality
is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h.
The files tmem.c and tmem.h are intended to
be implementation- (and hypervisor-) independent and the other two files
provide the Xen-specific code. This
division is intended to make it easier to port tmem functionality to other
hypervisors, though at this time porting to other hypervisors has not been
attempted. Together, these four files
total less than 4000 lines of C code.
Even ignoring the implementation-specific functionality, the
implementation-independent part of tmem has several dependencies on
library functionality (Xen source filenames in parentheses):
- a good fast general-purpose dynamic memory allocator with bounded
response time and efficient use of memory for a very large number of
sub-page allocations. To achieve this in Xen, the bad old memory
allocator was replaced with a slightly-modified version of TLSF
(xmalloc_tlsf.c), first ported to Linux by Nitin Gupta for compcache.
- good tree data structure libraries, specifically red-black trees
(rbtree.c) and radix trees (radix-tree.c). Code for these was borrowed
from Linux and adapted for tmem and Xen.
- good locking and list code. Both of these existed in Xen and required
little or no change.
- optionally, a good fast lossless compression library. The Xen
implementation added to support tmem uses LZO1X (lzo.c), also ported to
Linux by Nitin Gupta.
More information about the specific functionality of these
libraries can easily be found through a search engine, via wikipedia, or in the
Xen or Linux source logs so we will not elaborate further here.
The tmem code uses several prefixes and abbreviations.
Knowledge of these will improve code readability:
- tmh == transcendent memory host. Functions or data structures that are
defined by the implementation-specific code, i.e. the Xen host code.
- tmemc == transcendent memory control. Functions or data structures
that provide management tool functionality, rather than core tmem
operations.
- client == client. The tmem generic term for a domain or a guest OS.
When used in prose, common tmem operations are indicated
with a different font, such as put.
Key Data Structures
To manage a huge number of pages, efficient data structures
must be carefully selected.
Recall that a tmem-enabled guest OS may create one or more
pools with different attributes. It then
puts and gets
pages to/from this pool, identifying the page
with a handle that consists of a pool_id, an
object_id, and a page_id (sometimes
called an index).
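The handle triple can be sketched as a simple C struct; the field names and widths below are illustrative, not the ones used in the Xen source:

```c
#include <stdint.h>

/* Hypothetical sketch of a tmem handle: the triple a client uses to
 * name a page.  Field names and sizes are illustrative. */
struct tmem_handle {
    uint32_t pool_id;   /* which pool within the client (16 max in Xen) */
    uint64_t object_id; /* typically an inode number */
    uint32_t page_id;   /* typically a page index within the file */
};

/* Two handles name the same page iff all three components match. */
static inline int tmem_handle_eq(const struct tmem_handle *a,
                                 const struct tmem_handle *b)
{
    return a->pool_id == b->pool_id &&
           a->object_id == b->object_id &&
           a->page_id == b->page_id;
}
```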
This suggests a few obvious core data structures:
When a guest OS first calls tmem, a client_t is created to contain
and track all uses of tmem by that guest OS. Among
other things, a client_t keeps pointers
to a fixed number of pools (16 in the current Xen implementation).
When a guest OS requests a new pool, a pool_t is created.
Some pools are shared and are kept in a
sharelist (sharelist_t) which points
to all the clients that are sharing the pool.
Since an object_id is 64-bits,
a pool_t must be able to keep track
of a potentially very large number of objects.
To do so, it maintains a number of parallel trees (256 in the current
Xen implementation) and a hash algorithm is applied to the object_id
to select the correct tree.
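The tree-selection step might look like the following sketch; the actual hash used in tmem.c may differ, but any mix that spreads "clumpy" inode numbers across the 256 trees serves the purpose:

```c
#include <stdint.h>

#define OBJ_HASH_BUCKETS 256  /* parallel trees per pool, as in Xen 4.0 */

/* Illustrative hash: fold all bytes of the 64-bit object_id down to
 * one byte and use it to pick one of the 256 parallel trees. */
static inline unsigned obj_hash(uint64_t object_id)
{
    object_id ^= object_id >> 32;
    object_id ^= object_id >> 16;
    object_id ^= object_id >> 8;
    return (unsigned)(object_id & (OBJ_HASH_BUCKETS - 1));
}
```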
Each tree element points to an object.
Because an object_id usually represents an inode
(a unique file number identifier), and inode numbers
are fairly random, though often "clumpy", a red-black tree
is a good fit for this index distribution.
When a guest first
puts a page to a pool with an as-yet-unused object_id, an
obj_t is created. Since a page_id is usually an index into a file,
it is often a small number, but may sometimes be very large (up to
32-bits). A radix tree is a good data structure to contain items
with this kind of index distribution.
When a page is
put, a page descriptor, or pgp_t, is created, which
among other things will point to the storage location where the data is kept.
In the normal case the pointer is to a pfp_t, which is an
implementation-specific datatype representing a physical pageframe in memory
(which in Xen is a "struct page_info").
When deduplication is enabled, it points to
yet another data structure, a pcd_t
(see below). When compression is enabled
(and deduplication is not), the pointer points directly to the compressed data.
For reasons we will see shortly, each pgp_t that represents
an ephemeral page (that is, a page placed
in an ephemeral pool) is also placed
into two doubly-linked linked lists, one containing all ephemeral pages
put by the same client and one
containing all ephemeral pages across all clients ("global").
When deduplication is enabled, multiple pgp_t's may need to point to
the same data, so another data structure (and level of indirection) is used
called a page content descriptor, or pcd_t.
Multiple page descriptors (pgp_t's) may point to the same pcd_t.
The pcd_t, in turn, points to either a pfp_t
(if a full page of data), directly to a
location in memory (if the page has been compressed or trailing zeroes have
been eliminated), or even a NULL pointer (if the page contained all zeroes and
trailing zero elimination is enabled).
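A hypothetical pcd_t-like descriptor distinguishing the three storage variants might look like this; names and layout are illustrative, not the Xen definitions:

```c
#include <stddef.h>
#include <stdint.h>

struct pfp;  /* implementation-specific page frame
                (struct page_info in Xen) */

/* Sketch of a page content descriptor: one copy of data shared by
 * several page descriptors (pgp_t's). */
struct pcd {
    unsigned refcount;  /* number of pgp_t's sharing this content */
    size_t   size;      /* bytes stored; 0 for an all-zero page */
    struct pfp *pfp;    /* full uncompressed page frame, or NULL */
    uint8_t *cdata;     /* compressed/tze data, or NULL */
};

/* A pcd stores its data in exactly one of three ways. */
enum pcd_kind { PCD_PAGEFRAME, PCD_BYTES, PCD_ALL_ZERO };

static enum pcd_kind pcd_kind(const struct pcd *p)
{
    if (p->pfp)   return PCD_PAGEFRAME;  /* full page of data */
    if (p->cdata) return PCD_BYTES;      /* compressed or tze remainder */
    return PCD_ALL_ZERO;                 /* all zeroes, nothing stored */
}
```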
The most apparent usage of this multi-layer web of data structures
is "top-down" because, in normal operation, the vast majority of tmem
operations invoked by a client are
puts and gets, which require the various
data structures to be walked starting with the client_t, then
a pool_t, then an obj_t, then a pgp_t.
However, there is another highly frequent tmem operation that is not
visible from a client: memory reclamation.
Since tmem attempts to use all spare memory in the system, it must
frequently free up, or evict,
pages. The eviction algorithm will be
explained in more detail later but, in brief, to free memory, ephemeral pages
are removed from the tail of one of the doubly-linked lists, which means that
all of the data structures associated with that page-to-be-removed must be
updated or eliminated and freed. As a
result, each data structure also contains a back-pointer
to its parent, for example every obj_t
contains a pointer to its containing pool_t.
This complex web of interconnected data structures is updated constantly and
thus extremely sensitive to careless code changes which, for example, may
result in unexpected hypervisor crashes or non-obvious memory leaks.
On the other hand, the code is fairly well
modularized so, once understood, it is possible to relatively easily switch out
one kind of data structure for another.
To catch problems as quickly as possible when debug is enabled, most of
the data structures are equipped with sentinels and many inter-function
assumptions are documented and tested dynamically with assertions.
While these clutter and lengthen the tmem
code substantially, their presence has proven invaluable on many occasions.
For completeness, we should also describe a key data structure in the Xen
implementation-dependent code: the tmh_page_list. For security and
performance reasons, pages that are freed due to tmem operations (such
as get) are not immediately put back into Xen's pool
of free memory (aka the Xen heap).
Tmem pages may contain guest-private data that must be scrubbed before
those memory pages are released for the use of other guests.
But if a page is immediately re-used inside of tmem itself, the entire
page is overwritten with new data, so need not be scrubbed.
Since tmem is usually the most frequent
customer of the Xen heap allocation code, it would be a waste of time to scrub
a page, release it to the Xen heap, and then immediately re-allocate it
again. So, instead, tmem maintains
currently-unused pages of memory on its own free list, tmh_page_list,
and returns the pages to Xen only when non-tmem Xen
heap allocation requests would otherwise fail.
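The recycling logic can be modeled with a toy free list; the names echo the tmh_ prefix but are simplified stand-ins, and the "scrub" is a plain memset:

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Toy model of the tmem-private page free list: pages freed by tmem are
 * kept unscrubbed for fast reuse inside tmem, and scrubbed only when
 * finally handed back to the (simulated) Xen heap. */
struct page { struct page *next; unsigned char data[PAGE_SIZE]; };

static struct page *tmh_page_list; /* currently-unused tmem pages */

static void tmh_free_page(struct page *pg)    /* fast path: no scrub */
{
    pg->next = tmh_page_list;
    tmh_page_list = pg;
}

static struct page *tmh_alloc_page(void)      /* reuse without scrub */
{
    struct page *pg = tmh_page_list;
    if (pg) tmh_page_list = pg->next;
    return pg; /* caller overwrites the whole page; NULL if list empty */
}

static struct page *tmh_release_to_heap(void) /* slow path: scrub */
{
    struct page *pg = tmh_alloc_page();
    if (pg) memset(pg->data, 0, PAGE_SIZE);   /* guest data must go */
    return pg;
}
```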
Tmem has been designed to be highly scalable.
Since tmem access is invoked similarly in
many ways to asynchronous disk access, a "big SMP" tmem-aware guest
OS can, and often will, invoke tmem hypercalls simultaneously on many different
physical CPUs. And, of course, multiple
tmem-aware guests may independently and simultaneously invoke tmem
hypercalls. While the normal frequency
of tmem invocations is rarely extremely high, some tmem operations such as data
compression or lookups in a very large tree may take tens of thousands of
cycles or more to complete. Measurements
have shown that normal workloads spend no more than about 0.2% (2% with
compression enabled) of CPU time executing tmem operations.
But those familiar with OS scalability issues
recognize that even this limited execution time can create concurrency problems
in large systems and result in poorly-scalable performance.
A good locking strategy is critical to concurrency, but also
must be designed carefully to avoid deadlock and livelock problems. For
debugging purposes, tmem supports a "big kernel lock" which disables
concurrency altogether (enabled in Xen with "tmem_lock", but note
that this functionality is rarely tested and likely has bit-rotted). Infrequent
but invasive tmem hypercalls, such as pool creation or the control operations,
are serialized on a single read-write lock, called tmem_rwlock,
which must be held for writing. All other tmem operations must hold this lock
for reading, so frequent operations such as
put, get, and flush can execute simultaneously
as long as no invasive operations are occurring.
Once a pool has been selected, there is a per-pool
read-write lock (pool_rwlock) which
must be held for writing if any transformative operations might occur within
that pool, such as when an obj_t is
created or destroyed. For the highly
frequent operation of finding an obj_t
within a pool, pool_rwlock must be held for reading.
Once an object has been selected, it is protected by a per-object
spinlock. This is a spinlock rather than a read-write
lock because nearly all of the most frequent tmem operations (e.g.
put, get, and flush)
are transformative, in
that they add or remove a page within the object.
This lock is generally taken whenever an
object lookup occurs and released when the tmem operation is complete.
Next, the per-client and global ephemeral lists are
protected by a single global spinlock (eph_lists_spinlock)
and the per-client persistent lists are also protected by a single global
spinlock. And to complete the description of
implementation-independent locks, if page deduplication is enabled, all pages
for which the first byte matches are contained in one of 256 trees that are
protected by one of 256 corresponding read-write locks.
In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info)
that have been released are kept in a list (tmh_page_list) that
is protected by a spinlock (tmh_page_list_lock).
There is also an "implied" lock
associated with compression, which is likely the most time-consuming operation
in all of tmem (of course, only when compression is enabled): A compression
buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to
this buffer is returned to implementation-independent code and used without a
lock.
The proper method to avoid deadlocks is to take and release
locks in a very specific predetermined order.
Unfortunately, since tmem data structures must simultaneously be
accessed "top-down" (put and get) and "bottom-up"
(memory reclamation), more complex methods must be employed:
A trylock mechanism is used (cf. tmem_try_to_evict_pgp()),
which takes the lock if it is available but returns immediately (rather than
spinning and waiting) if the lock is not available.
When walking the ephemeral list to identify
pages to free, any page that belongs to an object that is locked is simply
skipped. Further, if the page is the
last page belonging to an object, and the pool read-write lock for the pool the
object belongs to is not available (for writing), that object is skipped.
These constraints modify the LRU algorithm
somewhat, but avoid the potential for deadlock.
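The skip-if-locked walk can be sketched in portable C, with a C11 atomic_flag standing in for Xen's object spinlock; the function and field names are illustrative:

```c
#include <stdatomic.h>

/* Sketch of the deadlock-avoiding eviction walk: rather than blocking
 * on an object's lock while scanning the ephemeral list, try the lock
 * and simply skip the object if it is busy. */
struct obj { atomic_flag lock; int npages; };

/* Returns 1 if a page of o was evicted, 0 if o was skipped or empty. */
static int try_evict_one(struct obj *o)
{
    if (atomic_flag_test_and_set(&o->lock))
        return 0;             /* held by a concurrent put/get: skip it */
    int evicted = 0;
    if (o->npages > 0) {
        o->npages--;          /* stand-in for freeing the LRU page */
        evicted = 1;
    }
    atomic_flag_clear(&o->lock);
    return evicted;
}
```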
Unfortunately, a livelock was still discovered in this approach:
When memory is scarce and each client is
putting a large number of pages
for exactly one object (and thus holding the object spinlock for that object),
memory reclamation takes a very long time to determine that it is unable to
free any pages, and so the time to do a
put (which eventually fails) becomes linear to the
number of pages in the object! To avoid
this situation, a workaround was added to always ensure a minimum amount of
memory (1MB) is available before any object lock is taken for the client
invoking tmem (see tmem_ensure_avail_pages()).
Other such livelocks (and perhaps deadlocks)
may be lurking.
A last issue related to concurrency is atomicity of counters.
Tmem gathers a large number of
statistics. Some of these counters are
informational only, while some are critical to tmem operation and must be
incremented and decremented atomically to ensure, for example, that the number
of pages in a tree never goes negative if two concurrent tmem operations access
the counter exactly simultaneously. Some
of the atomic counters are used for debugging (in assertions) and perhaps need
not be atomic; fixing these may increase performance slightly by reducing
cache-coherency traffic. Similarly, some
of the non-atomic counters may yield strange results to management tools, such
as showing the total number of successful
puts as being higher than the total number of attempted puts.
These are left as exercises for future tmem implementors.
Control and Manageability
Tmem has a control interface to, for example, set various
parameters and obtain statistics. All
tmem control operations funnel through do_tmem_control()
and other functions supporting tmem control operations are prefixed
with tmemc_.
During normal operation, even if only one tmem-aware guest
is running, tmem may absorb nearly all free memory in the system for its own
use. Then if a management tool wishes to
create a new guest (or migrate a guest from another system to this one), it may
notice that there is insufficient "free" memory and fail the creation
(or migration). For this reason, tmem
introduces a new tool-visible class of memory -- freeable memory --
and provides a control interface to access
it. All ephemeral memory and all pages on the tmh_page_list
are freeable. To properly access freeable
memory, a management tool must follow a sequence of steps:
- freeze tmem: When tmem is frozen, all
puts fail, which ensures that no
additional memory may be absorbed by tmem.
(See tmemc_freeze_pools(), and
note that individual clients may be frozen, though this functionality may be
used only rarely.)
- query freeable MB: If all freeable memory were released to the Xen
heap, this is the amount of memory (in MB) that would be freed.
- flush: Tmem may be requested to flush, or relinquish, a certain amount
of memory, e.g. back to the Xen heap. This amount is
specified in KB. See tmemc_flush_mem() and tmem_relinquish_npages().
- allocate: At this point the management tool may allocate
the memory, e.g. using Xen's published interfaces.
- thaw tmem: This terminates the freeze, allowing tmem to accept
puts again.
Extensive tmem statistics are available through tmem's
control interface (see tmemc_list and
the separate source for the "xm tmem-list" command and the
xen-tmem-list-parse tool). To maximize
forward/backward compatibility with future tmem and tools versions, statistical
information is passed via an ASCII interface where each individual counter is
identified by an easily parseable two-letter ASCII sequence.
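A parser for such a counter string might look like the sketch below; the "Xy:value" comma-separated layout is an assumption for illustration, not the documented format consumed by xen-tmem-list-parse:

```c
#include <stdlib.h>
#include <string.h>

/* Find a counter named by a two-letter code in a hypothetical
 * "Xy:value,Xy:value,..." statistics string. */
static long find_counter(const char *stats, const char *code)
{
    for (const char *p = stats; p && *p; ) {
        if (p[0] == code[0] && p[1] == code[1] && p[2] == ':')
            return strtol(p + 3, NULL, 10);
        p = strchr(p, ',');       /* advance to the next field */
        if (p)
            p++;
    }
    return -1;                    /* counter not present */
}
```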
Another piece of functionality that has a major impact on
the tmem code is support for save/restore of a tmem client and, highly related,
live migration of a tmem client.
Ephemeral pages, by definition, do not need to be saved or
live-migrated, but persistent pages are part of the state of a running VM and
so must be properly preserved.
When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step
is for the tmem client to be frozen (see the manageability section).
Next, tmem API version information is
recorded (to avoid possible incompatibility issues as the tmem spec evolves in
the future). Then, certain high-level
tmem structural information specific to the client is recorded, including
information about the existing pools.
Finally, the contents of all persistent pages are recorded.
For live-migration, the process is somewhat more complicated.
Ignoring tmem for a moment, recall that in
live migration, the vast majority of the VM's memory is transferred while the
VM is still fully operational. During
each phase, memory pages belonging to the VM that are changed are marked and
then retransmitted during a later phase.
Eventually only a small amount of memory remains, the VM is paused, the
remaining memory is transmitted, and the VM is unpaused on the target machine.
The number of persistent tmem pages may be quite large,
possibly even larger than all the other memory used by the VM; so it is
unacceptable to transmit persistent tmem pages during the "paused"
phase of live migration. But if the VM
is still operational, it may be making calls to tmem:
A frozen tmem client will reject any
put operations, but tmem must
still correctly process flushes
(page and object), including implicit flushes due to duplicate puts.
Fortunately, these operations can only
invalidate tmem pages, not overwrite tmem pages or create new pages.
So, when a live-migrate has been initiated,
the client is frozen. Then during the
"live" phase, tmem transmits all persistent pages, but also records
the handle of all persistent pages that are invalidated.
Then, during the "paused" phase,
only the handles of invalidated persistent pages are transmitted, resulting in
the invalidation on the target machine of any matching pages that were
previously transmitted during the "live" phase.
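The two phases can be modeled with a toy target-side table; small integer handles stand in for the real (pool_id, object_id, page_id) triples, and both function names are illustrative:

```c
#define MAX_HANDLES 64

/* Toy model of the persistent-page migration protocol: during the live
 * phase the source transmits pages and logs any handles invalidated by
 * flushes; during the paused phase only the log is transmitted, so the
 * target discards any stale copies it received earlier. */
struct target { unsigned char present[MAX_HANDLES]; };

static void xmit_page(struct target *t, int handle)   /* live phase */
{
    t->present[handle] = 1;
}

static void replay_invalidations(struct target *t,    /* paused phase */
                                 const int *log, int n)
{
    for (int i = 0; i < n; i++)
        t->present[log[i]] = 0;   /* matching page, if any, is dropped */
}
```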
For restore (and on the target machine of a live migration),
tmem must be capable of reconstructing the internal state of the client from
the saved/migrated data. However, it is
not the client itself that is put'ing
the pages but the management tools conducting the restore/migration.
This slightly complicates tmem by requiring
new API calls and new functions in the implementation, but the code is
structured so that duplication is minimized.
Once all tmem data structures for the client are reconstructed, all
persistent pages are recreated and, in the case of live-migration, all
invalidations have been processed and the client has been thawed, the restored
client can be resumed.
Finally, tmem's data structures must be cluttered a bit to
support save/restore/migration. Notably,
a per-pool list of persistent pages must be maintained and, during live
migration, a per-client list of invalidated pages must be logged.
A reader of the code will note that these
lists are overlaid into space-sensitive data structures as a union, which may
be more error-prone but eliminates significant space waste.
Miscellaneous Tmem Topics
One interesting corner case that
significantly complicates the tmem source code is the possibility
of a duplicate put, which occurs when two puts
are requested with the same handle but with possibly different data.
The tmem API addresses
coherence explicitly: When a duplicate
put occurs, tmem may react one of two ways: (1) The
put may succeed with the old
data overwritten by the new data, or (2) the
put may be failed with the original data flushed and
neither the old nor the new data accessible.
Tmem may not fail the
put and leave the old data accessible.
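The two permitted outcomes can be sketched with a toy single-buffer store; dup_put is a hypothetical helper, not a Xen function, and "fits" is simplified to an exact size match:

```c
#include <stdlib.h>
#include <string.h>

struct stored { unsigned char *data; size_t size; };

/* Duplicate put: overwrite in place when the new data fits the
 * existing allocation (option 1), otherwise flush the old data and
 * fail the put (option 2).  Failing while leaving the old data
 * accessible is never allowed. */
static int dup_put(struct stored *s, const unsigned char *new_data,
                   size_t new_size)
{
    if (new_size == s->size) {        /* option 1: overwrite in place */
        memcpy(s->data, new_data, new_size);
        return 1;
    }
    free(s->data);                    /* option 2: flush old data... */
    s->data = NULL;
    s->size = 0;
    return 0;                         /* ...and fail the put */
}
```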
When tmem has been actively working for an extended period,
system memory may be in short supply and it is possible for a memory allocation
for a page (or even a data structure such as a pgp_t) to fail. Thus,
for a duplicate
put, it may be impossible for tmem to temporarily
simultaneously maintain data structures and data for both the original
put and the duplicate put.
When the space required for the data is
identical, tmem may be able to overwrite in place the old data with
the new data (option 1). But in some circumstances, such as when data
is being compressed, overwriting is not always possible and option 2 must be
used.
Page deduplication and trailing-zero elimination.
When page deduplication is enabled
("tmem_dedup" option to Xen), ephemeral pages for which the contents
are identical -- whether the pages belong
to the same client or different clients -- utilize the same pageframe of
memory. In Xen environments where
multiple domains have a highly similar workload, this can save a substantial
amount of memory, allowing a much larger number of ephemeral pages to be
used. Tmem page deduplication uses
methods similar to the KSM implementation in Linux [ref], but differences between
the two are sufficiently great that tmem does not directly leverage the
code. In particular, ephemeral pages in
tmem are never dirtied, so need never be copied-on-write.
Like KSM, however, tmem avoids hashing,
instead employing red-black trees
that use the entire page contents as the lookup
key. There may be better ways to implement this.
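A minimal sketch of the lookup scheme, assuming 4KB pages; the function names are illustrative:

```c
#include <string.h>

#define PAGE_SIZE 4096

/* Deduplication lookup sketch: no hashing is used.  The first byte of
 * a page selects one of 256 red-black trees (each guarded by its own
 * read-write lock), and within a tree the entire page contents serve
 * as the comparison key. */
static unsigned page_tree_index(const unsigned char *page)
{
    return page[0];                 /* 256 parallel trees, one per value */
}

static int page_key_cmp(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, PAGE_SIZE); /* <0 / 0 / >0: an rb-tree ordering */
}
```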
Dedup'ed pages may optionally be compressed
("tmem_compress" and "tmem_dedup" Xen options specified),
to save even more space, at the cost of more time.
Additionally, trailing zero elimination (tze) may be applied to dedup'ed
pages. With tze, pages that contain a
significant number of zeroes at the end of the page are saved without the trailing
zeroes; an all-zero page requires no data to be saved at all.
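The tze size computation itself is simple; a minimal sketch assuming 4KB pages:

```c
#include <stddef.h>

#define PAGE_SIZE 4096

/* Trailing-zero elimination: only the bytes up to and including the
 * last nonzero byte need to be stored; an all-zero page stores
 * nothing at all. */
static size_t tze_payload_size(const unsigned char *page)
{
    size_t n = PAGE_SIZE;
    while (n > 0 && page[n - 1] == 0)
        n--;
    return n;                     /* bytes that must actually be kept */
}
```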
In certain workloads that utilize a large number
of small files (and for which the last partial page of a file is padded with
zeroes), a significant space savings can be realized without the high cost of
compression.
Both compression and tze significantly complicate memory
allocation. This will be discussed more below.
Accounting is boring, but poor accounting may
result in some interesting problems. In
the implementation-independent code of tmem, most data structures, page frames,
and partial pages (e.g. for compression) are billed to a pool,
and thus to a client. Some infrastructure data structures, such as
pools and clients, are allocated with tmh_alloc_infra(), which does not
require a pool to be specified. Two other
exceptions are page content descriptors (pcd_t)
and sharelists (sharelist_t) which
are explicitly not associated with a pool/client by specifying NULL instead of
a pool. (Note to self:
These should probably just use the tmh_alloc_infra() interface too.)
As we shall see, persistent pool pages and
data structures may need to be handled a bit differently, so the
implementation-independent layer calls a different allocation/free routine for
persistent pages (e.g. tmh_alloc_page_thispool())
than for ephemeral pages (e.g. tmh_alloc_page()).
In the Xen-specific layer, we
disregard the pool_t for ephemeral
pages, as we use the generic Xen heap for all ephemeral pages and data
structures. (Fairness across clients
can be handled in the implementation-independent layer because ephemeral pages
are kept in per-client queues each with a counted length.
See the discussion on weights and caps below.)
However we explicitly bill persistent pages
and data structures against the client/domain that is using them.
(See the calls to the Xen routine alloc_domheap_pages() in tmem_xen.h; if
the first argument is a domain, the pages allocated are billed by Xen to that
domain.) This means that a Xen domain
cannot allocate even a single tmem persistent page when it is currently utilizing
its maximum assigned memory allocation!
This is reasonable for persistent pages because, even though the data is
not directly accessible by the domain, the data is permanently saved until
either the domain flushes it or the domain dies.
Note that proper accounting requires (even for ephemeral pools) that the same
pool is referenced when memory is freed as when it was allocated, even if the
ownership of a pool has been moved from one client to another (c.f. shared_pool_reassign()).
The underlying Xen-specific information may
not always enforce this for ephemeral pools, but incorrect alloc/free matching
can cause some difficult-to-find memory leaks and bent pointers.
Page deduplication is not possible for persistent pools for
accounting reasons: Imagine a page that is created by persistent pool A, which
belongs to a domain that is currently well under its maximum allocation.
Then the pcd_t is matched by persistent pool B, which is
currently at its maximum.
Then the domain owning pool A is destroyed.
Is B beyond its maximum?
(There may be a clever way around this
problem. Exercise for the reader!)
Memory allocation. The implementation-independent layer assumes
there is a good fast general-purpose dynamic memory allocator with bounded
response time and efficient use of memory for a very large number of sub-page
allocations. The old xmalloc memory
allocator in Xen was not a good match for this purpose, so was replaced by the
TLSF allocator. Note that the TLSF
allocator is used only for allocations smaller than a page (and, more
precisely, no larger than tmem_subpage_maxsize());
full pages are allocated by Xen's normal heap allocator.
After the TLSF allocator was integrated into Xen, more work
was required so that each client could allocate memory from a separate
independent pool. (See the call to xmem_pool_create() in the
Xen-specific code.) This allows the data structures allocated for the
purpose of supporting persistent pages to be billed to the same client as the
pages themselves. It also allows partial
(e.g. compressed) pages to be properly billed.
Further, when partial page allocations cause internal fragmentation,
this fragmentation can be isolated per-client.
And, when a domain dies, full pages can be freed, rather than only
partial pages. One other change was
required in the TLSF allocator: In the original version, when a TLSF memory
pool was allocated, the first page of memory was also allocated.
Since, for a persistent pool, this page would
be billed to the client, the allocation of the first page failed if the domain
was started at its maximum memory, and this resulted in a failure to create the
memory pool. To avoid this, the code was
changed to delay the allocation of the first page until first use of the memory
pool.
Memory allocation interdependency.
As previously described,
pages of memory must be moveable back and forth between the Xen heap and the
tmem ephemeral lists (and page lists).
When tmem needs a page but doesn't have one, it requests one from the
Xen heap (either indirectly via xmalloc, or directly via Xen's alloc_domheap_pages()).
And when Xen needs a page but doesn't have
one, it requests one from tmem (via a call to tmem_relinquish_pages() in Xen's alloc_heap_pages() in page_alloc.c).
This leads to a potential infinite loop!
To break this loop, a new memory flag (MEMF_tmem) was added to Xen
to flag and disallow the loop.
Note that the tmem_relinquish_pages() interface allows for memory requests of
order > 0 (multiple contiguous pages), but the tmem implementation disallows
any requests larger than a single page.
LRU page reclamation.
Ephemeral pages generally age in
a queue, and the space associated with the oldest -- or least-recently-used -- page is reclaimed when tmem needs more
memory. But there are a few exceptions
to strict LRU queuing. First is when
removal from a queue is constrained by locks, as previously described above.
Second, when an ephemeral pool is shared, unlike a private ephemeral
pool, a get does not imply a flush.
Instead, in a shared pool, a get
results in the page being promoted to the front of the queue.
Third, when a page that is deduplicated (i.e.
is referenced by more than one pgp_t)
reaches the end of the LRU queue, it is marked as "eviction attempted" and promoted to the front of the queue; if it
reaches the end of the queue a second time, eviction occurs.
Note that only the pgp_t is evicted; the actual data is only reclaimed if there is no
other pgp_t pointing to the data.
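The second-chance policy for dedup'ed pages can be sketched as follows; should_evict and the field names are illustrative:

```c
/* Second chance at the LRU tail: a page descriptor whose content is
 * shared (its pcd_t has more than one reference) is marked and
 * recycled to the queue front on its first trip; only on its second
 * trip is it actually evicted. */
struct pgp { int pcd_refs; int eviction_attempted; };

/* Returns 1 = evict now, 0 = mark and promote to the queue front. */
static int should_evict(struct pgp *p)
{
    if (p->pcd_refs > 1 && !p->eviction_attempted) {
        p->eviction_attempted = 1;  /* first trip: second chance */
        return 0;
    }
    return 1;                       /* unshared page, or second trip */
}
```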
All of these modified-LRU algorithms deserve to be studied
carefully against a broad range of workloads.
When compression or tze is enabled, allocations between a half-page and a full-page
in size are very common and this places a great deal of pressure on even the
best memory allocator. Additionally,
problems may be caused for memory reclamation: When one tmem ephemeral page is
evicted, only a fragment of a physical page of memory might be reclaimed.
As a result, when compression or tze is
enabled, it may take a very large number of eviction attempts to free up a full
contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction
must be assumed to be able to fail.
While all memory allocation paths in tmem are resilient to failure, very
complex corner cases may eventually occur.
As a result, compression and tze are disabled by default and should be
used with caution until they have been tested with a much broader set of
workloads. (Note to self: The
code needs work.)
Weights and caps. Because
of the just-discussed LRU-based eviction algorithms, a client that uses tmem at
a very high frequency can quickly swamp tmem so that it provides little benefit
to a client that uses it less frequently.
To reduce the possibility of this denial-of-service, limits can be
specified via management tools that are enforced internally by tmem.
On Xen, the "xm tmem-set" command
can specify "weight=<weight>" or "cap=<cap>"
for any client. If weight is non-zero
for a client and the current percentage of ephemeral pages in use by the client
exceeds its share (as measured by the sum of weights of all clients), the next
page chosen for eviction is selected from the requesting client's ephemeral
queue, instead of the global ephemeral queue that contains pages from all
clients. Setting a cap for a client is currently a no-op.
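The weight check can be sketched with integer arithmetic; the function name and exact comparison are an assumption based on the description above, with cross-multiplication avoiding floating point as hypervisor code would:

```c
/* Decide where the next eviction comes from: a client whose share of
 * all ephemeral pages exceeds weight/total_weight has the next page
 * evicted from its own queue rather than the global one. */
static int evict_from_own_queue(long client_pages, long total_pages,
                                long weight, long total_weight)
{
    if (weight == 0 || total_pages == 0 || total_weight == 0)
        return 0;          /* no weight configured: use global LRU */
    /* client_pages/total_pages > weight/total_weight ? */
    return client_pages * total_weight > weight * total_pages;
}
```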
Shared pools and authentication.
When tmem was first proposed to the linux kernel mailing list
(LKML), there was concern expressed about security of shared ephemeral
pools. The initial tmem implementation only
required a client to provide a 128-bit UUID to identify a shared pool, and the
linux-side tmem implementation obtained this UUID from the superblock of the
shared filesystem (in ocfs2). It was
pointed out on LKML that the UUID was essentially a security key and any
malicious domain that guessed it would have access to any data from the shared
filesystem that found its way into tmem.
Ocfs2 has only very limited security; it is assumed that anyone who can
access the filesystem bits on the shared disk can mount the filesystem and use
it. But in a virtualized data center,
higher isolation requirements may apply.
As a result, a Xen boot option -- "tmem_shared_auth" -- was
added. The option defaults to disabled,
but when it is enabled, management tools must explicitly authenticate (or may
explicitly deny) shared pool access to any client.
On Xen, this is done with the "xm tmem-shared-auth" command.
There was some effort put into getting tmem working on a 32-bit Xen.
However, the Xen heap is limited in size on
32-bit Xen so tmem did not work very well.
There are still 32-bit ifdefs in some places in the code, but things may
have bit-rotted so using tmem on a 32-bit Xen is not recommended.
The vast majority of the tmem
implementation is architecture-independent.
For tmem to run on Xen/ia64, it is believed that only one or two
routines need to be written. (See the
#ifdef __ia64__ at cli_mfn_to_va().)
When tmem is active, all physical memory becomes fragmented
into individual pages. However, the Xen
memory allocator allows memory to be requested in multi-page contiguous
quantities, called order>0 allocations.
(The number of contiguous pages is 2^order, so
order==4 is sixteen contiguous pages.)
In some cases, a request for a larger order will fail gracefully if no
matching contiguous allocation is available from Xen.
As of Xen 4.0, however, there are several
critical order>0 allocation requests that do not fail gracefully.
Notably, when a domain is created, an
order==4 structure is required or the domain creation will fail.
And shadow paging requires many order==2
allocations; if these fail, a PV live-migration may fail.
There are likely other such issues.
But, fragmentation can occur even without tmem if any domU does
any extensive ballooning; tmem just accelerates the fragmentation.
So the fragmentation problem must be solved
anyway. The best solution is to disallow
order>0 allocations altogether in Xen -- or at least ensure that any attempt
to allocate order>0 can fail gracefully, e.g. by falling back to a sequence
of single page allocations. However this restriction may require a major rewrite
in some of Xen's most sensitive code.
(Note that order>0 allocations during Xen boot and early in domain0
launch are safe and, if dom0 does not enable tmem, any order>0 allocation by
dom0 is safe, until the first domU is created.)
Until Xen can be rewritten to be fragmentation-safe, a small hack
was added in the Xen page
allocator. (See the comment "memory is scarce" in alloc_heap_pages().)
Briefly, a portion of memory is pre-reserved
for allocations where order>0 and order<9.
(Domain creation uses 2MB pages, but fails
gracefully, and there are no other known order==9 allocations or order>9
allocations currently in Xen.)
NUMA. Tmem assumes that
all memory pages are equal and any RAM page can store a page of data for any
client. This has potential performance
consequences in any NUMA machine where access to far memory is significantly slower than access to near memory.
On nearly all of today's servers, however,
access times to far memory are still
much faster than access to disk or network-based storage, and tmem's primary performance
advantage comes from the fact that paging and swapping are reduced.
So, the current tmem implementation ignores
NUMA-ness; future tmem design for NUMA machines is an exercise left for the
reader.