This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


[Xen-changelog] [xen-unstable] docs: Add tmem documentation

To: xen-changelog@xxxxxxxxxxxxxxxxxxx
Subject: [Xen-changelog] [xen-unstable] docs: Add tmem documentation
From: Xen patchbot-unstable <patchbot-unstable@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 22 Apr 2010 04:30:24 -0700
Delivery-date: Thu, 22 Apr 2010 04:31:09 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-changelog-request@lists.xensource.com?subject=help>
List-id: BK change log <xen-changelog.lists.xensource.com>
List-post: <mailto:xen-changelog@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-changelog>, <mailto:xen-changelog-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-changelog>, <mailto:xen-changelog-request@lists.xensource.com?subject=unsubscribe>
Reply-to: xen-devel@xxxxxxxxxxxxxxxxxxx
Sender: xen-changelog-bounces@xxxxxxxxxxxxxxxxxxx
# HG changeset patch
# User Keir Fraser <keir.fraser@xxxxxxxxxx>
# Date 1271835091 -3600
# Node ID c7d7797656dfe6c41ab59954a7bba1a2748c6836
# Parent  b36467432effd0c7a94c2677df978b509c0497f7
docs: Add tmem documentation

Signed-off-by: Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>
 docs/misc/tmem-internals.html |  798 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 798 insertions(+)

diff -r b36467432eff -r c7d7797656df docs/misc/tmem-internals.html
--- /dev/null   Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/misc/tmem-internals.html     Wed Apr 21 08:31:31 2010 +0100
@@ -0,0 +1,798 @@
+<h1>Transcendent Memory Internals in Xen</h1>
+by Dan Magenheimer, Oracle Corp.</p>
+Draft 0.1 -- Updated: 20100324
+This document focuses on the internal implementation of
+Transcendent Memory (tmem) on Xen.  It assumes
+that the reader has a basic knowledge of the terminology, objectives, and
+functionality of tmem and also has access to the Xen source code.
+It corresponds to the Xen 4.0 release, with
+patch added to support page deduplication (V2).
+The primary responsibilities of the tmem implementation are to:
+<li>manage a potentially huge and extremely dynamic
+number of memory pages from a potentially large number of clients (domains)
+with low memory overhead and proper isolation
+<li>provide quick and efficient access to these
+pages with as much concurrency as possible
+<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when
+memory is fully utilized)
+<li>optionally, increase page density through compression and/or
+<li>where necessary, properly assign and account for
+memory belonging to guests to avoid malicious and/or accidental unfairness
+and/or denial-of-service
+<li>record utilization statistics and make them available to management tools
+<h2>Source Code Organization</h2>
+The source code in Xen that provides the tmem functionality
+is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h.
+The files tmem.c and tmem.h are intended to
+be implementation- (and hypervisor-) independent and the other two files
+provide the Xen-specific code.  This
+division is intended to make it easier to port tmem functionality to other
+hypervisors, though at this time porting to other hypervisors has not been
+attempted.  Together, these four files
+total less than 4000 lines of C code.
+Even ignoring the implementation-specific functionality, the
+implementation-independent part of tmem has several dependencies on
+library functionality (Xen source filenames in parentheses):
+a good fast general-purpose dynamic memory
+allocator with bounded response time and efficient use of memory for a very
+large number of sub-page allocations.  To
+achieve this in Xen, the bad old memory allocator was replaced with a
+slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by
+Nitin Gupta for compcache.
+good tree data structure libraries, specifically
+<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c).
+Code for these was borrowed for Linux and adapted for tmem and Xen.
+good locking and list code.  Both of these existed in Xen and required
+little or no change.
+optionally, a good fast lossless compression
+library.  The Xen implementation added to
+support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta.
+More information about the specific functionality of these
+libraries can easily be found through a search engine, via wikipedia, or in the
+Xen or Linux source logs so we will not elaborate further here.
+The tmem code uses several prefixes and abbreviations.
+Knowledge of these will improve code readability:
+<i>tmh</i> ==
+transcendent memory host.  Functions or
+data structures that are defined by the implementation-specific code, i.e. the
+Xen host code
+== transcendent memory control.
+Functions or data structures that provide management tool functionality,
+rather than core tmem operations.
+<i>cli </i>or
+<i>client</i> == client.
+The tmem generic term for a domain or a guest OS.
+When used in prose, common tmem operations are indicated
+with a different font, such as <big><kbd>put</kbd></big>
+and <big><kbd>get</kbd></big>.
+<h2>Key Data Structures</h2>
+To manage a huge number of pages, efficient data structures
+must be carefully selected.
+Recall that a tmem-enabled guest OS may create one or more
+pools with different attributes.  It then
+<kbd>put</kbd></big>s and <kbd>get</kbd></big>s
+pages to/from this pool, identifying the page
+with a <i>handle</i> that consists of a <i>pool_id</i>, an <i>
+object_id</i>, and a <i>page_id </i>(sometimes
+called an <i>index</i>).
+This suggests a few obvious core data
+When a guest OS first calls tmem, a <i>client_t</i> is created to contain
+and track all uses of tmem by that guest OS.  Among
+other things, a <i>client_t</i> keeps pointers
+to a fixed number of pools (16 in the current Xen implementation).
+When a guest OS requests a new pool, a <i>pool_t</i> is created.
+Some pools are shared and are kept in a
+sharelist (<i>sharelist_t</i>) which points
+to all the clients that are sharing the pool.
+Since an <i>object_id</i> is 64-bits,
+a <i>pool_t</i> must be able to keep track
+of a potentially very large number of objects.
+To do so, it maintains a number of parallel trees (256 in the current
+Xen implementation) and a hash algorithm is applied to the <i>object_id</i>
+to select the correct tree.
+Each tree element points to an object.
+Because an <i>object_id</i> usually represents an <i>inode</i>
+(a unique file number identifier), and <i>inode</i> numbers
+are fairly random, though often &quot;clumpy&quot;, a <i>red-black tree</i>
+is used.
+When a guest first
+<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> 
+<i>obj_t</i> is created.  Since a <i
+>page_id</i> is usually an index into a file,
+it is often a small number, but may sometimes be very large (up to
+32-bits).  A <i>radix tree</i> is a good data structure to contain items
+with this kind of index distribution.
+When a page is
+<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which
+among other things will point to the storage location where the data is kept.
+In the normal case the pointer is to a <i>pfp_t</i>, which is an
+implementation-specific datatype representing a physical pageframe in memory
+(which in Xen is a &quot;struct page_info&quot;).
+When deduplication is enabled, it points to
+yet another data structure, a <i>pcd_</i>t
+(see below).  When compression is enabled
+(and deduplication is not), the pointer points directly to the compressed data.
+For reasons we will see shortly, each <i>pgp_t</i> that represents
+an <i>ephemeral</i> page (that is, a page placed
+in an <i>ephemeral</i> pool) is also placed
+into two doubly-linked linked lists, one containing all ephemeral pages
+<kbd>put</kbd></big> by the same client and one
+containing all ephemeral pages across all clients (&quot;global&quot;).
+When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to
+the same data, so another data structure (and level of indirection) is used
+called a page content descriptor, or <i>pcd_t</i>.
+Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>.
+The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i>
+(if a full page of data), directly to a
+location in memory (if the page has been compressed or trailing zeroes have
+been eliminated), or even a NULL pointer (if the page contained all zeroes and
+trailing zero elimination is enabled).
+The most apparent usage of this multi-layer web of data structures
+is &quot;top-down&quot; because, in normal operation, the vast majority of tmem
+operations invoked by a client are
+<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various
+data structures to be walked starting with the <i>client_t</i>, then
+a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>.
+However, there is another highly frequent tmem operation that is not
+visible from a client: memory reclamation.
+Since tmem attempts to use all spare memory in the system, it must
+frequently free up, or <i>evict</i>,
+pages.  The eviction algorithm will be
+explained in more detail later but, in brief, to free memory, ephemeral pages
+are removed from the tail of one of the doubly-linked lists, which means that
+all of the data structures associated with that page-to-be-removed must be
+updated or eliminated and freed.  As a
+result, each data structure also contains a <i>back-pointer</i>
+to its parent, for example every <i>obj_t</i>
+contains a pointer to its containing <i>pool_t</i>.
+This complex web of interconnected data structures is updated constantly and
+thus extremely sensitive to careless code changes which, for example, may
+result in unexpected hypervisor crashes or non-obvious memory leaks.
+On the other hand, the code is fairly well
+modularized so, once understood, it is possible to relatively easily switch out
+one kind of data structure for another.
+To catch problems as quickly as possible when debug is enabled, most of
+the data structures are equipped with <i>sentinels</i>and many inter-function
+assumptions are documented and tested dynamically
+with <i>assertions</i>.
+While these clutter and lengthen the tmem
+code substantially, their presence has proven invaluable on many occasions.
+For completeness, we should also describe a key data structure in the Xen
+implementation-dependent code: the <i>tmh_page_list</i>. For security and
+performance reasons, pages that are freed due to tmem operations (such
+as <kbd>get</kbd></big>) are not immediately put back into Xen's pool
+of free memory (aka the Xen <i>heap</i>).
+Tmem pages may contain guest-private data that must be <i>scrubbed</i> before
+those memory pages are released for the use of other guests.
+But if a page is immediately re-used inside of tmem itself, the entire
+page is overwritten with new data, so need not be scrubbed.
+Since tmem is usually the most frequent
+customer of the Xen heap allocation code, it would be a waste of time to scrub
+a page, release it to the Xen heap, and then immediately re-allocate it
+again.  So, instead, tmem maintains
+currently-unused pages of memory on its own free list, <i>tmh_page_list</i>,
+and returns the pages to Xen only when non-tmem Xen
+heap allocation requests would otherwise fail.
+<P>Tmem has been designed to be highly scalable.
+Since tmem access is invoked similarly in
+many ways to asynchronous disk access, a &quot;big SMP&quot; tmem-aware guest
+OS can, and often will, invoke tmem hypercalls simultaneously on many different
+physical CPUs.  And, of course, multiple
+tmem-aware guests may independently and simultaneously invoke tmem
+hypercalls.  While the normal frequency
+of tmem invocations is rarely extremely high, some tmem operations such as data
+compression or lookups in a very large tree may take tens of thousands of
+cycles or more to complete.  Measurements
+have shown that normal workloads spend no more than about 0.2% (2% with
+compression enabled) of CPU time executing tmem operations.
+But those familiar with OS scalability issues
+recognize that even this limited execution time can create concurrency problems
+in large systems and result in poorly-scalable performance.
+A good locking strategy is critical to concurrency, but also
+must be designed carefully to avoid deadlock and <i>livelock</i> problems.  For
+debugging purposes, tmem supports a &quot;big kernel lock&quot; which disables
+concurrency altogether (enabled in Xen with &quot;tmem_lock&quot;, but note
+that this functionality is rarely tested and likely has bit-rotted). Infrequent
+but invasive tmem hypercalls, such as pool creation or the control operations,
+are serialized on a single <i>read-write lock</i>, called tmem_rwlock,
+which must be held for writing.  All other tmem operations must hold this lock
+for reading, so frequent operations such as
+<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can 
execute simultaneously
+as long as no invasive operations are occurring.
+Once a pool has been selected, there is a per-pool
+read-write lock (<i>pool_rwlock</i>) which
+must be held for writing if any transformative operations might occur within
+that pool, such as when an<i> obj_t</i> is
+created or destroyed.  For the highly
+frequent operation of finding an<i> obj_t</i>
+within a pool, pool_rwlock must be held for reading.
+Once an object has been selected, there is a per-object
+spinlock (<i>obj_spinlock)</i>.
+This is a spinlock rather than a read-write
+lock because nearly all of the most frequent tmem operations (e.g.
+<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>)
+are transformative, in
+that they add or remove a page within the object.
+This lock is generally taken whenever an
+object lookup occurs and released when the tmem operation is complete.
+Next, the per-client and global ephemeral lists are
+protected by a single global spinlock (<i>eph_lists_</i>spinlock)
+and the per-client persistent lists are also protected by a single global
+spinlock (<i>pers_list_spinlock</i>).
+And to complete the description of
+implementation-independent locks, if page deduplication is enabled, all pages
+for which the first byte match are contained in one of 256 trees that are
+protected by one of 256 corresponding read-write locks
+In the Xen-specific code (tmem_xen.c), page frames (e.g.  struct page_info)
+that have been released are kept in a list (<i>tmh_page_list</i>) that
+is protected by a spinlock (<i>tmh_page_list_lock</i>).
+There is also an &quot;implied&quot; lock
+associated with compression, which is likely the most time-consuming operation
+in all of tmem (of course, only when compression is enabled): A compression
+buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to
+this buffer is returned to implementation-independent code and used without a
+The proper method to avoid deadlocks is to take and release
+locks in a very specific predetermined order.
+Unfortunately, since tmem data structures must simultaneously be
+accessed &quot;top-down&quot; (
+<kbd>put</kbd></big> and <kbd>get</kbd></big>)
+and &quot;bottoms-up&quot;
+(memory reclamation), more complex methods must be employed:
+A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>),
+which takes the lock if it is available but returns immediately (rather than
+spinning and waiting) if the lock is not available.
+When walking the ephemeral list to identify
+pages to free, any page that belongs to an object that is locked is simply
+skipped.  Further, if the page is the
+last page belonging to an object, and the pool read-write lock for the pool the
+object belongs to is not available (for writing), that object is skipped.
+These constraints modify the LRU algorithm
+somewhat, but avoid the potential for deadlock.
+Unfortunately, a livelock was still discovered in this approach:
+When memory is scarce and each client is
+<kbd>put</kbd></big>ting a large number of pages
+for exactly one object (and thus holding the object spinlock for that object),
+memory reclamation takes a very long time to determine that it is unable to
+free any pages, and so the time to do a
+<kbd>put</kbd></big> (which eventually fails) becomes linear to the
+number of pages in the object!  To avoid
+this situation, a workaround was added to always ensure a minimum amount of
+memory (1MB) is available before any object lock is taken for the client
+invoking tmem (see <i>tmem_ensure_avail_pages()</i>).
+Other such livelocks (and perhaps deadlocks)
+may be lurking.
+A last issue related to concurrency is atomicity of counters.
+Tmem gathers a large number of
+statistics.  Some of these counters are
+informational only, while some are critical to tmem operation and must be
+incremented and decremented atomically to ensure, for example, that the number
+of pages in a tree never goes negative if two concurrent tmem operations access
+the counter exactly simultaneously.  Some
+of the atomic counters are used for debugging (in assertions) and perhaps need
+not be atomic; fixing these may increase performance slightly by reducing
+cache-coherency traffic.  Similarly, some
+of the non-atomic counters may yield strange results to management tools, such
+as showing the total number of successful
+<kbd>put</kbd></big>s as being higher than the number of
+<kbd>put</kbd></big>s attempted.
+These are left as exercises for future tmem implementors.
+<h2>Control and Manageability</h2>
+Tmem has a control interface to, for example, set various
+parameters and obtain statistics.  All
+tmem control operations funnel through <i>do_tmem_control()</i>
+and other functions supporting tmem control operations are prefixed
+with <i>tmemc_</i>.
+During normal operation, even if only one tmem-aware guest
+is running, tmem may absorb nearly all free memory in the system for its own
+use.  Then if a management tool wishes to
+create a new guest (or migrate a guest from another system to this one), it may
+notice that there is insufficient &quot;free&quot; memory and fail the creation
+(or migration).  For this reason, tmem
+introduces a new tool-visible class of memory -- <i>freeable</i> memory --
+and provides a control interface to access
+it.  All ephemeral memory and all pages on the <i>tmh_page_list</i>
+are freeable. To properly access freeable
+memory, a management tool must follow a sequence of steps:
+tmem:When tmem is frozen, all 
+<kbd>put</kbd></big>s fail, which ensures that no
+additional memory may be absorbed by tmem.
+(See <i>tmemc_freeze_pools()</i>, and
+note that individual clients may be frozen, though this functionality may be
+used only rarely.)
+<i>query freeable MB: </i>If all freeable memory were released to the Xen
+heap, this is the amount of memory (in MB) that would be freed.
+See <i>tmh_freeable_pages()</i>.
+Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g.
+back to the Xen heap.  This amount is
+specified in KB.  See <i
+>tmemc_flush_mem()</i> and <i
+At this point the management tool may allocate
+the memory, e.g. using Xen's published interfaces.
+tmem: This terminates the freeze, allowing tmem to accept 
+<kbd>put</kbd></big>s again.
+Extensive tmem statistics are available through tmem's
+control interface (see <i>tmemc_list </i>and
+the separate source for the &quot;xm tmem-list&quot; command and the
+xen-tmem-list-parse tool).  To maximize
+forward/backward compatibility with future tmem and tools versions, statistical
+information is passed via an ASCII interface where each individual counter is
+identified by an easily parseable two-letter ASCII sequence.
+Another piece of functionality that has a major impact on
+the tmem code is support for save/restore of a tmem client and, highly related,
+live migration of a tmem client.
+Ephemeral pages, by definition, do not need to be saved or
+live-migrated, but persistent pages are part of the state of a running VM and
+so must be properly preserved.
+When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step
+is for the tmem client to be frozen (see the manageability section).
+Next, tmem API version information is
+recorded (to avoid possible incompatibility issues as the tmem spec evolves in
+the future).  Then, certain high-level
+tmem structural information specific to the client is recorded, including
+information about the existing pools.
+Finally, the contents of all persistent pages are recorded.
+For live-migration, the process is somewhat more complicated.
+Ignoring tmem for a moment, recall that in
+live migration, the vast majority of the VM's memory is transferred while the
+VM is still fully operational.  During
+each phase, memory pages belonging to the VM that are changed are marked and
+then retransmitted during a later phase.
+Eventually only a small amount of memory remains, the VM is paused, the
+remaining memory is transmitted, and the VM is unpaused on the target machine.
+The number of persistent tmem pages may be quite large,
+possibly even larger than all the other memory used by the VM; so it is
+unacceptable to transmit persistent tmem pages during the &quot;paused&quot;
+phase of live migration.  But if the VM
+is still operational, it may be making calls to tmem:
+A frozen tmem client will reject any 
+<big><kbd>put</kbd></big> operations, but tmem must
+still correctly process <big><kbd>flush</kbd></big>es
+(page and object), including implicit flushes due to duplicate 
+Fortunately, these operations can only
+invalidate tmem pages, not overwrite tmem pages or create new pages.
+So, when a live-migrate has been initiated,
+the client is frozen.  Then during the
+&quot;live&quot; phase, tmem transmits all persistent pages, but also records
+the handle of all persistent pages that are invalidated.
+Then, during the &quot;paused&quot; phase,
+only the handles of invalidated persistent pages are transmitted, resulting in
+the invalidation on the target machine of any matching pages that were
+previously transmitted during the &quot;live&quot; phase.
+For restore (and on the target machine of a live migration),
+tmem must be capable of reconstructing the internal state of the client from
+the saved/migrated data.  However, it is
+not the client itself that is <big><kbd>put</kbd></big>'ing
+the pages but the management tools conducting the restore/migration.
+This slightly complicates tmem by requiring
+new API calls and new functions in the implementation, but the code is
+structured so that duplication is minimized.
+Once all tmem data structures for the client are reconstructed, all
+persistent pages are recreated and, in the case of live-migration, all
+invalidations have been processed and the client has been thawed, the restored
+client can be resumed.
+Finally, tmem's data structures must be cluttered a bit to
+support save/restore/migration.  Notably,
+a per-pool list of persistent pages must be maintained and, during live
+migration, a per-client list of invalidated pages must be logged.
+A reader of the code will note that these
+lists are overlaid into space-sensitive data structures as a union, which may
+be more error-prone but eliminates significant space waste.
+<h2>Miscellaneous Tmem Topics</h2>
+<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>.
+One interesting corner case that
+significantly complicates the tmem source code is the possibility
+of a <i>duplicate</i>
+which occurs when two
+are requested with the same handle but with possibly different data.
+The tmem API addresses
+coherence</i> explicitly: When a duplicate
+<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The 
+<big><kbd>put</kbd></big> may succeed with the old
+data overwritten by the new data, or (2) the
+<big><kbd>put</kbd></big> may be failed with the original data flushed and
+neither the old nor the new data accessible.
+Tmem may <i>not</i> fail the 
+<big><kbd>put</kbd></big> and leave the old data accessible.
+When tmem has been actively working for an extended period,
+system memory may be in short supply and it is possible for a memory allocation
+for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus,
+for a duplicate 
+<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily
+simultaneously maintain data structures and data for both the original 
+<big><kbd>put</kbd></big> and the duplicate 
+When the space required for the data is
+identical, tmem may be able to overwrite <i>in place </i>the old data with
+the new data (option 1).  But in some circumstances, such as when data
+is being compressed, overwriting is not always possible and option 2 must be
+<i><b>Page deduplication and trailing-zero elimination.</b></i>
+When page deduplication is enabled
+(&quot;tmem_dedup&quot; option to Xen), ephemeral pages for which the contents
+are identical -- whether the pages belong
+to the same client or different clients -- utilize the same pageframe of
+memory.  In Xen environments where
+multiple domains have a highly similar workload, this can save a substantial
+amount of memory, allowing a much larger number of ephemeral pages to be
+used.  Tmem page deduplication uses
+methods similar to the KSM implementation in Linux [ref], but differences 
+the two are sufficiently great that tmem does not directly leverage the
+code.  In particular, ephemeral pages in
+tmem are never dirtied, so need never be <i>copied-on-write</i>.
+Like KSM, however, tmem avoids hashing,
+instead employing <i>red-black trees</i>
+that use the entire page contents as the <i>lookup
+key</i>.  There may be better ways to implement this.
+Dedup'ed pages may optionally be compressed
+(&quot;tmem_compress&quot; and &quot;tmem_dedup&quot; Xen options specified),
+to save even more space, at the cost of more time.
+Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed
+pages.  With tze, pages that contain a
+significant number of zeroes at the end of the page are saved without the 
+zeroes; an all-zero page requires no data to be saved at all.
+In certain workloads that utilize a large number
+of small files (and for which the last partial page of a file is padded with
+zeroes), a significant space savings can be realized without the high cost of
+Both compression and tze significantly complicate memory
+allocation.  This will be discussed more below.
+<b><i>Memory accounting</i>.</b>
+Accounting is boring, but poor accounting may
+result in some interesting problems.  In
+the implementation-independent code of tmem, most data structures, page frames,
+and partial pages (e.g. for compresssion) are <i>billed</i> to a pool,
+and thus to a client.  Some <i>infrastructure</i> data structures, such as
+pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not
+require a pool to be specified.  Two other
+exceptions are page content descriptors (<i>pcd_t</i>)
+and sharelists (<i>sharelist_t</i>) which
+are explicitly not associated with a pool/client by specifying NULL instead of
+a <i>pool_t</i>.
+(Note to self:
+These should probably just use the <i>tmh_alloc_infra()</i> interface too.)
+As we shall see, persistent pool pages and
+data structures may need to be handled a bit differently, so the
+implementation-independent layer calls a different allocation/free routine for
+persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>)
+than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>).
+In the Xen-specific layer, we
+disregard the <i>pool_t</i> for ephemeral
+pages, as we use the generic Xen heap for all ephemeral pages and data
+structures.(Denial-of-service attacks
+can be handled in the implementation-independent layer because ephemeral pages
+are kept in per-client queues each with a counted length.
+See the discussion on weights and caps below.)
+However we explicitly bill persistent pages
+and data structures against the client/domain that is using them.
+(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; 
+the first argument is a domain, the pages allocated are billed by Xen to that
+domain.)This means that a Xen domain
+cannot allocate even a single tmem persistent page when it is currently 
+its maximum assigned memory allocation!
+This is reasonable for persistent pages because, even though the data is
+not directly accessible by the domain, the data is permanently saved until
+either the domain flushes it or the domain dies.
+Note that proper accounting requires (even for ephemeral pools) that the same
+pool is referenced when memory is freed as when it was allocated, even if the
+ownership of a pool has been moved from one client to another (c.f. <i
+The underlying Xen-specific information may
+not always enforce this for ephemeral pools, but incorrect alloc/free matching
+can cause some difficult-to-find memory leaks and bent pointers.
+Page deduplication is not possible for persistent pools for
+accounting reasons: Imagine a page that is created by persistent pool A, which
+belongs to a domain that is currently well under its maximum allocation.
+Then the <i>pcd_t</i>is matched by persistent pool B, which is
+currently at its maximum.
+Then the domain owning pool A is destroyed.
+Is B beyond its maximum?
+(There may be a clever way around this
+problem.  Exercise for the reader!)
+<b><i>Memory allocation.</i></b> The implementation-independent layer assumes
+there is a good fast general-purpose dynamic memory allocator with bounded
+response time and efficient use of memory for a very large number of sub-page
+allocations.  The old xmalloc memory
+allocator in Xen was not a good match for this purpose, so was replaced by the
+TLSF allocator.  Note that the TLSF
+allocator is used only for allocations smaller than a page (and, more
+precisely, no larger than <i>tmem_subpage_maxsize()</i>);
+full pages are allocated by Xen's normal heap allocator.
+After the TLSF allocator was integrated into Xen, more work
+was required so that each client could allocate memory from a separate
+independent pool. (See the call to <i>xmem_pool_create()</i>in
+This allows the data structures allocated for the
+purpose of supporting persistent pages to be billed to the same client as the
+pages themselves.  It also allows partial
+(e.g. compressed) pages to be properly billed.
+Further, when partial page allocations cause internal fragmentation,
+this fragmentation can be isolated per-client.
+And, when a domain dies, full pages can be freed, rather than only
+partial pages. One other change was
+required in the TLSF allocator: In the original version, when a TLSF memory
+pool was allocated, the first page of memory was also allocated.
+Since, for a persistent pool, this page would
+be billed to the client, the allocation of the first page failed if the domain
+was started at its maximum memory, and this resulted in a failure to create the
+memory pool.  To avoid this, the code was
+changed to delay the allocation of the first page until first use of the memory
+<b><i>Memory allocation interdependency.</i></b>
+As previously described,
+pages of memory must be moveable back and forth between the Xen heap and the
+tmem ephemeral lists (and page lists).
+When tmem needs a page but doesn't have one, it requests one from the
+Xen heap (either indirectly via xmalloc, or directly via Xen's <i
+And when Xen needs a page but doesn't have
+one, it requests one from tmem (via a call to <i
+>tmem_relinquish_pages()</i> in Xen's <i
+>alloc_heap_pages() </i>in page_alloc.c).
+This leads to a potential infinite loop!
+To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen
+to flag and disallow the loop.
+See <i>tmh_called_from_tmem()</i>
+in <i>tmem_relinquish_pages()</i>.
+Note that the <i
+>tmem_relinquish_pages()</i> interface allows for memory requests of
+order &gt; 0 (multiple contiguous pages), but the tmem implementation disallows
+any requests larger than a single page.
+<b><i>LRU page reclamation</i></b>.
+Ephemeral pages generally <i>age </i>in
+a queue, and the space associated with the oldest -- or <i
+>least-recently-used -- </i>page is reclaimed when tmem needs more
+memory.  But there are a few exceptions
+to strict LRU queuing.  First is when
+removal from a queue is constrained by locks, as previously described above.
+Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral
+pool, a
+does not imply a
+Instead, in a shared pool, a 
+results in the page being promoted to the front of the queue.
+Third, when a page that is deduplicated (i.e.
+is referenced by more than one <i>pgp_</i>t)
+reaches the end of the LRU queue, it is marked as <i
+>eviction attempted</i> and promoted to the front of the queue; if it
+reaches the end of the queue a second time, eviction occurs.
+Note that only the <i
+>pgp_</i>t is evicted; the actual data is only reclaimed if there is no
+other <i>pgp_t </i>pointing to the data.
+All of these modified- LRU algorithms deserve to be studied
+carefully against a broad range of workloads.
+<b><i>Internal fragmentation</i>.</b>
+compression or tze is enabled, allocations between a half-page and a full-page
+in size are very common and this places a great deal of pressure on even the
+best memory allocator.  Additionally,
+problems may be caused for memory reclamation: When one tmem ephemeral page is
+evicted, only a fragment of a physical page of memory might be reclaimed.
+As a result, when compression or tze is
+enabled, it may take a very large number of eviction attempts to free up a full
+contiguous page of memory and so, to avoid near-infinite loops and livelocks, 
+must be assumed to be able to fail.
+While all memory allocation paths in tmem are resilient to failure, very
+complex corner cases may eventually occur.
+As a result, compression and tze are disabled by default and should be
+used with caution until they have been tested with a much broader set of
+workloads.(Note to self: The 
+code needs work.)
+<b><i>Weights and caps</i>.</b>
+of the just-discussed LRU-based eviction algorithms, a client that uses tmem at
+a very high frequency can quickly swamp tmem so that it provides little benefit
+to a client that uses it less frequently.
+To reduce the possibility of this denial-of-service, limits can be
+specified via management tools that are enforced internally by tmem.
+On Xen, the &quot;xm tmem-set&quot; command
+can specify &quot;weight=&lt;weight&gt;&quot; or &quot;cap=&lt;cap&gt;&quot;
+for any client.  If weight is non-zero
+for a client and the current percentage of ephemeral pages in use by the client
+exceeds its share (as measured by the sum of weights of all clients), the next
+page chosen for eviction is selected from the requesting client's ephemeral
+queue, instead of the global ephemeral queue that contains pages from all
+clients.(See <i>client_over_quota().</i>)
+Setting a cap for a client is currently a no-op.
+<b><i>Shared pools and authentication.</i></b>
+When tmem was first proposed to the linux kernel mailing list
+(LKML), there was concern expressed about security of shared ephemeral
+pools.  The initial tmem implementation only
+required a client to provide a 128-bit UUID to identify a shared pool, and the
+linux-side tmem implementation obtained this UUID from the superblock of the
+shared filesystem (in ocfs2).  It was
+pointed out on LKML that the UUID was essentially a security key and any
+malicious domain that guessed it would have access to any data from the shared
+filesystem that found its way into tmem.
+Ocfs2 has only very limited security; it is assumed that anyone who can
+access the filesystem bits on the shared disk can mount the filesystem and use
+it.  But in a virtualized data center,
+higher isolation requirements may apply.
+As a result, a Xen boot option -- &quot;tmem_shared_auth&quot; -- was
+added.  The option defaults to disabled,
+but when it is enabled, management tools must explicitly authenticate (or may
+explicitly deny) shared pool access to any client.
+On Xen, this is done with the &quot;xm
+tmem-shared-auth&quot; command.
+<b><i>32-bit implementation</i>.</b>
+There was some effort put into getting tmem working on a 32-bit Xen.
+However, the Xen heap is limited in size on
+32-bit Xen so tmem did not work very well.
+There are still 32-bit ifdefs in some places in the code, but things may
+have bit-rotted so using tmem on a 32-bit Xen is not recommended.
+<b><i>IA-64 implementation. </i></b>
+The vast majority of the tmem
+implementation is architecture-independent.
+For tmem to run on Xen/ia64, it is believed that only one or two
+routines needs to be written.(See the
+#ifdef __ia64__ at <i>cli_mfn_to_va()</i>.)
+<h2>Known Issues</h2>
+<p><b><i>Fragmentation.</i></b>When tmem
+is active, all physically memory becomes <i>fragmented</i>
+into individual pages.  However, the Xen
+memory allocator allows memory to be requested in multi-page contiguous
+quantities, called order&gt;0 allocations.
+(e.g. 2<sup>order</sup> so
+order==4 is sixteen contiguous pages.)
+In some cases, a request for a larger order will fail gracefully if no
+matching contiguous allocation is available from Xen.
+As of Xen 4.0, however, there are several
+critical order&gt;0 allocation requests that do not fail gracefully.
+Notably, when a domain is created, and
+order==4 structure is required or the domain creation will fail.
+And shadow paging requires many order==2
+allocations; if these fail, a PV live-migration may fail.
+There are likely other such issues.
+But, fragmentation can occur even without tmem if any domU does
+any extensive ballooning; tmem just accelerates the fragmentation.
+So the fragmentation problem must be solved
+anyway.  The best solution is to disallow
+order&gt;0 allocations altogether in Xen -- or at least ensure that any attempt
+to allocate order&gt;0 can fail gracefully, e.g. by falling back to a sequence
+of single page allocations. However this restriction may require a major 
+in some of Xen's most sensitive code.
+(Note that order&gt;0 allocations during Xen boot and early in domain0
+launch are safe and, if dom0 does not enable tmem, any order&gt;0 allocation by
+dom0 is safe, until the first domU is created.)
+Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack
+was added in the Xen page
+allocator.(See the comment &quot;
+memory is scarce&quot; in <i>alloc_heap_pages()</i>.)
+Briefly, a portion of memory is pre-reserved
+for allocations where order&gt;0 and order&lt;9.
+(Domain creation uses 2MB pages, but fails
+gracefully, and there are no other known order==9 allocations or order&gt;9
+allocations currently in Xen.)
+<b><i>NUMA</i></b>.  Tmem assumes that
+all memory pages are equal and any RAM page can store a page of data for any
+client.  This has potential performance
+consequences in any NUMA machine where access to <i
+>far memory</i> is significantly slower than access to <i
+>near memory</i>.
+On nearly all of today's servers, however,
+access times to <i>far memory</i> is still
+much faster than access to disk or network-based storage, and tmem's primary 
+advantage comes from the fact that paging and swapping are reduced.
+So, the current tmem implementation ignores
+NUMA-ness; future tmem design for NUMA machines is an exercise left for the
+(needs work)<b style='mso-bidi-font-weight:>

Xen-changelog mailing list

<Prev in Thread] Current Thread [Next in Thread>
  • [Xen-changelog] [xen-unstable] docs: Add tmem documentation, Xen patchbot-unstable <=