> From: Joanna Rutkowska [mailto:joanna@xxxxxxxxxxxxxxxxxxxxxx]
> Subject: Re: Q about System-wide Memory Management Strategies
> On 08/03/10 01:57, Dan Magenheimer wrote:
> > Hi Joanna --
> > The slides you refer to are over two years old, and there's
> > been a lot of progress in this area since then. I suggest
> > you google for "Transcendent Memory" and especially
> > my presentation at the most recent Xen Summit North America
> > and/or http://oss.oracle.com/projects/tmem
> Thanks Dan. I've been aware of tmem, but I've been skeptical about it
> for two reasons: it's complex, and seems rather unportable to other
> OSes, specifically Windows, which is a concern for us, as we plan to
> support Windows AppVMs in the future in Qubes.
Thanks for the comments and review. It's definitely complex.
If it were easy, the problem would have been solved long ago. :-)
> (Hmm, is it really unportable? Perhaps one could create a
> pseudo-filesystem driver that would behave like precache, and a
> pseudo-disk driver that would behave like preswap?)
I know nothing about Windows drivers. I think tmem could
definitely be implemented on Windows, with source code changes
("enlightenments"). It could probably be implemented in drivers
but would likely lose a lot of its value and take a performance
> From reading the papers on tmem (the hogs were really cute :), I
> understand now that the single most important advantage of using tmem
> vs. just-ballooning is: no memory inertia for needy VMs, correct? I'm
> tempted to think that this might not be such a big deal for the
> Qubes-specific types of workload -- after all, if some apps start
> slowing down, the user will temporarily stop "operating" them, and let
> the system recover within a few seconds, when the balloon will return
> some more memory. Or am I wrong here, and the recovery is not so easy
> in practice?
If you have a perfect "directed ballooning" daemon in dom0 that
can correctly predict the future, moving memory that won't be
needed (in the future) by guest A to guest B (that does need
it real soon now), neither self-ballooning nor tmem is necessary.
Sadly, crystal balls are hard to come by, even for one single
guest. And when you are dealing with multiple dynamically-changing
guests, you quickly get to a bin-packing problem (which I am
pretty sure is NP-complete).
One partial solution is to "pad" the amount of memory given
to each guest, but then you are trying to predict how much
padding is needed... also unguessable.
My 2008 solution was to "aggressively" take memory away from each
guest to approach a knowable per-guest target (which can be done
from dom0 via xenstore or in the guest itself). But this
sometimes/frequently causes the same problems as just giving
each guest less memory to start with, including not only performance
issues like heavy paging and swapping, but also bad things like
OOMs and swapstorms.
IMHO, this is sometimes "not so easy to recover from in practice".
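As a rough illustration of the "knowable per-guest target" idea, the
sketch below derives a balloon target from the Committed_AS field of
/proc/meminfo. The function names, the floor, and the optional cache
padding are assumptions made for this example only; this is not the
actual selfballooning code.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def balloon_target_kb(meminfo, floor_kb, memmax_kb, cache_pad_kb=0):
    """Hypothetical policy: aim for Committed_AS plus optional padding,
    clamped between a safety floor and the guest's memmax."""
    goal = meminfo["Committed_AS"] + cache_pad_kb
    return max(floor_kb, min(goal, memmax_kb))


# Made-up sample values; in a guest you would read /proc/meminfo instead.
SAMPLE = """MemTotal:        1048576 kB
MemFree:          524288 kB
Committed_AS:     300000 kB
HugePages_Total:       0
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(balloon_target_kb(info, floor_kb=131072, memmax_kb=1048576))
```

The floor stands in for the guess about how low a guest can safely go,
which is exactly the part that is hard to get right.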
Tmem is designed to complement aggressive ballooning (regardless
of where the ballooning decisions are made) by reducing or
eliminating the problems that result from it and, at the same
time, reducing "memory inertia" so that a large amount of memory
can be quickly moved to where it is most needed (including,
when necessary, launching or migrate-receiving more guests).
> > Specifically, I now have "selfballooning" built into
> > the guest kernel...
> In your latest presentation you mention selfballooning implemented in
> kernel, rather than via a userland daemon -- any significant benefit of
> this? I've been thinking of trying selfballooning using 2.6.34-xenlinux
> kernel with usermode balloond...
It's all a question of response time. If the policy/mechanism
is in dom0, it's difficult to react quickly enough to one guest,
let alone "many". If the policy/mechanism is in the guest but
in userland, well, sometimes user processes don't get much
attention (other than being gratuitously killed) when the kernel
is under memory pressure.
So, since tmem requires kernel changes anyway, I moved the
selfballooning policy into the Xen balloon driver, with a lot
of tunables in sysfs that can be tweaked.
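To make the role of such tunables concrete, here is a minimal sketch of
one periodic balloon step that closes only part of the gap to the
target, shrinking slowly and growing quickly. The parameter names and
default values are hypothetical and chosen for illustration; they are
not claimed to match the real sysfs tunables.

```python
def next_balloon_size_kb(current_kb, target_kb,
                         down_hysteresis=8, up_hysteresis=1):
    """One step toward the target: the gap is divided by a hysteresis
    factor, so a larger factor means a slower, smoother change."""
    gap = target_kb - current_kb
    divisor = up_hysteresis if gap > 0 else down_hysteresis
    # int() truncates toward zero, so the step never overshoots the target.
    return current_kb + int(gap / divisor)


def simulate(current_kb, target_kb, intervals):
    """Return the sequence of sizes over a number of sampling intervals."""
    sizes = []
    for _ in range(intervals):
        current_kb = next_balloon_size_kb(current_kb, target_kb)
        sizes.append(current_kb)
    return sizes


if __name__ == "__main__":
    # Shrinking from 1,000,000 kB toward 200,000 kB takes many intervals...
    print(simulate(1_000_000, 200_000, 5))
    # ...while growing back happens in a single step with these defaults.
    print(simulate(200_000, 1_000_000, 1))
```

The asymmetry is the design point: giving memory back to a needy guest
should be fast, while taking it away can afford to be gradual.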
> How to initially provision the VMs in selfballooning, i.e. how to set
> mem and memmax? I'm tempted to set memmax to the amount of all physical
> memory minus memory reserved for Dom0, and other service VMs (which
> would get fixed, small, amount). The rationale behind this is that we
> don't know what type of tasks the user will end up doing in any given
> VM, and she might very well end up with something reaaally memory-
> (sure, we will not let any other VMs run at the same time in that
> case, but we should still be able to handle this I think).
Memmax for each guest can be essentially unlimited, since Xen reserves
its memory and dom0 memory. Only the ballooning policy cares.
But in practice, I think users think "physical", e.g. how much RAM
does this physical machine need, so they tend to prefer to think about
memory as one single value. As a result, everything should work
properly when mem=memmax.
> > I don't see direct ballooning as feasible (certainly without other
> > guest changes such as cleancache and frontswap).
> Why is that? Intuitively it sounds like the most straightforward
> solution -- only Dom0 can see the system-wide picture of all the VM
> needs (and priorities).
It is straightforward. And it will work most of the time
for many workloads. But it responds too slowly for many
> What happens if too many guests request too much memory, i.e.
> within their maxmem limits, but such that the overall total exceeds the
> total available in the system? I guess then whoever was first and lucky
> would get the memory, but the last ones would get nothing, right? While
> if we had centrally-managed allocation, we would be able to e.g. scale
> down the target memory sizes equally, or tell the user that some VMs
> must be closed for smooth operation of the others (or close them
"First and lucky" creates problems when all the guests are
happy to absorb as much memory as you give them.
Tmem has some built-in policy to avoid the worst of this and
some tool-specifiable parameters to optionally enforce load
balancing with prioritization.
But if, in your product environment, users can just be told to
shut down a VM, sure, that's a good solution.
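For the centrally managed alternative raised in the question, the
following sketch scales requests down proportionally when they exceed
what is available, and signals that a VM must be closed when even the
minimums do not fit. This is not tmem's built-in policy; the function
name and the per-VM minimums are assumptions made for the example.

```python
def scale_targets(requests_kb, minimums_kb, available_kb):
    """Return per-VM targets, or None if even the minimums do not fit.

    Each VM keeps its minimum; only the memory above the minimum is
    scaled down, by the same factor for every VM.
    """
    wants = {vm: max(requests_kb[vm], minimums_kb[vm]) for vm in requests_kb}
    if sum(wants.values()) <= available_kb:
        return wants  # everybody gets what they asked for
    total_min = sum(minimums_kb[vm] for vm in wants)
    if total_min > available_kb:
        return None  # caller should ask the user to close a VM
    spare = available_kb - total_min
    excess = sum(wants[vm] - minimums_kb[vm] for vm in wants)
    return {
        vm: minimums_kb[vm] + (wants[vm] - minimums_kb[vm]) * spare // excess
        for vm in wants
    }


if __name__ == "__main__":
    requests = {"work": 2000, "browser": 1000}
    minimums = {"work": 500, "browser": 500}
    print(scale_targets(requests, minimums, available_kb=2000))
    print(scale_targets(requests, minimums, available_kb=800))
```

A prioritized variant could weight each VM's share of the spare memory,
but choosing those weights is a policy decision the sketch leaves open.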
> > Anyway, I have limited availability in the next couple of
> > weeks but would love to talk (or email) more about
> > this topic after that (but would welcome clarification
> > questions in the meantime).
> No problem. Hopefully some of the above questions would fall into the
> "clarification" category :) And maybe others will answer the others :)
Since this topic is near and dear to me (having spent the
better part of the last two years on it), I tend to get
long-winded in my answers... and procrastinate on other things
that are higher priority :-(
> > Dan
> >> -----Original Message-----
> >> From: Joanna Rutkowska [mailto:joanna@xxxxxxxxxxxxxxxxxxxxxx]
> >> Sent: Monday, August 02, 2010 3:39 PM
> >> To: xen-devel@xxxxxxxxxxxxxxxxxxx; Dan Magenheimer
> >> Cc: qubes-devel@xxxxxxxxxxxxxxxx
> >> Subject: Q about System-wide Memory Management Strategies
> >> Dan, Xen.org'ers,
> >> I have a few questions regarding strategies for optimal memory
> >> assignment among VMs (PV DomU and Dom0, all Linux-based).
> >> We've been thinking about implementing a "Direct Ballooning"
> >> (as described on slide #20 in Dan's slides ), i.e. to write a
> >> that would be running in Dom0 and, based on the statistics provided
> >> balloond daemons running in DomUs, to adjust memory assigned to all
> >> in the system (via xm mem-set).
> >> Rather than trying to maximize the number of VMs we could run at the
> >> same time, in Qubes OS we are more interested in optimizing user
> >> experience for running "reasonable number" of VMs (i.e.
> >> minimizing/eliminating swapping). In other words, given the number
> >> VMs that the user feels the need to run at the same time (in
> >> usually between 3-6), and given the amount of RAM in the system (4-6
> >> in practice today), how to optimally distribute it among the VMs? In our
> >> model we assume the disk backend(s) are in Dom0.
> >> Some specific questions:
> >> 1) What is the best estimator of the "ideal" amount of RAM each VM would
> >> like to have? Dan mentions the Committed_AS value from /proc/meminfo,
> >> but what about the fs cache? I would expect that we should (ideally)
> >> allocate Committed_AS + some_cache amount of RAM, no?
> >> 2) What's the best estimator for "minimal reasonable" amount of RAM
> >> VM (below which the swapping would kill the performance for good)?
> >> rationale behind this, is that if we couldn't allocate "ideal"
> >> of
> >> RAM (point 1 above), then we would be scaling the available RAM
> >> until this "reasonable minimum" value. Below this, we would display
> >> message to the user that they should close some VMs (or will close
> >> "inactive" one automatically), and also we would refuse to start any new
> >> AppVMs.
> >> 3) Assuming we have enough RAM to satisfy all the VMs' "ideal" requests,
> >> what should we do with the excess RAM -- options are:
> >> a) distribute among all the VMs (more per-VM RAM, means larger FS
> >> caches, means faster I/O), or
> >> b) assign it to Dom0, where the disk backend is running (larger FS cache
> >> means faster disk backends, means faster I/O in each VM?)
> >> Thanks,
> >> joanna.
> >> 
> >> http://www.xen.org/files/xensummitboston08/MemoryOvercommit-XenSummit2008.pdf