AW> Okay -- I think there may have been a disconnect on what the
AW> assumptions driving dscow's design were. Based on your clarifying
AW> emails these seem to be that an administrator has a block device
AW> that they want to apply cow to, and that they have oodles of
AW> space. They'll just hard-allocate a second block device of the
AW> same size as the original on a per-cow basis, and use this (plus
AW> the bitmap header) to write updates into.
Correct. Again, let me reiterate that I am not claiming that dscow is
the best format for anything other than a few small situations that
we are currently targeting :)
We definitely want to eventually develop a sparse allocation method
that will allow us to take a big block store and carve it up (like LVM
does) for on-demand cow volumes.
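To make the above concrete, here is a rough sketch (in C, with names I
made up; this is not the actual dscow on-disk format) of the kind of
layout being described: a header with a per-block bitmap, plus a cow
device the same size as the origin, so a set bit simply means "this
block has been written and now lives in the cow device at the same
offset":

#include <stdint.h>

#define BLOCK_SHIFT 12                  /* assume 4 KiB blocks */

struct cow_header {                     /* hypothetical header layout */
        uint32_t magic;
        uint32_t version;
        uint64_t nr_blocks;             /* size of the origin in blocks */
        /* followed on disk by nr_blocks bits of "copied" bitmap */
};

/* A set bit means the block lives in the cow device, same offset. */
static inline int block_is_copied(const uint8_t *bitmap, uint64_t blk)
{
        return bitmap[blk >> 3] & (1 << (blk & 7));
}

/*
 * Reads go to the cow device if the bit is set, otherwise to the
 * read-only origin.  Writes always go to the cow device and then set
 * (and persist) the bit -- no block relocation is needed because the
 * cow device is hard-allocated at full size.
 */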
AW> All of the CoW formats that I've seen use some form of
AW> pagetable-style lookup hierarchy to represent sparseness,
AW> frequently a combination of a lookup tree and leaf bitmaps -- your
AW> scheme is just the extreme of this... a zero-level tree.
Indeed, many use a pagetable approach. Development of the above idea
would definitely require it. FWIW, I believe cowloop uses a format
similar to dscow.
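For comparison, the pagetable-style sparse approach might look roughly
like this (again just an illustrative sketch, not cowloop's or any other
real format): a top-level table of leaf pointers, where unallocated
leaves stay NULL, which is what gives you the sparseness:

#include <stdint.h>

#define LEAF_BITS   10
#define LEAF_SIZE   (1u << LEAF_BITS)
#define LEAF_MASK   (LEAF_SIZE - 1)

struct sparse_map {
        uint64_t **leaves;              /* leaves[i] is NULL until used */
        uint64_t nr_leaves;
};

/* Return the allocated block for a virtual block, or 0 if unmapped. */
static uint64_t sparse_lookup(struct sparse_map *m, uint64_t vblk)
{
        uint64_t *leaf;

        if ((vblk >> LEAF_BITS) >= m->nr_leaves)
                return 0;
        leaf = m->leaves[vblk >> LEAF_BITS];
        return leaf ? leaf[vblk & LEAF_MASK] : 0;
}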
AW> It seems like a possibly useful thing to have in some environments
AW> to use as a fast-image-copy operation, although it would be cool
AW> to have something that ran in the background and lazily copied all
AW> the other blocks over and eventually resulted in a fully linear
AW> disk image.
Yes, I have discussed this on-list a few times, in reference to
live-copy of LVMs and building a local version of a network-accessible
image, such as an nbd device.
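The background copier in that scheme would basically be a walk over the
"copied" bitmap; something along these lines (purely a sketch, with
error handling, bitmap persistence, and locking against foreground
writes omitted):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SHIFT 12                  /* assume 4 KiB blocks */

static void lazy_copy(int origin_fd, int local_fd, uint8_t *bitmap,
                      uint64_t nr_blocks)
{
        char buf[1 << BLOCK_SHIFT];
        uint64_t blk;

        for (blk = 0; blk < nr_blocks; blk++) {
                if (bitmap[blk >> 3] & (1 << (blk & 7)))
                        continue;       /* already present locally */
                pread(origin_fd, buf, sizeof(buf),
                      (off_t)blk << BLOCK_SHIFT);
                pwrite(local_fd, buf, sizeof(buf),
                       (off_t)blk << BLOCK_SHIFT);
                bitmap[blk >> 3] |= 1 << (blk & 7);
        }
        /* once every bit is set, the local device is a full linear image */
}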
AW> Perhaps you'd consider adding that and porting the format as a
AW> plugin to the blktap tools as well? ;)
I do not really see the direct value of that. If the functionality
exists with dm-userspace and cowd, then dm-userspace could be used to
slowly build the image, while blktap could provide access to that
image for a domain (in direct mode, as Julian pointed out).
Building the functionality into dm-userspace would allow it to be
generally applicable to vanilla Linux systems. Why build it into a
Xen-specific component?
AW> Yes -- we've seen comments from users who are very pleased with
AW> the better-than-disk write throughput that they achieve with the
AW> loopback driver ;) -- basically the same effect of letting the
AW> buffer cache step in and play things a little more "fast and
AW> loose".
Heh, right. I was actually talking about increased performance
against a block device. However, for this kind of transient domain
model, a file will work as well.
AW> 1. You are allocating a per-disk mapping cache in the driver. Do
AW> you have any sense of how big this needs to be to be useful for
AW> common workloads?
My latest version (which we will post soon) puts a cap on the number
of remaps each device can maintain. Changing from a 4096-map limit to
a 16384-map limit makes some difference, but it does not appear to be
significant. We will post concrete numbers when we send the latest
version.
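For illustration, the cap works roughly like this (a sketch of the idea
only; the actual dm-userspace code and its eviction policy may differ,
and only the 4096 figure comes from the numbers above):

#include <stdint.h>

#define REMAP_LIMIT 4096                /* per-device cap on cached remaps */

struct remap {
        uint64_t org_blk;               /* original block */
        uint64_t new_blk;               /* remapped block */
        uint64_t last_use;              /* for LRU eviction */
        int      valid;
};

static struct remap cache[REMAP_LIMIT];
static uint64_t tick;

/* Insert a mapping, evicting the least-recently-used entry if full. */
static void remap_insert(uint64_t org_blk, uint64_t new_blk)
{
        struct remap *victim = &cache[0];
        int i;

        for (i = 0; i < REMAP_LIMIT; i++) {
                if (!cache[i].valid) {
                        victim = &cache[i];
                        break;
                }
                if (cache[i].last_use < victim->last_use)
                        victim = &cache[i];
        }
        victim->org_blk = org_blk;
        victim->new_blk = new_blk;
        victim->last_use = ++tick;
        victim->valid = 1;
}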
AW> Generally speaking, this seems like a strange thing to add -- I
AW> understand the desire for an in-kernel cache to avoid context
AW> switches, but why make it implicit and LRU.
Well, if you do not keep that kind of data in the kernel, I think
performance would suffer significantly. The idea here is to have, at
steady-state, a block device that behaves almost exactly like a
device-mapper device (read: LVM) does right now. All block
redirections happen in-kernel. Remember that the userspace side can
invalidate any mapping cached in the kernel at any time. If userspace
wanted to do cache management, it could do so. I have also discussed
the possibility of feeding some coarse statistics back to userspace so
it can make more informed decisions.
I would not say that the caching is implicit. If you set the
DMU_FLAG_TEMPORARY bit on a response, the kernel will not remember the
mapping and thus will fault the next access back to userspace again.
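In other words, something like the following on the userspace side (the
struct layout and helper here are hypothetical; only the
DMU_FLAG_TEMPORARY name comes from the actual interface):

#include <stdint.h>

#define DMU_FLAG_TEMPORARY  (1 << 0)    /* "do not cache this remap" */

struct dmu_reply {                      /* hypothetical layout */
        uint64_t id;                    /* request being answered */
        uint64_t new_block;             /* where to redirect the I/O */
        uint32_t flags;
};

/* Answer a fault without letting the kernel remember the mapping. */
static void answer_once(struct dmu_reply *r, uint64_t id, uint64_t blk)
{
        r->id = id;
        r->new_block = blk;
        r->flags = DMU_FLAG_TEMPORARY;
}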
AW> Wouldn't it simplify the code considerably to allow the userspace
AW> stuff to manage the size and contents of the cache so that they
AW> can do replacement based on their knowledge of the block layout?
I am not sure why this would be much better than letting the kernel
manage it. The kernel knows two things that userspace does not:
low-memory pressure and access statistics. I do not see why it would
make sense to have the kernel collect and communicate access
statistics for each block to userspace and then rely on it to evict
unused mappings. Further, the kernel can run without the userspace
component if no unmapped blocks are accessed. This allows a restart
or upgrade of the userspace component without disturbing the device.
It is entirely possible that I do not understand your point, so feel
free to correct me :)
AW> 2. There are a heck of a lot of spin locks in that driver. Did
AW> you run into a lot of stability problems that led to aggressively
AW> conservative locking?
I think I have said this before, but: no performance analysis has been
done on dm-userspace to identify areas of contention. The use of
spinlocks was the best way (for me) to get things working and stable.
Most of the spinlocks are used to protect linked lists, which I think
is a reasonable use for them.
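For anyone unfamiliar with the pattern, it is the usual
spinlock-around-a-list arrangement, e.g. (a generic example, not the
actual driver code):

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct remap_entry {
        struct list_head list;
        u64 org_block;
        u64 new_block;
};

static LIST_HEAD(remap_list);
static DEFINE_SPINLOCK(remap_lock);

/* Add a cached remap; the lock keeps the list sane against I/O paths. */
static int remap_add(u64 org, u64 new_blk)
{
        struct remap_entry *e = kmalloc(sizeof(*e), GFP_NOIO);

        if (!e)
                return -ENOMEM;
        e->org_block = org;
        e->new_block = new_blk;

        spin_lock(&remap_lock);
        list_add_tail(&e->list, &remap_list);
        spin_unlock(&remap_lock);
        return 0;
}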
I should also point out that the structure of the entire thing has been
a moving target up until recently. It is definitely possible that some
of the places where a spinlock is used could be refactored under the
current model.
AW> Would removing the cache and/or switching some of the locks to
AW> refcounts simplify things at all?
Moving to something other than spinlocks for a few of the data
structures may be possible; we can investigate and post some numbers
on the next go-round.
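The refcount alternative would look something like this (again a
generic sketch, not a patch against dm-userspace): entries carry a kref
so their lifetime is managed by get/put rather than by holding a lock
across users:

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/types.h>

struct remap_ref {
        struct kref ref;                /* kref_init() on allocation */
        u64 org_block;
        u64 new_block;
};

static void remap_release(struct kref *ref)
{
        kfree(container_of(ref, struct remap_ref, ref));
}

/* Callers take a reference while using an entry and drop it when done. */
static inline void remap_get(struct remap_ref *e)
{
        kref_get(&e->ref);
}

static inline void remap_put(struct remap_ref *e)
{
        kref_put(&e->ref, remap_release);
}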
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@xxxxxxxxxx