* [RFC] dm-userspace
@ 2006-04-26 22:45 Dan Smith
From: Dan Smith @ 2006-04-26 22:45 UTC (permalink / raw)
To: linux-kernel; +Cc: device-mapper development
Xen needs to be able to directly access disk formats such as QEMU's
qcow, VMware's vmdk, and possibly others. Most of these formats are
based on copy-on-write ideas, and thus have a base image and a bunch
of modified blocks stored elsewhere. Presenting this to a virtual
machine transparently as a normal block device would be ideal. The
solution I propose is to use device-mapper for redirecting block
accesses to the appropriate locations within either the base image or
the COW space, with the following constraints:
1. The block-allocation algorithm and formatting scheme should not be
in the kernel. This gives the most flexibility and puts the
complexity in userspace.
2. Actual data flow should happen only in the kernel, and userspace
should be able to control it without the blocks being passed back
and forth.
So, I developed a generic device-mapper target called dm-userspace
which allows a userspace application to control the block mapping in a
mostly generic way. With the functionality it provides, I was able to
write a userspace daemon that handles the mapping of blocks such that
a qcow file could be presented as a single block device, mounted and
accessed as if it were a normal disk. If/when VMware releases their
vmdk spec under the GPL, adding support for it would be relatively
simple. This would give us a unified block device to export to the
virtual machine, that would be backed by a complex format such as vmdk
or qcow.
In addition to providing support for the above scenario, dm-userspace
could be used for other things as well. It's possible that new
device-mapper targets could be developed in userspace using a special
application that used dm-userspace to simulate the kernel
environment. Additionally, filesystem debuggers may be able to use
dm-userspace to provide interactive control and logging of disk
writes.
A patch against 2.6.16.9 to add dm-userspace to the kernel is
available here:
http://static.danplanet.com/dm-userspace/dmu-2.6.16.9.patch
After you have a patched kernel, you can build the (very tiny) helper
library and example program, available here:
http://static.danplanet.com/dm-userspace/libdmu-0.1.tar.gz
Comments would be appreciated :)
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
* Re: [dm-devel] [RFC] dm-userspace
From: Ming Zhang @ 2006-04-26 22:55 UTC (permalink / raw)
To: device-mapper development; +Cc: linux-kernel
Just curious: will speed be a problem here, considering that each access
needs to contact userspace to map a piece of data? And is the mapping
unit a single sector in dm?
Do you have any benchmark results on the overhead?
ming
On Wed, 2006-04-26 at 15:45 -0700, Dan Smith wrote:
> Xen needs to be able to directly access disk formats such as QEMU's
> qcow, VMware's vmdk, and possibly others. Most of these formats are
> based on copy-on-write ideas, and thus have a base image and a bunch
> of modified blocks stored elsewhere. Presenting this to a virtual
> machine transparently as a normal block device would be ideal. The
> solution I propose is to use device-mapper for redirecting block
> accesses to the appropriate locations within either the base image or
> the COW space, with the following constraints:
>
> 1. The block-allocation algorithm and formatting scheme should not be
> in the kernel. This gives the most flexibility and puts the
> complexity in userspace.
> 2. Actual data flow should happen only in the kernel, and userspace
> should be able to control it without the blocks being passed back
> and forth.
>
> So, I developed a generic device-mapper target called dm-userspace
> which allows a userspace application to control the block mapping in a
> mostly generic way. With the functionality it provides, I was able to
> write a userspace daemon that handles the mapping of blocks such that
> a qcow file could be presented as a single block device, mounted and
> accessed as if it were a normal disk. If/when VMware releases their
> vmdk spec under the GPL, adding support for it would be relatively
> simple. This would give us a unified block device to export to the
> virtual machine, that would be backed by a complex format such as vmdk
> or qcow.
>
> In addition to providing support for the above scenario, dm-userspace
> could be used for other things as well. It's possible that new
> device-mapper targets could be developed in userspace using a special
> application that used dm-userspace to simulate the kernel
> environment. Additionally, filesystem debuggers may be able to use
> dm-userspace to provide interactive control and logging of disk
> writes.
>
> A patch against 2.6.16.9 to add dm-userspace to the kernel is
> available here:
>
> http://static.danplanet.com/dm-userspace/dmu-2.6.16.9.patch
>
> After you have a patched kernel, you can build the (very tiny) helper
> library and example program, available here:
>
> http://static.danplanet.com/dm-userspace/libdmu-0.1.tar.gz
>
> Comments would be appreciated :)
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
* Re: [dm-devel] [RFC] dm-userspace
From: Dan Smith @ 2006-04-26 23:07 UTC (permalink / raw)
To: mingz; +Cc: device-mapper development, linux-kernel
MZ> Just curious: will speed be a problem here?
I'm glad you asked... :)
MZ> considering that each access needs to contact userspace to map a
MZ> piece of data.
Actually, that's not the case. The idea is for mappings to be cached
in the kernel module so that the communication with userspace only
needs to happen once per block. The thought is to ask once for a
read, and then remember that mapping until a write happens, which
might change the story. If so, we ask userspace again.
Right now, the kernel module expires mappings in a pretty brain-dead
way to make sure the list doesn't get too long. An intelligent data
structure and expiration method would probably improve performance
quite a bit.
I don't have any benchmark data to post right now. I did some quick
analysis a while back and found it to be not too bad. When using loop
devices as a backing store, I achieved performance as high as a little
under 50% of native.
MZ> And is the mapping unit a single sector in dm?
Well, for qcow it is a sector, yes. The module itself, however, can
use any block size (as long as it is a multiple of a sector). Before
I started work on qcow support, I wrote a test application that used
2MiB blocks, which is where I got the approximately 50% performance
value I described above.
Our thought is that this would mostly be used for the OS images of
virtual machines, which shouldn't change much, which would help to
prevent constantly asking userspace to map blocks.
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
* Re: [dm-devel] [RFC] dm-userspace
From: Ming Zhang @ 2006-04-26 23:41 UTC (permalink / raw)
To: Dan Smith; +Cc: device-mapper development, linux-kernel
On Wed, 2006-04-26 at 16:07 -0700, Dan Smith wrote:
> MZ> just curious, will the speed be a problem here?
>
> I'm glad you asked... :)
>
> MZ> considering each time it needs to contact user space for mapping a
> MZ> piece of data.
>
> Actually, that's not the case. The idea is for mappings to be cached
> in the kernel module so that the communication with userspace only
> needs to happen once per block. The thought is to ask once for a
> read, and then remember that mapping until a write happens, which
> might change the story. If so, we ask userspace again.
Sounds reasonable; I see the caching now.
>
> Right now, the kernel module expires mappings in a pretty brain-dead
> way to make sure the list doesn't get too long. An intelligent data
> structure and expiration method would probably improve performance
> quite a bit.
>
> I don't have any benchmark data to post right now. I did some quick
> analysis a while back and found it to be not too bad. When using loop
> devices as a backing store, I achieved performance as high as a little
> under 50% of native.
o. :P 50% is a considerable amount. anyway, good start. ;)
>
> MZ> and the size unit is per sector in dm?
>
> Well, for qcow it is a sector, yes. The module itself, however, can
> use any block size (as long as it is a multiple of a sector). Before
> I started work on qcow support, I wrote a test application that used
> 2MiB blocks, which is where I got the approximately 50% performance
> value I described above.
Pure reads, or mixed reads and writes?
>
> Our thought is that this would mostly be used for the OS images of
> virtual machines, which shouldn't change much, which would help to
> prevent constantly asking userspace to map blocks.
>
If this is the scenario, then maybe more aggressive mapping can be used
here.
You might be interested in this: some developers are working on a
general SCSI target layer that passes SCSI CDBs to userspace for
processing while keeping the data transfer in kernel space. Both of you
will meet the same overhead here, so the two projects might learn from
each other.
PS, a trivial thing: userspace_request is allocated frequently and could
use a slab cache.
ming
* Re: [dm-devel] [RFC] dm-userspace
From: Dan Smith @ 2006-04-27 2:22 UTC (permalink / raw)
To: mingz; +Cc: device-mapper development, linux-kernel
MZ> o. :P 50% is a considerable amount. anyway, good start. ;)
Indeed, it is a considerable performance hit, but I haven't really
done much in the way of a serious performance analysis.
MZ> Pure reads, or mixed reads and writes?
Actually IIRC, that was the write performance only (I used bonnie++ to
get the numbers). I believe the read performance is generally good
for large blocks. If the block is already mapped for write, then you
get the reads for free. I really should resurrect my older tests and
see if I can produce something more current :)
My previous numbers were gathered by using an additional step of
actually rewriting the device-mapper table periodically, using
dm-linear to statically map blocks that were mapped for writing. I
think that with a better data structure in dm-userspace (i.e. better
than a linked-list), performance will be better without the need to
constantly suspend and resume the device to change tables.
MZ> If this is the scenario, then maybe more aggressive mapping can
MZ> be used here.
Right, so the userspace side may be able to improve performance by
mapping blocks in advance. If it is believed that the next several
blocks will be written to sequentially, the userspace app can push
mappings for those in the same message as the response to the initial
block, which would eliminate several additional requests.
Perhaps something could be done with certain CoW formats that would
allow the userspace app to push a bunch of mappings that it believes
might be needed, and then have the kernel report back later which were
actually used. In that case, you could reclaim space in the CoW
device that you incorrectly predicted would be needed.
MZ> You might be interested in this: some developers are working on a
MZ> general SCSI target layer that passes SCSI CDBs to userspace for
MZ> processing while keeping the data transfer in kernel space. Both
MZ> of you will meet the same overhead here, so the two projects might
MZ> learn from each other.
Great!
MZ> PS, a trivial thing: userspace_request is allocated frequently
MZ> and could use a slab cache.
Ah, ok, good point... thanks ;)
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
* Re: [dm-devel] [RFC] dm-userspace
From: Ming Zhang @ 2006-04-27 13:09 UTC (permalink / raw)
To: Dan Smith; +Cc: device-mapper development, linux-kernel
On Wed, 2006-04-26 at 19:22 -0700, Dan Smith wrote:
> MZ> o. :P 50% is a considerable amount. anyway, good start. ;)
>
> Indeed, it is a considerable performance hit, but I haven't really
> done much in the way of a serious performance analysis.
>
> MZ> pure read or read and write mixed?
>
> Actually IIRC, that was the write performance only (I used bonnie++ to
> get the numbers). I believe the read performance is generally good
> for large blocks. If the block is already mapped for write, then you
> get the reads for free. I really should resurrect my older tests and
> see if I can produce something more current :)
Yes, considering you load a mapping for every 2MB data block, it should
be close to dm-linear for sequential reads.
>
> My previous numbers were gathered by using an additional step of
> actually rewriting the device-mapper table periodically, using
> dm-linear to statically map blocks that were mapped for writing. I
> think that with a better data structure in dm-userspace (i.e. better
> than a linked-list), performance will be better without the need to
> constantly suspend and resume the device to change tables.
I see; sounds reasonable.
>
> MZ> if this is the scenario, then may be more aggressive mapping can
> MZ> be used here.
>
> Right, so the userspace side may be able to improve performance by
> mapping blocks in advance. If it is believed that the next several
> blocks will be written to sequentially, the userspace app can push
> mappings for those in the same message as the response to the initial
> block, which would eliminate several additional requests.
This is like prefetching the mapping information.
>
> Perhaps something could be done with certain CoW formats that would
> allow the userspace app to push a bunch of mappings that it believes
> might be needed, and then have the kernel report back later which were
> actually used. In that case, you could reclaim space in the CoW
> device that you incorrectly predicted would be needed.
Right, and I think this might be independent of the CoW format; it
depends solely on the userspace mapping logic doing intentional
allocation, tracking, and cleanup.
>
> MZ> u might have interest on this. some developers are working on a
> MZ> general scsi target layer that pass scsi cdb to user space for
> MZ> processing while keep data transfer in kernel space. so both of u
> MZ> will meet same overhead here. so 2 projects might learn from each
> MZ> other on this.
>
> Great!
The project name is stgt; you can find it at berlios.de, which is down
right now. :P
>
> MZ> ps, trivial thing, the userspace_request is frequently used and
> MZ> can use a slab cache.
>
> Ah, ok, good point... thanks ;)
>
* Re: [dm-devel] [RFC] dm-userspace
From: Dan Smith @ 2006-05-09 23:02 UTC (permalink / raw)
To: mingz; +Cc: device-mapper development, linux-kernel, Xen Developers
(I'm including the xen-devel list on this, as things are starting to
get interesting).
MZ> Do you have any benchmark results on the overhead?
So, I've spent some time over the last week working to improve
performance and collect some benchmark data.
I moved to using slab caches for the request and remap objects, which
helped a little. I also added a poll() method to the control device,
which improved performance significantly. Finally, I changed the
internal remap storage data structure to a hash table, which had a
very large performance impact (about 8x).
Copying data to a device backed by dm-userspace presents a worst-case
scenario, especially with a small block-size like what qcow uses. In
one of my tests, I copy about 20MB of data to a dm-userspace device,
backed by files hooked up to the loopback driver. I compare this with
a "control" of a single loop-mounted image file (i.e., without
dm-userspace or CoW). I measured the time to mount, copy, and unmount
the device, which (with the recent performance improvements) are
approximately:
Normal Loop: 1 second
dm-userspace/qcow: 10 seconds
For comparison, before adding poll() and the hash table, the
dm-userspace number was over 70 seconds.
One of the most interesting cases for us, however, is providing a
CoW-based VM disk image, which is mostly used for reading, with a
small amount of writing for configuration data. To test this, I used
Xen to compare a fresh FC4 boot (firstboot, where things like SSH keys
are generated and written to disk) that used an LVM volume as root to
using dm-userspace (and loopback-files) as the root. The numbers are
approximately:
LVM root: 26 seconds
dm-userspace/qcow: 27 seconds
Note that this does not yet include any read-ahead type behavior, nor
does it include priming the kernel module with remaps at create-time
(which results in a few initial compulsory "misses"). Also, I removed
the remap expiration functionality while adding the hash table and
have not yet added it back, so that may further improve performance
for large amounts of remaps (and bucket collisions).
Here is a link to a patch against 2.6.16.14:
http://static.danplanet.com/dm-userspace/dmu-2.6.16.14-patch
Here are links to the userspace library, as well as the cow daemon,
which provides qcow support:
http://static.danplanet.com/dm-userspace/libdmu-0.2.tar.gz
http://static.danplanet.com/dm-userspace/cowd-0.1.tar.gz
(Note that the daemon is still rather rough, and the qcow
implementation has some bugs. However, it works for light testing and
the occasional luck-assisted heavy testing)
As always, comments welcome and appreciated :)
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
* Re: [dm-devel] [RFC] dm-userspace
From: Ming Zhang @ 2006-05-10 13:27 UTC (permalink / raw)
To: Dan Smith; +Cc: device-mapper development, linux-kernel, Xen Developers
On Tue, 2006-05-09 at 16:02 -0700, Dan Smith wrote:
> (I'm including the xen-devel list on this, as things are starting to
> get interesting).
>
> MZ> do u have any benchmark results about overhead?
>
> So, I've spent some time over the last week working to improve
> performance and collect some benchmark data.
>
> I moved to using slab caches for the request and remap objects, which
> helped a little. I also added a poll() method to the control device,
> which improved performance significantly. Finally, I changed the
> internal remap storage data structure to a hash table, which had a
> very large performance impact (about 8x).
Why is a poll() needed here? (Asking a dumb question.)
This is interesting. Have you ever checked the average lookup path
length with the single queue versus the hash table? Improving by 8x is
quite impressive.
>
> Copying data to a device backed by dm-userspace presents a worst-case
> scenario, especially with a small block-size like what qcow uses. In
> one of my tests, I copy about 20MB of data to a dm-userspace device,
> backed by files hooked up to the loopback driver. I compare this with
> a "control" of a single loop-mounted image file (i.e., without
> dm-userspace or CoW). I measured the time to mount, copy, and unmount
> the device, which (with the recent performance improvements) are
> approximately:
>
> Normal Loop: 1 second
> dm-userspace/qcow: 10 seconds
>
> For comparison, before adding poll() and the hash table, the
> dm-userspace number was over 70 seconds.
nice improvement!
>
> One of the most interesting cases for us, however, is providing a
> CoW-based VM disk image, which is mostly used for reading, with a
> small amount of writing for configuration data. To test this, I used
> Xen to compare a fresh FC4 boot (firstboot, where things like SSH keys
> are generated and written to disk) that used an LVM volume as root to
> using dm-userspace (and loopback-files) as the root. The numbers are
> approximately:
>
> LVM root: 26 seconds
> dm-userspace/qcow: 27 seconds
This is quite impressive. I think the application takes most of the
time, and some of it overlaps with I/O; with so little I/O here, this
small difference is about what you can get. I think this will be very
helpful for diskless SAN boot.
>
> Note that this does not yet include any read-ahead type behavior, nor
> does it include priming the kernel module with remaps at create-time
> (which results in a few initial compulsory "misses"). Also, I removed
> the remap expiration functionality while adding the hash table and
> have not yet added it back, so that may further improve performance
> for large amounts of remaps (and bucket collisions).
>
> Here is a link to a patch against 2.6.16.14:
>
> http://static.danplanet.com/dm-userspace/dmu-2.6.16.14-patch
>
> Here are links to the userspace library, as well as the cow daemon,
> which provides qcow support:
>
> http://static.danplanet.com/dm-userspace/libdmu-0.2.tar.gz
> http://static.danplanet.com/dm-userspace/cowd-0.1.tar.gz
>
> (Note that the daemon is still rather rough, and the qcow
> implementation has some bugs. However, it works for light testing and
> the occasional luck-assisted heavy testing)
>
> As always, comments welcome and appreciated :)
>