* [PATCH] Blktap: Userspace file-based image support. (RFC)
From: Andrew Warfield @ 2006-06-19 16:19 UTC (permalink / raw)
To: Xen Developers; +Cc: Julian Chesterfield
Attached to this email is a patch containing the (new and improved)
blktap Linux driver and associated userspace tools for Xen. In
addition to being more flavourful, containing half the fat, and
removing stains twice as well as the old driver, this stuff adds a
userspace block backend and lets you use raw (without loopback), qcow,
and vmdk-based image files for your domUs. There's also a fun little
driver that provides a shared-memory block device which, in
combination with OCFS2, represents a cheap-and-cheerful fast shared
filesystem between multiple domUs.
This code has been (somewhat lackadaisically) developed over the past
few years at Cambridge and has recently enjoyed massive improvements
thanks to the considerable efforts of Julian Chesterfield.
The code "works for us" and has been tested on a grand total of about
three machines. We would love to have feedback from a broader
audience, in terms of both trying out the tools and inspecting the code.
We plan to release new patches at about 1-week intervals based on
comments.
Performance is quite good, and we intend to focus on this a bit more
over the next few weeks, releasing updated patches as they are
available. Bonnie results this morning are as follows (64-bit results
compare against linux blkback+loopback file, Julian can follow up with
loopback results for 32-bit later if anyone's interested):
              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
64-bit:
xen0     4096 40115 93.4 41067 12.7 22757  1.2 32532 56.7 53724  0.4 121.4  0.0
img-sp   4096 20291 86.0 38091 18.1 19939  8.2 30854 69.0 47779  4.2  95.3  0.4
loop-sp  4096 33421 77.6 33663 13.1 18546  5.1 28606 59.2 46659  6.0  85.2  0.1
32-bit:
xen0     1024 33857 94.0 45804  9.0 23269  0.0 25825 52.0 55628    0 185.0  0.0
img-sp   1448 32743 92.0 40703  8.0 23281  0.0 31139 75.0 56585    0 208.1  0.0
The patch is against cset 0426:840f33e54054 -- but is unlikely to
conflict with anything recent. You'll need libaio and libaio-devel on
your build machine for the tools to compile.
(Blktap readme follows.)
Thanks!
a.
---
Blktap Userspace Tools + Library
================================
Andrew Warfield and Julian Chesterfield
16th June 2006
{firstname.lastname}@cl.cam.ac.uk
The blktap userspace toolkit provides a user-level disk I/O
interface. The blktap mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries. Using these tools, blktap allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.
The key benefit of blktap is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well. Specifically:
- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
formats and other compression features can be easily implemented.
O_DIRECT and libaio allow high-performance implementation of even
sparse image formats such as QCoW, while still preserving the safe
ordering of metadata and data writes to ensure data integrity.
(As opposed to, for instance, the loopback driver and LVM snapshots,
both of which have very dangerous failure cases.)
- Accessing file-based images from userspace avoids problems related
to flushing dirty pages which are present in the Linux loopback
driver. (Specifically, doing a large number of writes to an
NFS-backed image doesn't result in the OOM killer going berserk.)
- Per-disk handler processes enable easier userspace policing of block
resources, and process-granularity QoS techniques (disk scheduling
and related tools) may be trivially applied to block devices.
- It's very easy to take advantage of userspace facilities such as
networking libraries, compression utilities, peer-to-peer
file-sharing systems and so on to build more complex block backends.
- Crashes are contained -- incremental development/debugging is very
fast.
- All block data is forwarded in a zero-copy fashion, allowing for
low-overhead userspace implementations.
How it works (in one paragraph):
Working in conjunction with the kernel blktap driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired. The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code. We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.
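As a rough illustration of the queue-then-submit pattern described above, here is a toy Python model. It is only a sketch: the class name ToyAsyncDisk and its methods are hypothetical and are not the blktap plugin interface, and it uses ordinary buffered file I/O rather than O_DIRECT and libaio. It shows only the shape of the idea: requests are queued, issued as one batch, and reported through completion callbacks.

```python
import os
from collections import deque


class ToyAsyncDisk:
    """Hypothetical queue-then-submit disk backed by a plain file."""

    def __init__(self, path, sector_size=512):
        self.sector_size = sector_size
        # Ordinary file access; the real tools are described as using
        # O_DIRECT and libaio, which this toy does not attempt.
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.pending = deque()

    def queue_read(self, sector, nr_sectors, callback):
        """Queue a read; nothing touches the file until submit()."""
        self.pending.append(("read", sector, nr_sectors, None, callback))

    def queue_write(self, sector, data, callback):
        """Queue a write of whole sectors; nothing is written until submit()."""
        nr_sectors = len(data) // self.sector_size
        self.pending.append(("write", sector, nr_sectors, data, callback))

    def submit(self):
        """Issue every queued request in order, then call its callback."""
        completed = 0
        while self.pending:
            op, sector, nr_sectors, data, callback = self.pending.popleft()
            offset = sector * self.sector_size
            if op == "read":
                result = os.pread(self.fd, nr_sectors * self.sector_size, offset)
            else:
                os.pwrite(self.fd, data, offset)
                result = None
            callback(op, sector, result)
            completed += 1
        return completed

    def close(self):
        os.close(self.fd)
```

A caller would queue several reads and writes, call submit() once, and handle results in the callbacks instead of blocking on each request.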
As of June 2006 the current supported disk formats are:
- Raw Images (both on partitions and in image files)
- File-backed Qcow disks (sparse qcow overlay on a raw image/partition).
- Standalone sparse Qcow disks (sparse disks, not backed by a parent image).
- Fast shareable RAM disk between VMs (requires some form of cluster-based
filesystem support e.g. OCFS2 in the guest kernel)
- Some VMDK images - your mileage may vary
Raw and QCow images have asynchronous backends and so should perform
fairly well. VMDK is based directly on the qemu vmdk driver, which is
synchronous (a.k.a. slow).
The qcow backends support existing qcow disks. There is also a set
of tools to generate and convert qcow images. With these tools (and
driver support), we maintain the qcow file format but adjust
parameters for higher performance with Xen -- using a larger segment
size (4096 instead of 512) and more coarsely allocating metadata
regions. We are continuing to improve this work and expect qcow
performance to improve a great deal over the next few weeks.
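To get a feel for why a larger unit helps, the arithmetic can be sketched as follows. This assumes that "segment size" means the allocation unit that the image's mapping metadata tracks; that reading is an assumption, and the helper name clusters_needed is hypothetical.

```python
def clusters_needed(image_bytes, unit_bytes):
    """Number of allocation units needed to cover an image (ceiling division)."""
    return -(-image_bytes // unit_bytes)


GIB = 1024 ** 3

# For a 4 GiB virtual disk, compare the two unit sizes mentioned above.
small_units = clusters_needed(4 * GIB, 512)
large_units = clusters_needed(4 * GIB, 4096)
```

Under that assumption, moving from 512 to 4096 bytes cuts the number of units the metadata has to describe by a factor of eight.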
Build and Installation Instructions
===================================
You will need libaio >= 0.3.104 on your target system to build the
tools (if you are installing RPMs, this means libaio and
libaio-devel).
Make sure to configure the blktap backend driver in your dom0 kernel. It
will cooperate fine with the existing backend driver, so you can
experiment with tap disks without breaking existing VM configs.
To build the tools separately, "make && make install" in
tools/blktap_user.
Using the Tools
===============
Prepare the image for booting. For qcow files, use the qcow utilities
installed earlier: qcow-create generates a blank standalone image
or a file-backed CoW image, and img2qcow takes an existing image or
partition and creates a sparse, standalone qcow-based file.
Start the userspace disk agent either on system boot (e.g. via an init
script) or manually by running 'blktapctrl'.
Customise the VM config file to use the 'tap' handler, followed by the
driver type. e.g. for a raw image such as a file or partition:
disk = ['tap:aio:<FILENAME>,sda1,w']
e.g. for a qcow image:
disk = ['tap:qcow:<FILENAME>,sda1,w']
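Since the VM config file is read as Python, the disk strings above can also be built with a small helper. This is only a sketch: the function name tap_disk and the example paths are hypothetical, and only the 'tap:<driver>:<FILENAME>,<device>,<mode>' layout comes from the examples above.

```python
def tap_disk(driver, filename, device, mode="w"):
    """Build one disk entry in the 'tap:<driver>:<file>,<device>,<mode>' form."""
    return "tap:%s:%s,%s,%s" % (driver, filename, device, mode)


# Hypothetical image paths, one raw image and one qcow image.
disk = [
    tap_disk("aio", "/path/to/raw.img", "sda1"),
    tap_disk("qcow", "/path/to/overlay.qcow", "sda2"),
]
```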
[-- Attachment #2: blktap.patch.gz --]
[-- Type: application/x-gzip, Size: 82205 bytes --]
[-- Attachment #3: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: NAHieu @ 2006-06-19 16:51 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
Wonderful!! Now we have dm-userspace and blktap, and these two seem
to do similar things. So what are the pros/cons of blktap compared
to dm-userspace?
Perhaps blktap will have better performance? Do you have any
benchmarks comparing dm-userspace & blktap?
Thanks.
H
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: Andrew Warfield @ 2006-06-19 17:22 UTC (permalink / raw)
To: NAHieu; +Cc: Xen Developers, Julian Chesterfield
> Wonderful!! Now we have dm-userspace and blktap, and these two seem
> to do similar things. So what are the pros/cons of blktap compared
> to dm-userspace?
I'm sure that Dan can comment on this as well. The main technical
difference is that (as I understand it at least) dm-userspace doesn't
bring block data through userspace, just the block request addresses,
which may be redirected. The current tap code maps the entire request
up, so you can potentially change the data and you can issue block I/O
using normal unix file access functions.
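The difference can be pictured with a toy sketch. Everything in it is hypothetical (the function names and the dictionary-based data model) and neither function reflects the real dm-userspace or blktap interfaces; it only contrasts "redirect the address" with "see and possibly change the data".

```python
def remap_request(device, sector, table):
    """Address-only model: return the redirected (device, sector) pair.

    The payload never appears here, so it cannot be inspected or changed.
    """
    return table.get((device, sector), (device, sector))


def handle_write(sector, data, store, transform):
    """Data-visible model: the handler receives the payload itself.

    It may transform the bytes (compress, encrypt, ...) before storing them.
    """
    store[sector] = transform(data)
    return len(data)


def invert_bytes(data):
    """Stand-in for a real transformation such as encryption."""
    return bytes(b ^ 0xFF for b in data)
```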
My intuition is that an approach like dm-userspace can be made more
efficient in the long run, but right now it's going to be slower as
you need to do copies of guest data pages as requests go through the
device mapper kernel code. This should be fixable though. I'm also
not sure how carefully dm-u watches block completion responses to
ensure safety of metadata updates relative to data writes. This too
should be fixable -- I just don't know if the user-level tools can
currently request completion notifications on requests that they've
processed. A benefit to the dm-user patch is that it is more of a
linux approach than a xen+linux approach. Dm-user will be generally
useful in the linux tree, whereas our stuff takes advantage of
Xen-specific things to get high performance (i.e. zero-copy data
movement).
dm-user also has the benefit of being able to map images directly in
dom0, which is very useful for tools and is something we haven't yet
added. Similarly though, one downside of dm-user, which is absolutely
no fault of the developers, is the dependency on the Linux loopback
driver, which has some bad failure characteristics that can result
both in data being acknowledged as written even though it hasn't been,
and in the OOM killer going insane. I think some fixes to loop probably
need to be applied in the near future given how much people are
generally depending on the code with VMs.
Blktap is a bit of a bigger hammer -- requests are moved to userspace
and the current backends do everything there. This gives you a lot
more flexibility in terms of developing virtual block devices. Take a
look at tools/blktap_user/block-*.c to see what plugins look like,
they're pretty tidy imo. ;) The current code has the immediate benefit
of being fully integrated with the tools and so on, so should be easy
to play with and extend. Having access to block contents also makes
it possible to do things like compression, encryption,
content-addressable storage, and memory-backed block devices.
I suspect that the ideal answer lies somewhere in between the two.
Julian and I have talked about extending the tap driver to combine it
with blkback and allow block address translation without access to
request contents.
That's my biased view ;) -- I'm sure Dan can clear things up a bit.
Now that we have this code to the list, Julian and I are hoping to
take a closer look at dm-user and get a better sense of the relative
merits of the two approaches.
a.
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: NAHieu @ 2006-06-19 18:41 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
Andrew, I am compiling the code, but I got the error below:
......
make[3]: Entering directory
`/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
gcc -O2 -fomit-frame-pointer -DNDEBUG -m32 -march=i686 -Wall
-Wstrict-prototypes -Wdeclaration-after-statement -D__XEN_TOOLS__
-fPIC -Wall -Werror -Wno-unused -g3 -fno-strict-aliasing -I
../../../tools/libxc -I.. -I. -I../../xenstore -D_FILE_OFFSET_BITS=64
-D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -D_GNU_SOURCE
-Wp,-MD,.tapdisk.d -o tapdisk -L../../../tools/libxc
\
-L../../../tools/xenstore -lxenstore -lblktap block-aio.o
block-sync.o block-vmdk.o block-ram.o block-qcow.o aes.o tapdisk.c
-L. -L.. -laio -lz
tapdisk.c:19:23: error: db.h: No such file or directory
make[3]: *** [tapdisk] Error 1
make[3]: Leaving directory
`/home/hieu/projects/xen/blktap/tools/blktap_user/aiotools'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/hieu/projects/xen/blktap/tools/blktap_user'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/home/hieu/projects/xen/blktap/tools'
make: *** [install-tools] Error 2
I have no idea what the file db.h is. Did you miss something?
Thanks.
H
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: Andrew Warfield @ 2006-06-19 21:07 UTC (permalink / raw)
To: NAHieu; +Cc: Xen Developers, Julian Chesterfield
> Andrew, I am compiling the code, but I got the below error:
> ...
> tapdisk.c:19:23: error: db.h: No such file or directory
Oops -- that's left over from an old version of the code that used
Berkeley DB as a test. Just remove that line and you should be in
business -- it should be completely unnecessary.
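If you would rather script that than edit by hand, something like the following sketch would do it. It assumes the offending line is an ordinary include of db.h, which the compiler error suggests but the thread does not show; the function name drop_include is hypothetical.

```python
import re


def drop_include(source_text, header):
    """Return source_text without any '#include' line naming the given header."""
    pattern = re.compile(r'^\s*#\s*include\s*[<"]%s[>"]' % re.escape(header))
    kept = [
        line
        for line in source_text.splitlines(keepends=True)
        if not pattern.match(line)
    ]
    return "".join(kept)
```

The result would be written back over tapdisk.c; keeping a backup copy first is probably wise.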
a.
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: Dan Smith @ 2006-06-19 21:16 UTC (permalink / raw)
To: Andrew Warfield; +Cc: NAHieu, Xen Developers, Julian Chesterfield
AW> I'm sure that Dan can comment on this as well. The main technical
AW> difference is that (as I understand it at least) dm-userspace
AW> doesn't bring block data through userspace, just the block request
AW> addresses, which may be redirected. The current tap code maps the
AW> entire request up, so you can potentially change the data and you
AW> can issue block I/O using normal unix file access functions.
Yup, that's a correct assessment.
AW> My intuition is that an approach like dm-userspace can be made
AW> more efficient in the long run, but right now it's going to be
AW> slower as you need to do copies of guest data pages as requests go
AW> through the device mapper kernel code.
Why do you say that? I would imagine that blkback provides the domU
pages as the target pages in the request, is that right? In that
case, the data coming off of the disk should go directly into the domU
page. Remember that dm-userspace doesn't do anything other than
rewriting of the destination device and sector of a request. So,
however it works for blkback now, is how it works with dm-userspace in
the mix.
AW> This should be fixable though. I'm also not sure how carefully
AW> dm-u watches block completion responses to ensure safety of
AW> metadata updates relative to data writes. This too should be
AW> fixable -- I just don't know if the user-level tools can currently
AW> request completion notifications on requests that they've
AW> processed.
So, right now, we're a little optimistic about metadata writing. It
will be relatively easy to hijack the callback routine for the disk
request (a technique which is heavily used in the rest of the block
layer) to get a completion trigger. We can then notify userspace for
the metadata write and then trigger the original callback routine for
completion.
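The callback-hijacking idea can be sketched in a few lines. This is a toy model only: the request is a plain dictionary, the names hook_completion and on_data_written are hypothetical, and nothing here is the block layer's real API. It just shows the ordering described above, where the data write completes, userspace is notified so it can write metadata, and only then does the original completion run.

```python
def hook_completion(request, on_data_written):
    """Wrap a request's completion callback with a notification step."""
    original = request["on_complete"]

    def wrapper(status):
        # Notify only for successful writes (status 0 in this toy model),
        # so metadata is never updated for data that failed to land.
        if status == 0:
            on_data_written(request)
        # Always finish by running the callback that was there before.
        original(status)

    request["on_complete"] = wrapper
    return request
```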
AW> A benefit to the dm-user patch is that it is more of a linux
AW> approach than a xen+linux approach. Dm-user will be generally
AW> useful in the linux tree
Right, this is a huge advantage, I think. Being able to mount images
as if they were disks will be quite helpful. Another benefit is the
ability to easily convert between formats. Converting a vmdk to a
qcow is as easy as mounting both and doing a "cp -R" between them.
AW> Similarly though, one downside of dm-user, that is absolutely no
AW> fault of the developers, is the dependency on the linux loopback
AW> driver
Just a clarification, this is only if file images are used. If using
LVMs or partitions or some other block device, we don't use the loop
driver.
AW> which has some bad failure characteristics which can result in
AW> both data being acknowledged as written even though it hasn't
AW> been, and the OOM killer going insane. I think some fixes to loop
AW> probably need to be applied in the near future given how much
AW> people are generally depending on the code with VMs.
Can you elaborate about what specifically is wrong with the loop
driver?
AW> Julian and I have talked about extending the tap driver to combine
AW> it with blkback and allow block address translation without access
AW> to request contents.
Since the kernel already has a block address translation solution
(i.e. device-mapper), is there a benefit to adding another
xen-specific one?
Another question I have is this: doesn't the dependence on libaio
limit you to certain filesystems? For example, the page for libaio
doesn't mention reiserfs as supported. Does that mean that SLES
users won't be able to use ublkback?
Thanks for posting your code Andrew!
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
From: Anthony Liguori @ 2006-06-19 18:55 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
Hi Andy,
>
> Performance is quite good, and we intend to focus on this a bit more
> over the next few weeks, releasing updated patches as they are
> available. Bonnie results this morning are as follows (64-bit results
> compare against linux blkback+loopback file, Julian can follow up with
> loopback results for 32-bit later if anyone's interested):
>
>               -------Sequential Output-------- ---Sequential Input-- --Random--
>               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 64-bit:
> xen0     4096 40115 93.4 41067 12.7 22757  1.2 32532 56.7 53724  0.4 121.4  0.0
> img-sp   4096 20291 86.0 38091 18.1 19939  8.2 30854 69.0 47779  4.2  95.3  0.4
> loop-sp  4096 33421 77.6 33663 13.1 18546  5.1 28606 59.2 46659  6.0  85.2  0.1
>
> 32-bit:
> xen0     1024 33857 94.0 45804  9.0 23269  0.0 25825 52.0 55628    0 185.0  0.0
> img-sp   1448 32743 92.0 40703  8.0 23281  0.0 31139 75.0 56585    0 208.1  0.0
What is img-sp? Is this blktap + a physical device or is this blktap
with something like qcow?
The numbers are a tad worse than I'd expect them to be if it was a physical
device. Theoretically, linux-aio is inserting requests directly into
the backend. I expect there to be a certain amount of CPU overhead from
context switching but since it's still zero-copy, I wouldn't expect less
CPU usage and less throughput.
Any idea why this is or am I just totally misunderstanding how things
should behave :-)
> Working in conjunction with the kernel blktap driver, all disk I/O
> requests from VMs are passed to the userspace daemon (using a shared
> memory interface) through a character device. Each active disk is
> mapped to an individual device node, allowing per-disk processes to
> implement individual block devices where desired. The userspace
> drivers are implemented using asynchronous (Linux libaio),
> O_DIRECT-based calls to preserve the unbuffered, batched and
> asynchronous request dispatch achieved with the existing blkback
> code. We provide a simple, asynchronous virtual disk interface that
> makes it quite easy to add new disk implementations.
>
I very much like the idea of a userspace block device backend. Have you
considered what it would take to completely replace blkback with a
userspace backend? I'm also curious why you chose a character device
to interact with the ring queue instead of just attaching to the ring
queue directly in userspace.
I think the whole discussion of COW support is orthogonal to a userspace
backend FWIW so I'll save that part of the discussion for another thread :-)
Regards,
Anthony Liguori
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 18:55 ` Anthony Liguori
@ 2006-06-19 19:22 ` Andrew Warfield
2006-06-19 19:26 ` Andrew Warfield
1 sibling, 0 replies; 44+ messages in thread
From: Andrew Warfield @ 2006-06-19 19:22 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Xen Developers, Julian Chesterfield
Hi Anthony,
> What is img-sp? Is this blktap + a physical device or is this blktap
> with something like qcow?
Oops, good question. This is blktap backed off of a sparse image file
(generated with something along the lines of "dd if=/dev/zero
of=./scratch.img bs=1024 seek=<big number>"). So this test is
directly comparable to blkback with a loopback-mounted image, which is
what's shown on the next line.
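A sparse scratch image of this kind can also be produced without dd. The following is a minimal sketch, assuming a filesystem that supports sparse files; the function name, path and 64 MiB size are illustrative and do not come from the original mail.

```python
import os
import tempfile


def create_sparse_image(path, size_bytes):
    """Create a file of size_bytes without writing any data blocks.

    Only the file length is set; on filesystems with sparse-file support
    the blocks are allocated lazily as they are first written.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)
    return os.stat(path)


if __name__ == "__main__":
    # Illustrative path and size; pick whatever suits the guest.
    image = os.path.join(tempfile.mkdtemp(), "scratch.img")
    st = create_sparse_image(image, 64 * 1024 * 1024)
    print("apparent size:", st.st_size)
    print("allocated bytes:", st.st_blocks * 512)
```

The apparent size is what the guest sees as the disk size; the allocated size grows only as the guest writes.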
> The numbers are a tad worse than I'd expect them to be if it was a physical
> device. Theoretically, linux-aio is inserting requests directly into
> the backend. I expect there to be a certain amount of CPU overhead from
> context switching but since it's still zero-copy, I wouldn't expect less
> CPU usage and less throughput.
>
> Any idea why this is or am I just totally misunderstanding how things
> should behave :-)
Performance on raw devices is certainly better than on images -- I
didn't have a spare partition to work with on my test box this morning
(maybe I can use that as an excuse to get an extra disk), but will get
some results posted on this asap.
> I very much like the idea of a userspace block device backend. Have you
> considered what it would take to completely replace blkback with a
> userspace backend? I'm also curious why you chose a character device
> to interact with the ring queue instead of just attaching to the ring
> queue directly in userspace.
The current blktap code has functional parity with blkback. Just
change 'phys:' to 'tap:aio:' in your vm config files and you're set.
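Since Xen VM config files use Python syntax, the switch can be sketched as a small helper. This is only an illustration: the helper name and the example device path are hypothetical, and only the 'phys:' and 'tap:aio:' prefixes come from the thread.

```python
def to_tap_aio(disk_entries):
    """Rewrite 'phys:' disk specifications to use the 'tap:aio:' handler."""
    rewritten = []
    for entry in disk_entries:
        if entry.startswith("phys:"):
            entry = "tap:aio:" + entry[len("phys:"):]
        rewritten.append(entry)
    return rewritten


# Hypothetical guest configuration line before the change.
disk = ["phys:/dev/vg0/guest1,sda1,w"]
disk = to_tap_aio(disk)
print(disk)
```

Entries that use other handlers are passed through unchanged.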
> I think the whole discussion of COW support is orthogonal to a userspace
> backend FWIW so I'll save that part of the discussion for another thread :-)
Fair enough, I'll look forward to it. ;)
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 18:55 ` Anthony Liguori
2006-06-19 19:22 ` Andrew Warfield
@ 2006-06-19 19:26 ` Andrew Warfield
2006-06-19 19:51 ` Anthony Liguori
1 sibling, 1 reply; 44+ messages in thread
From: Andrew Warfield @ 2006-06-19 19:26 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Xen Developers, Julian Chesterfield
> I very much like the idea of a userspace block device backend. Have you
> considered what it would take to completely replace blkback with a
> userspace backend? I'm also curious why you chose a character device
> to interact with the ring queue instead of just attaching to the ring
> queue directly in userspace.
Oops (again), missed answering your char device question. We just use
a char device to pin up a region of virtual address space for each
disk as it's presented in userspace. Anyone familiar with blkback
will recognise the technique. In our case, the first page is a
request/response ring between tap driver and application, and the
remainder is a sparsely populated address space where data pages are
mapped as they fly through. We signal down with ioctl()s, and up
using poll().
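The layout described here (ring in the first page, data pages after it) can be pictured with a short sketch. Everything specific below is an assumption made for illustration: the slot and segment counts, the fixed slot-to-page arithmetic, and the use of anonymous memory in place of an mmap() of the real character device.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE
RING_SLOTS = 32      # assumed number of request slots in the ring page
SEGS_PER_REQ = 11    # assumed maximum data pages per request


def data_page_offset(slot, segment):
    """Byte offset of the data page for (slot, segment) within the mapping.

    Page 0 holds the request/response ring; data pages follow it, grouped
    by ring slot (an assumed layout, not taken from the patch).
    """
    if not (0 <= slot < RING_SLOTS and 0 <= segment < SEGS_PER_REQ):
        raise ValueError("slot or segment out of range")
    return (1 + slot * SEGS_PER_REQ + segment) * PAGE_SIZE


# Anonymous memory stands in for mmap() on the per-disk character device.
region = mmap.mmap(-1, (1 + RING_SLOTS * SEGS_PER_REQ) * PAGE_SIZE)

# Pretend the driver mapped a guest page for slot 3, segment 0.
off = data_page_offset(3, 0)
region[off:off + 5] = b"hello"
print(off // PAGE_SIZE, region[off:off + 5])
region.close()
```

In the real driver the data area is only sparsely populated, with pages appearing while a request is in flight; the sketch just shows how one address range can carry both the ring and the data.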
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 19:26 ` Andrew Warfield
@ 2006-06-19 19:51 ` Anthony Liguori
0 siblings, 0 replies; 44+ messages in thread
From: Anthony Liguori @ 2006-06-19 19:51 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
Andrew Warfield wrote:
>> I very much like the idea of a userspace block device backend. Have you
>> considered what it would take to completely replace blkback with a
>> userspace backend? I'm also curious why you chose a character device
>> to interact with the ring queue instead of just attaching to the ring
>> queue directly in userspace.
>
> Oops (again), missed answering your char device question. We just use
> a char device to pin up a region of virtual address space for each
> disk as it's presented in userspace.
Is this strictly needed though? My current understanding (which may be
totally off) of this device is that it contains:
- first page is ring/queue
- rest of file is mmap()'able and as requests come in over the blkfront
queue, you map them into that address space
- poll/ioctl is used for event channel notification
Couldn't you do all of this in pure userspace though with privcmd and
evtchn?
Regards,
Anthony Liguori
> Anyone familiar with blkback
> will recognise the technique. In our case, the first page is a
> request/response ring between tap driver and application, and the
> remainder is a sparsely populated address space where data pages are
> mapped as they fly through. We signal down with ioctl()s, and up
> using poll().
>
> a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 16:19 [PATCH] Blktap: Userspace file-based image support. (RFC) Andrew Warfield
2006-06-19 16:51 ` NAHieu
2006-06-19 18:55 ` Anthony Liguori
@ 2006-06-19 19:15 ` Anthony Liguori
2006-06-19 19:31 ` Andrew Warfield
2006-06-29 3:35 ` Rusty Russell
3 siblings, 1 reply; 44+ messages in thread
From: Anthony Liguori @ 2006-06-19 19:15 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
Couple general comments on the code:
Please don't introduce more (ab)uses of /proc. Sure it's just for
debugging but there's no reason to not make that sysfs.
I'm not an expert here, but the nopage handlers that I've seen return
NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
I think it's better to use C99 initialization than GCC:
owner: ..., => .owner = ...,
Some of the indenting is a bit off from Linux CodingStyle. Stuff like
if( => if ( and some random spaces after an (.
There's some code commented out with C++ comments too.
What's the significance of /**BLKTAP**/ and /**TAPEND**/?
I'm a little surprised to see these conversion tools too. Wouldn't it
be easier to just add some parameters to qemu-img?
Pretty interesting stuff, thanks for posting.
Regards,
Anthony Liguori
Andrew Warfield wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen. In
> addition to being more flavourful, containing half the fat, and
> removing stains twice as well as the old driver, this stuff adds a
> userspace block backend and lets you use raw (without loopback), qcow,
> and vmdk-based image files for your domUs. There's also a fun little
> driver that provides a shared-memory block device which, in
> combination with OCFS2, represents a cheap-and-cheerful fast shared
> filesystem between multiple domUs.
>
> This code has been (somewhat lackadaisically) developed over the past
> few years at Cambridge and has recently enjoyed massive improvements
> thanks to the considerable efforts of Julian Chesterfield.
>
> The code "works for us" and has been tested on a grand total of about
> three machines. We would love to have feedback from a broader
> audience, in terms of both trying out the tools and inspecting the code.
> We'll plan to release new patches at about 1-week intervals based on
> comments.
>
> Performance is quite good, and we intend to focus on this a bit more
> over the next few weeks, releasing updated patches as they are
> available. Bonnie results this morning are as follows (64-bit results
> compare against linux blkback+loopback file, Julian can follow up with
> loopback results for 32-bit later if anyone's interested):
>
> -------Sequential Output-------- ---Sequential Input-- --Random--
> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
> 64-bit:
> xen0 4096 40115 93.4 41067 12.7 22757 1.2 32532 56.7 53724 0.4 121.4 0.0
> img-sp 4096 20291 86.0 38091 18.1 19939 8.2 30854 69.0 47779 4.2 95.3 0.4
> loop-sp 4096 33421 77.6 33663 13.1 18546 5.1 28606 59.2 46659 6.0 85.2 0.1
>
> 32-Bit:
> xen0 1024 33857 94.0 45804 9.0 23269 0.0 25825 52.0 55628 0 185.0 0.0
> img-sp 1448 32743 92.0 40703 8.0 23281 0.0 31139 75.0 56585 0 208.1 0.0
>
> The patch is against cset 0426:840f33e54054 -- but is unlikely to
> conflict with anything recent. You'll need libaio and libaio-devel on
> your build machine for the tools to compile.
>
>
> (Blktap readme follows.)
>
> Thanks!
> a.
>
> ---
>
>
> Blktap Userspace Tools + Library
> ================================
>
> Andrew Warfield and Julian Chesterfield
> 16th June 2006
>
> {firstname.lastname}@cl.cam.ac.uk
>
> The blktap userspace toolkit provides a user-level disk I/O
> interface. The blktap mechanism involves a kernel driver that acts
> similarly to the existing Xen/Linux blkback driver, and a set of
> associated user-level libraries. Using these tools, blktap allows
> virtual block devices presented to VMs to be implemented in userspace
> and to be backed by raw partitions, files, network, etc.
>
> The key benefit of blktap is that it makes it easy and fast to write
> arbitrary block backends, and that these user-level backends actually
> perform very well. Specifically:
>
> - Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
> formats and other compression features can be easily implemented.
> O_DIRECT and libaio allow high-performance implementation of even
> sparse image formats such as QCoW, while still preserving the safe
> ordering of metadata and data writes to ensure data integrity.
> (As opposed to, for instance, the loopback driver and LVM snaps,
> which both have very dangerous failure cases.)
>
> - Accessing file-based images from userspace avoids problems related
> to flushing dirty pages which are present in the Linux loopback
> driver. (Specifically, doing a large number of writes to an
> NFS-backed image doesn't result in the OOM killer going berserk.)
>
> - Per-disk handler processes enable easier userspace policing of block
> resources, and process-granularity QoS techniques (disk scheduling
> and related tools) may be trivially applied to block devices.
>
> - It's very easy to take advantage of userspace facilities such as
> networking libraries, compression utilities, peer-to-peer
> file-sharing systems and so on to build more complex block backends.
>
> - Crashes are contained -- incremental development/debugging is very
> fast.
>
> - All block data is forwarded in a zero-copy fashion, allowing for
> low-overhead userspace implementations.
>
> How it works (in one paragraph):
>
> Working in conjunction with the kernel blktap driver, all disk I/O
> requests from VMs are passed to the userspace daemon (using a shared
> memory interface) through a character device. Each active disk is
> mapped to an individual device node, allowing per-disk processes to
> implement individual block devices where desired. The userspace
> drivers are implemented using asynchronous (Linux libaio),
> O_DIRECT-based calls to preserve the unbuffered, batched and
> asynchronous request dispatch achieved with the existing blkback
> code. We provide a simple, asynchronous virtual disk interface that
> makes it quite easy to add new disk implementations.
>
>
> As of June 2006 the current supported disk formats are:
>
> - Raw Images (both on partitions and in image files)
> - File-backed Qcow disks (sparse qcow overlay on a raw image/partition).
> - Standalone sparse Qcow disks (sparse disks, not backed by a parent
> image).
> - Fast shareable RAM disk between VMs (requires some form of cluster-based
> filesystem support e.g. OCFS2 in the guest kernel)
> - Some VMDK images - your mileage may vary
>
> Raw and QCow images have asynchronous backends and so should perform
> fairly well. VMDK is based directly on the qemu vmdk driver, which is
> synchronous (a.k.a. slow).
>
> The qcow backends support existing qcow disks. There are also a set
> of tools to generate and convert qcow images. With these tools (and
> driver support), we maintain the qcow file format but adjust
> parameters for higher performance with Xen -- using a larger segment
> size (4096 instead of 512) and more coarsely allocating metadata
> regions. We are continuing to improve this work and expect qcow
> performance to improve a great deal over the next few weeks.
>
> Build and Installation Instructions
> ===================================
>
> You will need libaio >= 0.3.104 on your target system to build the
> tools (if you are installing RPMs, this means libaio and
> libaio-devel).
>
> Make sure to configure the blktap backend driver in your dom0 kernel. It
> will cooperate fine with the existing backend driver, so you can
> experiment with tap disks without breaking existing VM configs.
>
> To build the tools separately, "make && make install" in
> tools/blktap_user.
>
>
> Using the Tools
> ===============
>
> Prepare the image for booting. For qcow files use the qcow utilities
> installed earlier. e.g. qcow-create generates a blank standalone image
> or a file-backed CoW image. img2qcow takes an existing image or
> partition and creates a sparse, standalone qcow-based file.
>
> Start the userspace disk agent either on system boot (e.g. via an init
> script) or manually => 'blktapctrl'
>
> Customise the VM config file to use the 'tap' handler, followed by the
> driver type. e.g. for a raw image such as a file or partition:
>
> disk = ['tap:aio:<FILENAME>,sda1,w']
>
> e.g. for a qcow image:
>
> disk = ['tap:qcow:<FILENAME>,sda1,w']
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 19:15 ` Anthony Liguori
@ 2006-06-19 19:31 ` Andrew Warfield
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Warfield @ 2006-06-19 19:31 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Xen Developers, Julian Chesterfield
Excellent comments, thanks.
> Please don't introduce more (ab)uses of /proc. Sure it's just for
> debugging but there's no reason to not make that sysfs.
>
> I'm not an expert here, but the nopage handlers that I've seen return
> NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
>
> I think it's better to use C99 initialization than GCC:
>
> owner: ..., => .owner = ...,
>
> Some of the indenting is a bit off from Linux CodingStyle. Stuff like
> if( => if ( and some random spaces after an (.
>
> There's some code commented out with C++ comments too.
All good -- I'll take a pass through and fix all these this week.
> What's the significance of /**BLKTAP**/ and /**TAPEND**/?
For a while we were maintaining the kernel tap driver as a diff
against blkback, to pick up fixes quickly -- those markers were just
to mark differing regions. I think the current code has diverged
enough to make this approach untenable.
> I'm a little surprised to see these conversion tools too. Wouldn't it
> be easier to just add some parameters to qemu-img?
The image tools use our plugins (rather than qemu's) to build disks --
most importantly they adjust the layout to build better-performing
images. Fair point though, I think Julian was going to look and see
if we could get away with just using the qemu tools.
> Pretty interesting stuff, thanks for posting.
Thanks for the feedback!
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 16:19 [PATCH] Blktap: Userspace file-based image support. (RFC) Andrew Warfield
` (2 preceding siblings ...)
2006-06-19 19:15 ` Anthony Liguori
@ 2006-06-29 3:35 ` Rusty Russell
2006-06-29 5:24 ` Andrew Warfield
2006-06-29 11:49 ` Anthony Liguori
3 siblings, 2 replies; 44+ messages in thread
From: Rusty Russell @ 2006-06-29 3:35 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen. In
> addition to being more flavourful, containing half the fat, and
> removing stains twice as well as the old driver, this stuff adds a
> userspace block backend and lets you use raw (without loopback), qcow,
> and vmdk-based image files for your domUs. There's also a fun little
> driver that provides a shared-memory block device which, in
> combination with OCFS2, represents a cheap-and-cheerful fast shared
> filesystem between multiple domUs.
Hi Andrew,
I like the idea of block servers in userspace, but I'm curious. When I
wrote the simple share block server I couldn't see an obvious
justification for multiple outstanding requests (with AIO/threads and
all that entails), so I went for the trivial single request approach.
It seems to me that the backend doesn't have much information the front
end doesn't have.
Just wondered if you'd tried a naive approach first...
Thanks!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 3:35 ` Rusty Russell
@ 2006-06-29 5:24 ` Andrew Warfield
2006-06-29 6:31 ` Rusty Russell
2006-06-29 11:49 ` Anthony Liguori
1 sibling, 1 reply; 44+ messages in thread
From: Andrew Warfield @ 2006-06-29 5:24 UTC (permalink / raw)
To: Rusty Russell; +Cc: Xen Developers, Julian Chesterfield
> I like the idea of block servers in userspace, but I'm curious. When I
> wrote the simple share block server I couldn't see an obvious
> justification for multiple outstanding requests (with AIO/threads and
> all that entails), so I went for the trivial single request approach.
> It seems to me that the backend doesn't have much information the front
> end doesn't have.
Hi Rusty,
not sure I see what you are asking. A very early version of the
code just did synchronous dispatch (one blocking request at a time)
and was, as you might expect, very slow. You clearly want to keep the
block request queues as full as possible to amortize seeks... AIO just
lets me issue batches of requests at once, and so minimizes context
switching through userland -- which was something I was worried about
causing overhead on x86_64. I don't really think it adds that much
complexity.
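The point about keeping the queue full to amortize seeks can be illustrated with a toy model. This is a sketch only: head travel is counted as the distance between sector numbers, the "scheduler" is a plain ascending sort of each batch, and the request list is made up. None of it is measured from blktap.

```python
def head_travel(offsets, start=0):
    """Total distance the disk head moves servicing offsets in order."""
    travel, pos = 0, start
    for off in offsets:
        travel += abs(off - pos)
        pos = off
    return travel


def dispatch_synchronously(requests):
    """One blocking request at a time: the disk sees arrival order."""
    return head_travel(requests)


def dispatch_batched(requests, batch_size):
    """Queue batch_size requests at once; the scheduler may sort each batch."""
    travel, pos = 0, 0
    for i in range(0, len(requests), batch_size):
        batch = sorted(requests[i:i + batch_size])
        travel += head_travel(batch, start=pos)
        pos = batch[-1]
    return travel


requests = [900, 10, 850, 20, 800, 30, 750, 40]  # made-up sector numbers
print(dispatch_synchronously(requests))
print(dispatch_batched(requests, batch_size=4))
print(dispatch_batched(requests, batch_size=8))
```

The larger the batch the scheduler can see, the less the head moves in this model, which is the intuition behind issuing requests in batches rather than one at a time.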
The process-per-disk thing is optional in the current code, you
could just as easily build a single-threaded user backend. The
current model hopefully buys you a bit of resiliency against crashes
and maps per-disk request streams in a fairly clean way down onto the
block scheduler.
Am I missing your point?
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 5:24 ` Andrew Warfield
@ 2006-06-29 6:31 ` Rusty Russell
2006-06-29 14:34 ` Andrew Warfield
0 siblings, 1 reply; 44+ messages in thread
From: Rusty Russell @ 2006-06-29 6:31 UTC (permalink / raw)
To: Andrew Warfield; +Cc: Xen Developers, Julian Chesterfield
On Wed, 2006-06-28 at 22:24 -0700, Andrew Warfield wrote:
> > I like the idea of block servers in userspace, but I'm curious. When I
> > wrote the simple share block server I couldn't see an obvious
> > justification for multiple outstanding requests (with AIO/threads and
> > all that entails), so I went for the trivial single request approach.
> > It seems to me that the backend doesn't have much information the front
> > end doesn't have.
>
> Hi Rusty,
>
> not sure I see what you are asking. A very early version of the
> code just did synchronous dispatch (one blocking request at a time)
> and was, as you might expect, very slow. You clearly want to keep the
> block request queues as full as possible to amortize seeks...
Last I looked at the blkif front end, it uses a noop I/O scheduler, which
means that the only one doing scheduling is the backend. I can easily
imagine that if the backend is synchronous, this would be slow.
However, it's not clear to me that doing scheduling in the backend will
generally be faster than doing it in the front end. I suppose it should
be, if the backend domain were serving multiple frontends from the same
device.
> AIO just
> lets me issue batches of requests at once, and so minimizes context
> switching through userland -- which was something I was worried about
> causing overhead on x86_64. I don't really think it adds that much
> complexity.
Sure, I would have used a pool of processes because I'm old-fashioned,
but AIO is probably a better choice for multiple requests at once.
Cheers!
Rusty.
--
Help! Save Australia from the worst of the DMCA: http://linux.org.au/law
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 6:31 ` Rusty Russell
@ 2006-06-29 14:34 ` Andrew Warfield
2006-06-30 13:35 ` Stephen C. Tweedie
0 siblings, 1 reply; 44+ messages in thread
From: Andrew Warfield @ 2006-06-29 14:34 UTC (permalink / raw)
To: Rusty Russell; +Cc: Xen Developers, Julian Chesterfield
> Last I looked at the blkif front end, it uses a noop I/O scheduler, which
> means that the only one doing scheduling is the backend. I can easily
> imagine that if the backend is synchronous, this would be slow.
>
> However, it's not clear to me that doing scheduling in the backend will
> generally be faster than doing it in the front end. I suppose it should
> be, if the backend domain were serving multiple frontends from the same
> device.
Well, scheduling across multiple VM request streams is certainly one
reason for exposing as big a request aperture to the physical
(backend) disk scheduler as possible. The fact that the frontend
doesn't necessarily have any idea how its blocks are actually laid out
on the disk is another -- in the case of file-backed images for
instance.
> > AIO just
> > lets me issue batches of requests at once, and so minimizes context
> > switching through userland -- which was something I was worried about
> > causing overhead on x86_64. I don't really think it adds that much
> > complexity.
>
> Sure, I would have used a pool of processes because I'm old-fashioned,
> but AIO is probably a better choice for multiple requests at once.
My older code was written without the benefit of working AIO for xen
linux. I knocked up a thread pool to improve performance and it
worked reasonably well, although I found that you needed a fairly
large number of threads to saturate the disk (with blocking i/o, which
was a little naive ;) ), and it represented a fairly large chunk of
unnecessary moving parts.
The linux libaio stuff is pretty good actually. Requests map rather
directly down onto the kernel bio interface, so with aio the userland
block back code is doing a very similar thing to the in-kernel driver.
As Anthony points out, libaio is unthreaded, you just fill out a
batch of request structs and shove it down. It's very fast indeed and
quite low-overhead. My only real complaint is that despite a couple
of years discussing ways to do it on libaio-devel, the AIO developers
haven't settled on a unified way to poll on aio completions and normal
file handles, which is a bit of an inconvenience when you want to do
both.
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 14:34 ` Andrew Warfield
@ 2006-06-30 13:35 ` Stephen C. Tweedie
2006-06-30 14:17 ` Julian Chesterfield
0 siblings, 1 reply; 44+ messages in thread
From: Stephen C. Tweedie @ 2006-06-30 13:35 UTC (permalink / raw)
To: Andrew Warfield
Cc: Jeff Moyer, Rusty Russell, xen-devel@lists.xensource.com,
Julian Chesterfield
Hi,
On Thu, 2006-06-29 at 07:34 -0700, Andrew Warfield wrote:
> The linux libaio stuff is pretty good actually. Requests map rather
> directly down onto the kernel bio interface, so with aio the userland
> block back code is doing a very similar thing to the in-kernel driver.
Yep. I noticed that the blktap patch includes adding EPOLL to kernel
aio, though, and that has not (yet) been accepted upstream; is that
something that is absolutely necessary for blktap, or could you live
without it?
Is there any movement towards getting that upstream, since otherwise
we're introducing dependencies on core kernel infrastructure that is not
guaranteed to persist upstream? It looks like the sort of thing that
would be entirely reasonable upstream: EPOLL for aio seems to make a ton
of sense.
Cheers,
Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-30 13:35 ` Stephen C. Tweedie
@ 2006-06-30 14:17 ` Julian Chesterfield
2006-06-30 18:41 ` Jeff Moyer
0 siblings, 1 reply; 44+ messages in thread
From: Julian Chesterfield @ 2006-06-30 14:17 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Andrew Warfield, Jeff Moyer, Rusty Russell,
xen-devel@lists.xensource.com, Julian Chesterfield
On 30 Jun 2006, at 14:35, Stephen C. Tweedie wrote:
> Hi,
>
> On Thu, 2006-06-29 at 07:34 -0700, Andrew Warfield wrote:
>
>> The linux libaio stuff is pretty good actually. Requests map rather
>> directly down onto the kernel bio interface, so with aio the userland
>> block back code is doing a very similar thing to the in-kernel driver.
>
> Yep. I noticed that the blktap patch includes adding EPOLL to kernel
> aio, though, and that has not (yet) been accepted upstream; is that
> something that is absolutely necessary for blktap, or could you live
> without it?
Without the completion event poll an alternative was to block on
io_getevents for the batch to complete, or to periodically test for
queued responses. This approach was definitely preferable since it fit
very nicely into the asynch architecture we were working towards for
the userspace drivers. Without the completion poll, the performance
would most likely degrade, although we haven't done any tests to
measure by how much.
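The appeal of a pollable completion source is that one wait covers both new requests and finished I/O. The following is a rough sketch of that event-loop shape only: ordinary pipes stand in for the tap device notifications and for a pollable AIO completion descriptor, and all names are hypothetical. It does not use libaio or the EPOLL patch itself.

```python
import os
import select

# Pipes stand in for the two event sources: the tap device's ring
# notifications and a pollable AIO completion descriptor.
ring_r, ring_w = os.pipe()
aio_r, aio_w = os.pipe()

poller = select.poll()
poller.register(ring_r, select.POLLIN)
poller.register(aio_r, select.POLLIN)

names = {ring_r: "new requests on ring", aio_r: "aio completions ready"}


def wait_for_events(timeout_ms):
    """Sleep until either source is readable; return what fired."""
    fired = []
    for fd, _mask in poller.poll(timeout_ms):
        os.read(fd, 1)  # consume the one-byte notification
        fired.append(names[fd])
    return sorted(fired)


os.write(aio_w, b"x")     # simulate a batch completing
print(wait_for_events(1000))
os.write(ring_w, b"x")    # simulate the guest queueing requests
os.write(aio_w, b"x")
print(wait_for_events(1000))
```

Without a pollable completion source, the loop would have to either block waiting for the batch to finish (and miss new requests meanwhile) or wake up periodically to check, which are the two alternatives mentioned above.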
>
> Is there any movement towards getting that upstream, since otherwise
> we're introducing dependencies on core kernel infrastructure that is not
> guaranteed to persist upstream? It looks like the sort of thing that
> would be entirely reasonable upstream: EPOLL for aio seems to make a ton
> of sense.
Agreed. We'd like to see the EPOLL facility adopted in the mainstream
AIO architecture. The current patch was submitted on the linux-aio list
in response to a query we sent about a month ago, however I don't
believe there has been any movement to officially add it. It's on our
agenda to follow up with the AIO folks since it definitely should
belong in the mainstream kernel rather than as a xen patch.
- Julian
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-30 14:17 ` Julian Chesterfield
@ 2006-06-30 18:41 ` Jeff Moyer
0 siblings, 0 replies; 44+ messages in thread
From: Jeff Moyer @ 2006-06-30 18:41 UTC (permalink / raw)
To: Julian Chesterfield
Cc: Julian Chesterfield, Rusty Russell, xen-devel@lists.xensource.com,
Andrew Warfield
==> Regarding Re: [Xen-devel] [PATCH] Blktap: Userspace file-based image support. (RFC); Julian Chesterfield <jac90@cl.cam.ac.uk> adds:
jac90> Agreed. We'd like to see the EPOLL facility adopted in the
jac90> mainstream AIO architecture. The current patch was submitted on the
jac90> linux-aio list in response to a query we sent about a month ago,
jac90> however I don't believe there has been any movement to officially
jac90> add it. It's on our agenda to follow-up with the AIO folks since it
jac90> definitely should belong in the mainstream kernel rather than as a
jac90> xen patch.
Are you planning on sending it along soonish? When you post it, I'll set
aside some time to kick the tires and provide feedback.
Thanks!
Jeff
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 3:35 ` Rusty Russell
2006-06-29 5:24 ` Andrew Warfield
@ 2006-06-29 11:49 ` Anthony Liguori
2006-06-29 12:26 ` Laurent Vivier
1 sibling, 1 reply; 44+ messages in thread
From: Anthony Liguori @ 2006-06-29 11:49 UTC (permalink / raw)
To: Rusty Russell; +Cc: Andrew Warfield, Xen Developers, Julian Chesterfield
Rusty Russell wrote:
> On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
>
>> Attached to this email is a patch containing the (new and improved)
>> blktap Linux driver and associated userspace tools for Xen. In
>> addition to being more flavourful, containing half the fat, and
>> removing stains twice as well as the old driver, this stuff adds a
>> userspace block backend and lets you use raw (without loopback), qcow,
>> and vmdk-based image files for your domUs. There's also a fun little
>> driver that provides a shared-memory block device which, in
>> combination with OCFS2, represents a cheap-and-cheerful fast shared
>> filesystem between multiple domUs.
>>
>
> Hi Andrew,
>
> I like the idea of block servers in userspace, but I'm curious. When I
> wrote the simple share block server I couldn't see an obvious
> justification for multiple outstanding requests (with AIO/threads and
> all that entails),
Are you thinking of posix-aio? posix-aio is "emulated" with threads and
normal read/select calls. The performance isn't that great.
I believe blktap is using linux-aio which doesn't use threads (it uses
the linux specific interface). I've seen a number of benchmarks where
linux-aio is significantly better than posix-aio.
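To make the distinction concrete, here is a minimal sketch of the thread-emulated style, written in Python for brevity. The function name threaded_reads is hypothetical and this is not code from glibc or blktap; it only illustrates that each "asynchronous" request is an ordinary blocking read handed to a worker thread. A kernel-AIO design instead submits a batch of requests to the kernel and collects completion events later, with no thread per request.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def threaded_reads(path, requests, workers=4):
    """Serve several (offset, length) reads at once using blocking pread()
    calls on worker threads, roughly how a thread-emulated AIO layer behaves."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each request is an ordinary blocking read; concurrency comes
            # only from running the reads on separate threads.
            futures = [pool.submit(os.pread, fd, length, offset)
                       for offset, length in requests]
            return [f.result() for f in futures]
    finally:
        os.close(fd)
```

Results come back in request order because the futures are collected in submission order, even though the underlying reads may finish in any order.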
Regards,
Anthony Liguori
> so I went for the trivial single request approach.
> It seems to me that the backend doesn't have much information the front
> end doesn't have.
>
> Just wondered if you'd tried a naive approach first...
>
> Thanks!
> Rusty.
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-29 11:49 ` Anthony Liguori
@ 2006-06-29 12:26 ` Laurent Vivier
0 siblings, 0 replies; 44+ messages in thread
From: Laurent Vivier @ 2006-06-29 12:26 UTC (permalink / raw)
To: Anthony Liguori
Cc: Andrew Warfield, Rusty Russell, Xen Developers,
Julian Chesterfield
[-- Attachment #1.1: Type: text/plain, Size: 1446 bytes --]
Anthony Liguori wrote:
> Rusty Russell wrote:
>> On Mon, 2006-06-19 at 09:19 -0700, Andrew Warfield wrote:
>>
>>> Attached to this email is a patch containing the (new and improved)
>>> blktap Linux driver and associated userspace tools for Xen. In
>>> addition to being more flavourful, containing half the fat, and
>>> removing stains twice as well as the old driver, this stuff adds a
>>> userspace block backend and let you use raw (without loopback), qcow,
>>> and vmdk-based image files for your domUs. There's also a fun little
>>> driver that provides a shared-memory block device which, in
>>> combination with OCFS2, represents a cheap-and-cheerful fast shared
>>> filesystem between multiple domUs.
>>>
>>
>> Hi Andrew,
>>
>> I like the idea of block servers in userspace, but I'm curious.
>> When I
>> wrote the simple share block server I couldn't see an obvious
>> justification for multiple outstanding requests (with AIO/threads and
>> all that entails),
>
> Are you thinking of posix-aio? posix-aio is "emulated" with threads and
> normal read/select calls. The performance isn't that great.
Hi,
We have developed another implementation of POSIX AIO for Linux with better
performance, based on Linux kernel AIO. Have a look at:
http://www.bullopensource.org/posix/index.html
Laurent
--
Laurent Vivier
Bull, Architect of an Open World (TM)
http://www.bullopensource.org/ext4
[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]
[-- Attachment #2: Type: text/plain, Size: 138 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <C0BCD26E.5C31%julian@xensource.com>]
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
[not found] <C0BCD26E.5C31%julian@xensource.com>
@ 2006-06-19 21:42 ` Julian Chesterfield
2006-06-19 21:56 ` Anthony Liguori
0 siblings, 1 reply; 44+ messages in thread
From: Julian Chesterfield @ 2006-06-19 21:42 UTC (permalink / raw)
To: aliguori; +Cc: xen-devel
>
> On 19/6/06 8:15 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>
>> Couple general comments on the code:
>>
>> Please don't introduce more (ab)uses of /proc. Sure it's just for
>> debugging but there's no reason to not make that sysfs.
>>
>> I'm not an expert here, but the nopage handlers that I've seen return
>> NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
>>
>> I think it's better to use C99 initialization than GCC:
>>
>> owner: ..., => .owner = ...,
>>
>> Some of the indenting is a bit off from Linux CodingStyle. Stuff like
>> if( => if ( and some random spaces after an (.
>>
>> There's some code commented out with C++ comments too.
>>
>> What's the significance of /**BLKTAP**/ and /**TAPEND**/?
>>
>> I'm a little surprised to see these conversion tools too. Wouldn't it
>> be easier to just add some parameters to qemu-img?
Thanks for the comments, Anthony. When we initially played with qcow
images it was easier to knock up our own frontend to the plugins for
converting between the different image types and testing features like
image sparseness. We added an optimisation to the Xen qcow plugin that
allocates full extents for images without a backing file, as well as the
asynchronous callback architecture to enable request batching for AIO.
We could certainly adapt qemu-img to use these and other features. I'm not
sure what the best approach would be for keeping the toolsets in sync
between the two projects, though.
Thanks,
Julian Chesterfield
>>
>> Pretty interesting stuff, thanks for posting.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>> Andrew Warfield wrote:
>>> Attached to this email is a patch containing the (new and improved)
>>> blktap Linux driver and associated userspace tools for Xen. In
>>> addition to being more flavourful, containing half the fat, and
>>> removing stains twice as well as the old driver, this stuff adds a
>>> userspace block backend and let you use raw (without loopback), qcow,
>>> and vmdk-based image files for your domUs. There's also a fun little
>>> driver that provides a shared-memory block device which, in
>>> combination with OCFS2, represents a cheap-and-cheerful fast shared
>>> filesystem between multiple domUs.
>>>
>>> This code has been (somewhat lackadaisically) developed over the past
>>> few years at Cambridge and has recently enjoyed massive improvements
>>> thanks to the considerable efforts of Julian Chesterfield.
>>>
>>> The code "works for us" and has been tested on a grand total of about
>>> three machines. We would love to have feedback from a broader
>>> audience, in terms of both trying out the tools and inspecting the
>>> code.
>>> We'll plan to release new patches at about 1-week intervals based on
>>> comments.
>>>
>>> Performance is quite good, and we intend to focus on this a bit more
>>> over the next few weeks, releasing updated patches as they are
>>> available. Bonnie results this morning are as follows (64-bit results
>>> compare against linux blkback+loopback file, Julian can follow up with
>>> loopback results for 32-bit later if anyone's interested):
>>>
>>> -------Sequential Output-------- ---Sequential Input-- --Random--
>>> -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
>>> Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
>>> 64-bit:
>>> xen0 4096 40115 93.4 41067 12.7 22757 1.2 32532 56.7 53724 0.4 121.4 0.0
>>> img-sp 4096 20291 86.0 38091 18.1 19939 8.2 30854 69.0 47779 4.2 95.3 0.4
>>> loop-sp 4096 33421 77.6 33663 13.1 18546 5.1 28606 59.2 46659 6.0 85.2 0.1
>>>
>>> 32-Bit:
>>> xen0 1024 33857 94.0 45804 9.0 23269 0.0 25825 52.0 55628 0 185.0 0.0
>>> img-sp 1448 32743 92.0 40703 8.0 23281 0.0 31139 75.0 56585 0 208.1 0.0
>>>
>>> The patch is against cset 0426:840f33e54054 -- but is unlikely to
>>> conflict with anything recent. You'll need libaio and libaio-devel
>>> on
>>> your build machine for the tools to compile.
>>>
>>>
>>> Blktap readme follows.)
>>>
>>> Thanks!
>>> a.
>>>
>>> ---
>>>
>>>
>>> Blktap Userspace Tools + Library
>>> ================================
>>>
>>> Andrew Warfield and Julian Chesterfield
>>> 16th June 2006
>>>
>>> {firstname.lastname}@cl.cam.ac.uk
>>>
>>> The blktap userspace toolkit provides a user-level disk I/O
>>> interface. The blktap mechanism involves a kernel driver that acts
>>> similarly to the existing Xen/Linux blkback driver, and a set of
>>> associated user-level libraries. Using these tools, blktap allows
>>> virtual block devices presented to VMs to be implemented in userspace
>>> and to be backed by raw partitions, files, network, etc.
>>>
>>> The key benefit of blktap is that it makes it easy and fast to write
>>> arbitrary block backends, and that these user-level backends actually
>>> perform very well. Specifically:
>>>
>>> - Metadata disk formats such as Copy-on-Write, encrypted disks,
>>> sparse
>>> formats and other compression features can be easily implemented.
>>> O_DIRECT and libaio allow high-performance implementation of even
>>> sparse image formats such as QCoW, while still preserving the safe
>>> ordering of metadata and data writes to ensure data integrity.
>>> (As opposed to, for instance, both the loopback driver and LVM snaps
>>> which both have very dangerous failure cases.)
>>>
>>> - Accessing file-based images from userspace avoids problems related
>>> to flushing dirty pages which are present in the Linux loopback
>>> driver. (Specifically, doing a large number of writes to an
>>> NFS-backed image don't result in the OOM killer going berserk.)
>>>
>>> - Per-disk handler processes enable easier userspace policing of
>>> block
>>> resources, and process-granularity QoS techniques (disk scheduling
>>> and related tools) may be trivially applied to block devices.
>>>
>>> - It's very easy to take advantage of userspace facilities such as
>>> networking libraries, compression utilities, peer-to-peer
>>> file-sharing systems and so on to build more complex block backends.
>>>
>>> - Crashes are contained -- incremental development/debugging is very
>>> fast.
>>>
>>> - All block data is forwarded in a zero-copy fashion, allowing for
>>> low-overhead userspace implementations.
>>>
>>> How it works (in one paragraph):
>>>
>>> Working in conjunction with the kernel blktap driver, all disk I/O
>>> requests from VMs are passed to the userspace daemon (using a shared
>>> memory interface) through a character device. Each active disk is
>>> mapped to an individual device node, allowing per-disk processes to
>>> implement individual block devices where desired. The userspace
>>> drivers are implemented using asynchronous (Linux libaio),
>>> O_DIRECT-based calls to preserve the unbuffered, batched and
>>> asynchronous request dispatch achieved with the existing blockback
>>> code. We provide a simple, asynchronous virtual disk interface that
>>> makes it quite easy to add new disk implementations.
>>>
>>>
>>> As of June 2006 the current supported disk formats are:
>>>
>>> - Raw Images (both on partitions and in image files)
>>> - File-backed Qcow disks (sparse qcow overlay on a raw
>>> image/partition).
>>> - Standalone sparse Qcow disks (sparse disks, not backed by a parent
>>> image).
>>> - Fast shareable RAM disk between VMs (requires some form of
>>> cluster-based
>>> filesystem support e.g. OCFS2 in the guest kernel)
>>> - Some VMDK images - your mileage may vary
>>>
>>> Raw and QCow images have asynchronous backends and so should perform
>>> fairly well. VMDK is based directly on the qemu vmdk driver, which
>>> is
>>> synchronous (a.k.a. slow).
>>>
>>> The qcow backends support existing qcow disks. There are also a set
>>> of tools to generate and convert qcow images. With these tools (and
>>> driver support), we maintain the qcow file format but adjust
>>> parameters for higher performance with Xen -- using a larger segment
>>> size (4096 instead of 512) and more coarsely allocating metadata
>>> regions. We are continuing to improve this work and expect qcow
>>> performance to improve a great deal over the next few weeks.
>>>
>>> Build and Installation Instructions
>>> ===================================
>>>
>>> You will need libaio >= 0.3.104 on your target system to build the
>>> tools (if you are installing RPMs, this means libaio and
>>> libaio-devel).
>>>
>>> Make sure to configure the blktap backend driver in your dom0 kernel. It
>>> will cooperate fine with the existing backend driver, so you can
>>> experiment with tap disks without breaking existing VM configs.
>>>
>>> To build the tools separately, "make && make install" in
>>> tools/blktap_user.
>>>
>>>
>>> Using the Tools
>>> ===============
>>>
>>> Prepare the image for booting. For qcow files use the qcow utilities
>>> installed earlier. e.g. qcow-create generates a blank standalone
>>> image
>>> or a file-backed CoW image. img2qcow takes an existing image or
>>> partition and creates a sparse, standalone qcow-based file.
>>>
>>> Start the userspace disk agent either on system boot (e.g. via an
>>> init
>>> script) or manually => 'blktapctrl'
>>>
>>> Customise the VM config file to use the 'tap' handler, followed by
>>> the
>>> driver type. e.g. for a raw image such as a file or partition:
>>>
>>> disk = ['tap:aio:<FILENAME>,sda1,w']
>>>
>>> e.g. for a qcow image:
>>>
>>> disk = ['tap:qcow:<FILENAME>,sda1,w']
>>> ---------------------------------------------------------------------
>>> ---
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
2006-06-19 21:42 ` Julian Chesterfield
@ 2006-06-19 21:56 ` Anthony Liguori
0 siblings, 0 replies; 44+ messages in thread
From: Anthony Liguori @ 2006-06-19 21:56 UTC (permalink / raw)
To: Julian Chesterfield; +Cc: xen-devel
Julian Chesterfield wrote:
>>
>> On 19/6/06 8:15 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>>
>>> Couple general comments on the code:
>>>
>>> Please don't introduce more (ab)uses of /proc. Sure it's just for
>>> debugging but there's no reason to not make that sysfs.
>>>
>>> I'm not an expert here, but the nopage handlers that I've seen return
>>> NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
>>>
>>> I think it's better to use C99 initialization than GCC:
>>>
>>> owner: ..., => .owner = ...,
>>>
>>> Some of the indenting is a bit off from Linux CodingStyle. Stuff like
>>> if( => if ( and some random spaces after an (.
>>>
>>> There's some code commented out with C++ comments too.
>>>
>>> What's the significance of /**BLKTAP**/ and /**TAPEND**/?
>>>
>>> I'm a little surprised to see these conversion tools too. Wouldn't it
>>> be easier to just add some parameters to qemu-img?
>
> Thanks for the comments anthony. When we initially played with qcow
> images it was easier to knock-up our own frontend to the plugins for
> converting between the different image types and testing features like
> image sparseness. We added an optimisation feature in the xen qcow
> plugin which would allocate full extents for non backing file based
> images as well as the asynchronous callback architecture to enable
> request batching for AIO.
>
> We could certainly adapt qemu-img to use these and other features. Not
> sure what the best approach for keeping the toolsets in synch between
> the 2 projects would be though.
It may be worth just bringing up the changes on qemu-devel. I know why
you'd want to change the cluster size (it's a pain to work with clusters
< block size). I saw another comment about making metadata more
coarse. Can you clarify the reasons for that?
I can't imagine there would be much pushback on changing the default
cluster size in qemu-img from 512 to 4096. Most OSes are going to
write in that granularity anyway, I imagine :-)
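The arithmetic behind the cluster-size point can be sketched as follows. This is an illustrative Python helper with a hypothetical name (clusters_touched), not something taken from qemu-img or the blktap tools; it simply counts how many image clusters a single write overlaps.

```python
def clusters_touched(offset, length, cluster_size):
    """Return how many image clusters a write of `length` bytes starting at
    byte `offset` overlaps."""
    if length <= 0:
        return 0
    first = offset // cluster_size
    last = (offset + length - 1) // cluster_size
    return last - first + 1
```

With these definitions, an aligned 4096-byte write overlaps eight clusters when the cluster size is 512 and only one when it is 4096, so the larger cluster size means fewer per-cluster lookups or allocations for the same guest write.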
Regards,
Anthony Liguori
>
> Thanks,
> Julian Chesterfield
^ permalink raw reply [flat|nested] 44+ messages in thread
* RE: [PATCH] Blktap: Userspace file-based image support.(RFC)
@ 2006-06-20 11:07 Ian Pratt
2006-06-20 21:10 ` Dan Smith
0 siblings, 1 reply; 44+ messages in thread
From: Ian Pratt @ 2006-06-20 11:07 UTC (permalink / raw)
To: Dan Smith, Andrew Warfield; +Cc: NAHieu, Xen Developers, Julian Chesterfield
> AW> This should be fixable though. I'm also not sure how carefully
> AW> dm-u watches block completion responses to ensure safety of
> AW> metadata updates relative to data writes. This too should be
> AW> fixable -- i just don't know if the user-level tools can currently
> AW> request completion notifications on requests that they've
> AW> processed.
>
> So, right now, we're a little optimistic about metadata writing. It
> will be relatively easy to hijack the callback routine for the disk
> request (a technique which is heavily used in the rest of the block
> layer) to get a completion trigger. We can then notify userspace for
> the metadata write and then trigger the original callback routine for
> completion.
Yep, dm-userspace is certainly going to need to have a way of
intercepting IO completions and then choosing when it's actually going
to propagate the completion to the backend. That's quite a big change to
the current code (incidentally, the dm-snap code is pretty shocking in
this respect too).
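The completion-interception idea discussed here can be sketched in a few lines. This is a conceptual Python illustration only; WriteRequest, intercept_completion and commit_metadata are hypothetical names, and neither dm-userspace nor the kernel block layer exposes an interface like this. The point is just the ordering: the original completion is held back until the metadata update has finished.

```python
class WriteRequest:
    """A data write plus the callback that acknowledges it to the backend."""

    def __init__(self, sector, on_complete):
        self.sector = sector
        self.on_complete = on_complete


def intercept_completion(request, commit_metadata):
    """Replace the request's completion callback so the original one only
    runs after commit_metadata(sector, done) has called done()."""
    original = request.on_complete

    def data_written(req):
        # The data write has finished; hold the acknowledgement until the
        # metadata update that makes the data reachable is also complete.
        commit_metadata(req.sector, lambda: original(req))

    request.on_complete = data_written
    return request
```

Because commit_metadata receives a done callback rather than returning a value, the metadata write may itself complete asynchronously, and the backend never sees the acknowledgement before it does.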
> AW> A benefit to the dm-user patch is that it is more of a linux
> AW> approach than a xen+linux approach. Dm-user will be generally
> AW> useful in the linux tree
>
> Right, this is a huge advantage, I think. Being able to mount images
> as if they were disks will be quite helpful. Another benefit is the
> ability to easily convert between formats. Converting a vmdk to a
> qcow is as easy as mounting both and doing a "cp -R" between them.
I think the blktap code should definitely export a kernel device at the
top so that the same property holds. Should be easy to add.
> AW> which has some bad failure characteristics which can result in
> AW> both data being acknowledged as written even though it hasn't
> AW> been, and the OOM killer going insane. I think some fixes to loop
> AW> probably need to be applied in the near future given how much
> AW> people are generally depending on the code with VMs.
>
> Can you elaborate about what specifically is wrong with the loop
> driver?
It doesn't bypass the buffer cache (so all bets are off for data
integrity) and can end up consuming all of dom0 memory with dirty
buffers -- just create a few loop devices and do a few parallel dd's to
them and watch the oomkiller go on the rampage. It's even worse if the
filesystem the file lives on is slow e.g. NFS.
> AW> Julian and I have talked about extending the tap driver to combine
> AW> it with blkback and allow block address translation without access
> AW> to request contents.
>
> Since the kernel already has a block address translation solution
> (i.e. device-mapper), is there a benefit to adding another
> xen-specific one?
I think blktap and dm-userspace are quite complementary, so I don't see
a problem with having them both in the tree. Right now, blktap looks to
be the more mature solution, but dm-userspace could catch up. Blktap
will obviously still be preferable when it's necessary to actually touch
the data.
Ian
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-20 11:07 [PATCH] Blktap: Userspace file-based image support.(RFC) Ian Pratt
@ 2006-06-20 21:10 ` Dan Smith
2006-06-21 14:45 ` Anthony Liguori
2006-06-30 13:41 ` Stephen C. Tweedie
0 siblings, 2 replies; 44+ messages in thread
From: Dan Smith @ 2006-06-20 21:10 UTC (permalink / raw)
To: Ian Pratt; +Cc: Andrew Warfield, NAHieu, Xen Developers, Julian Chesterfield
[-- Attachment #1.1: Type: text/plain, Size: 1781 bytes --]
IP> Yep, dm-userspace is certainly going to need to have a way of
IP> intercepting IO completions and then choosing when it's actually
IP> going to propagate the completion to the backend. That's quite a
IP> big change to the current code (incidentally, the dm-snap code is
IP> pretty shocking in this respect too).
I'm not sure if I agree that it will be a big change. It's going to
require keeping track of a few additional states for each remap, as
well as a couple more message types. Hijacking the callback function
of each request is done quite a bit in the rest of the block
subsystem. My testing shows that communication between kernel and
userspace for the additional handshaking will not add significant
additional overhead. Definitely some work, but not a huge change,
IMHO.
IP> It doesn't bypass the buffer cache (so all bets are off for data
IP> integrity) and can end up consuming all of dom0 memory with dirty
IP> buffers -- just create a few loop devices and do a few parallel
IP> dd's to them and watch the oomkiller go on the rampage. It's even
IP> worse if the filesystem the file lives on is slow e.g. NFS.
Ok, it seems like this should be addressed in the upstream loop
driver. I imagine quite a few people are depending on the loop driver
right now, expecting it to maintain data integrity.
Could the loop driver make use of the routines that do direct IO
instead of the normal routines to solve this when it's an issue?
This brings me to another question: Will people really be using
file-based images for their VMs? It seems to me that the performance
of using a block device overshadows the convenience of a file image.
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
[-- Attachment #1.2: Type: application/pgp-signature, Size: 190 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-20 21:10 ` Dan Smith
@ 2006-06-21 14:45 ` Anthony Liguori
2006-06-30 13:41 ` Stephen C. Tweedie
1 sibling, 0 replies; 44+ messages in thread
From: Anthony Liguori @ 2006-06-21 14:45 UTC (permalink / raw)
To: xen-devel
On Tue, 20 Jun 2006 14:10:30 -0700, Dan Smith wrote:
> IP> It doesn't bypass the buffer cache (so all bets are off for data
> IP> integrity) and can end up consuming all of dom0 memory with dirty
> IP> buffers -- just create a few loop devices and do a few parallel
> IP> dd's to them and watch the oomkiller go on the rampage. It's even
> IP> worse if the filesystem the file lives on is slow e.g. NFS.
>
> Ok, it seems like this should be addressed in the upstream loop
> driver. I imagine quite a few people are depending on the loop driver
> right now, expecting it to maintain data integrity.
It's probably worth spending some cycles trying to improve the loop driver
itself.
> Could the loop driver make use of the routines that do direct IO
> instead of the normal routines to solve this when it's an issue?
It appears that the loop driver is split between two threads using a
producer/consumer queue. The main thread gets the bio requests and queues
them for the consumer thread.
The consumer thread can do a number of things depending on properties of
the fd. It may use address ops, use fops->write, or do a transform of the
data. It should be possible to, if the fd is opened with O_DIRECT and
fops has a valid aio_{read,write}, use proper aio calls to queue the
requests. You'll probably have to get clever about how the thread blocks
(has to wake up either on the queue mutex or when an aio request completes).
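The two-thread structure described above can be sketched as a plain producer/consumer queue. This is an illustrative Python model with hypothetical names (run_loop_worker, handle), not the loop driver's code, and it deliberately leaves out the AIO part; it only shows the main thread queuing requests while a single consumer thread services them in order.

```python
import queue
import threading


def run_loop_worker(requests, handle):
    """Queue requests from the calling thread and process them in order on a
    single consumer thread, returning the results."""
    pending = queue.Queue()
    results = []

    def consumer():
        while True:
            item = pending.get()  # blocks until a request is queued
            if item is None:  # sentinel: no more requests
                break
            results.append(handle(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for request in requests:  # producer side: enqueue and move on
        pending.put(request)
    pending.put(None)
    worker.join()
    return results
```

In this model the consumer only ever blocks on the queue. The difficulty mentioned above is that an AIO-based consumer would need to wait on two things at once: new queue entries and completions of requests already in flight.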
I suspect that this will give a pretty noticeable performance improvement
in the loop driver (especially on SCSI/SATA storage).
The loop driver still has issues though. It cannot grow and it has a
pretty odd hardcoded limit (256 devices) which quickly becomes a
scalability issue.
The former problem could possibly be addressed by having a parameter for
SET_STATUS that lets you set the size of the device to be greater than
the size of the underlying file. If a bio comes for an offset greater
than the underlying file, it would have to be smart enough to ftruncate
the file. The error handling is a bit tough (you'll have to make sure
that if ftruncate fails, you fail the read/write--extra points if the
failure is temporary such that later on if space is freed up you succeed).
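A userspace analogue of the grow-on-demand behaviour might look like the sketch below. It is written in Python against an ordinary file descriptor; write_growing and virtual_size are hypothetical names, and nothing here corresponds to an existing loop driver interface. It shows the two rules suggested above: extend the backing file when a write lands past its current end, and fail only the individual request if that is not possible.

```python
import os


def write_growing(fd, offset, data, virtual_size):
    """Write data at offset in a backing file whose advertised size
    (virtual_size) may exceed its current length, extending it on demand.
    Returns True on success and False if the request must be failed."""
    end = offset + len(data)
    if end > virtual_size:
        return False  # beyond the advertised device size
    try:
        if end > os.fstat(fd).st_size:
            os.ftruncate(fd, end)  # grow the backing file; may raise OSError
        written = os.pwrite(fd, data, offset)
    except OSError:
        # Fail only this request; a later attempt may succeed once space
        # has been freed.
        return False
    return written == len(data)
```

Because a failure is reported per request and no state is latched, a temporary out-of-space condition does not permanently break the device in this model.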
The hardcoded limit is a somewhat bigger problem. The driver would likely
need a bit of reworking. Since 256 is the limit based on minor number
allocation, you would have to either get some more device number space for
it or just have the ability to allocate dynamic numbers and rely on
udev/hotplug for folks that want more than 256.
> This brings me to another question: Will people really be using
> file-based images for their VMs? It seems to me that the performance
> of using a block device overshadows the convenience of a file image.
If the performance of the loop driver could be better (and fundamentally,
there's no reason it can't be pretty good), then I see no reason why using
file images wouldn't be the most common approach.
Files are quite a lot easier to manage than partitions. Of course, I see
no reason why someone couldn't write a FUSE front-end to LVM :-)
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-20 21:10 ` Dan Smith
2006-06-21 14:45 ` Anthony Liguori
@ 2006-06-30 13:41 ` Stephen C. Tweedie
2006-06-30 14:17 ` Dan Smith
1 sibling, 1 reply; 44+ messages in thread
From: Stephen C. Tweedie @ 2006-06-30 13:41 UTC (permalink / raw)
To: Dan Smith
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield
Hi,
On Tue, 2006-06-20 at 14:10 -0700, Dan Smith wrote:
> This brings me to another question: Will people really be using
> file-based images for their VMs? It seems to me that the performance
> of using a block device overshadows the convenience of a file image.
It depends on the environment. To support cold/live migration, having
network-attached storage will be required; and file images on NFS would
be an extremely simple-to-setup way to achieve that.
Personally I use LVM block devices almost exclusively when doing single-
node testing, but NFS files are the easiest way I've got to share those
images.
--Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 13:41 ` Stephen C. Tweedie
@ 2006-06-30 14:17 ` Dan Smith
2006-06-30 19:37 ` Stephen C. Tweedie
0 siblings, 1 reply; 44+ messages in thread
From: Dan Smith @ 2006-06-30 14:17 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Julian Chesterfield, Ian Pratt, NAHieu,
xen-devel@lists.xensource.com, Andrew Warfield
[-- Attachment #1.1: Type: text/plain, Size: 785 bytes --]
SCT> It depends on the environment. To support cold/live migration,
SCT> having network-attached storage will be required; and file images
SCT> on NFS would be an extremely simple-to-setup way to achieve that.
Ah, but block devices can play too. With dm-userspace, we could
migrate a domain from one machine to another, faulting the needed
blocks from its block devices on-demand, and copying the rest in the
background. This would give us a peer-to-peer setup where block
devices could slowly move from machine to machine, following its
owner. Once your block was accessed (or copied in the background),
it's local and fast. A peer-to-peer NAS setup.
What do you think?
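The on-demand scheme proposed here can be modelled in a few lines. The Python sketch below is purely illustrative: MigratingDisk and fetch_remote are hypothetical names, a dictionary stands in for local storage, and none of it reflects dm-userspace's actual interface. It shows blocks being faulted in from the source host on first access while the remainder is copied in the background.

```python
class MigratingDisk:
    """Block store on the destination host that pulls blocks from the source
    host the first time they are needed."""

    def __init__(self, fetch_remote, block_count):
        self.fetch_remote = fetch_remote  # callable: block number -> bytes
        self.block_count = block_count
        self.local = {}  # blocks already copied to this host

    def read(self, block):
        if block not in self.local:
            # "Fault" the block in from the source host on first access.
            self.local[block] = self.fetch_remote(block)
        return self.local[block]

    def write(self, block, data):
        # A full-block write supersedes whatever the source still holds.
        self.local[block] = data

    def copy_in_background(self, limit):
        """Copy up to `limit` not-yet-local blocks; return how many remain."""
        missing = [b for b in range(self.block_count) if b not in self.local]
        for block in missing[:limit]:
            self.local[block] = self.fetch_remote(block)
        return max(len(missing) - limit, 0)
```

Once copy_in_background returns 0, every block is local and the destination no longer needs the source host for this disk.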
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
[-- Attachment #1.2: Type: application/pgp-signature, Size: 190 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 14:17 ` Dan Smith
@ 2006-06-30 19:37 ` Stephen C. Tweedie
2006-06-30 20:06 ` Dan Smith
2006-07-03 12:02 ` Harry Butterworth
0 siblings, 2 replies; 44+ messages in thread
From: Stephen C. Tweedie @ 2006-06-30 19:37 UTC (permalink / raw)
To: Dan Smith
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield
Hi,
On Fri, 2006-06-30 at 07:17 -0700, Dan Smith wrote:
> SCT> It depends on the environment. To support cold/live migration,
> SCT> having network-attached storage will be required; and file images
> SCT> on NFS would be an extremely simple-to-setup way to achieve that.
>
> Ah, but block devices can play too. With dm-userspace, we could
> migrate a domain from one machine to another, faulting the needed
> blocks from its block devices on-demand, and copying the rest in the
> background. This would give us a peer-to-peer setup where block
> devices could slowly move from machine to machine, following its
> owner. Once your block was accessed (or copied in the background),
> it's local and fast. A peer-to-peer NAS setup.
Could be useful in places, but it introduces a number of new
dependencies. The destination host now relies on the source host for
data, so if the source crashes, you crash the destination too; and if
you power-cycle, how do you track where in your cluster the latest copy
of the block device is?
A true NAS solution isolates the Xen hosts from these problems.
--Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 19:37 ` Stephen C. Tweedie
@ 2006-06-30 20:06 ` Dan Smith
2006-06-30 22:15 ` Jerone Young
2006-07-03 12:02 ` Harry Butterworth
1 sibling, 1 reply; 44+ messages in thread
From: Dan Smith @ 2006-06-30 20:06 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Julian Chesterfield, Ian Pratt, NAHieu,
xen-devel@lists.xensource.com, Andrew Warfield
[-- Attachment #1.1: Type: text/plain, Size: 1198 bytes --]
ST> Could be useful in places, but it introduces a number of new
ST> dependencies.
I was mostly commenting on making block-device-backed domains as easy
to migrate as (or easier than) file-backed domains. Being able to use
local LVMs but still migrate easily without a NAS would be cool, I
think, where appropriate.
ST> The destination host now relies on the source host for data, so if
ST> the source crashes, you crash the destination too;
Sure, which a NAS solves, assuming the NAS is stable.
ST> and if you power-cycle, how do you track where in your cluster the
ST> latest copy of the block device is?
I think that keeping metadata on that and invalidating blocks when you
pull them off the source host could be done without too much trouble.
Plus, I'm not talking about multiple-writers, so I think you could
ignore a lot of the normal locking issues.
ST> A true NAS solution isolates the Xen hosts from these problems.
Absolutely. So what's the benefit of having image files on NFS (as
you mentioned) if you can use nbd or iSCSI?
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
[-- Attachment #1.2: Type: application/pgp-signature, Size: 190 bytes --]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 20:06 ` Dan Smith
@ 2006-06-30 22:15 ` Jerone Young
2006-07-01 0:36 ` Mark Williamson
2006-07-03 14:52 ` Stephen C. Tweedie
0 siblings, 2 replies; 44+ messages in thread
From: Jerone Young @ 2006-06-30 22:15 UTC (permalink / raw)
To: Dan Smith
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield
On Fri, 2006-06-30 at 13:06 -0700, Dan Smith wrote:
> ST> Could be useful in places, but it introduces a number of new
> ST> dependencies.
>
> I was mostly commenting about making migrating block devices as easy
> as (or easier) than file-backed domains, especially from a migration
> point of view. Being able to use local LVMs but still migrate easily
> without a NAS would be cool, I think, where appropriate.
I would ask how exactly you propose to do this. Today at least,
file-backed domains seem to be the only real-world way of doing
migrations. Migrating block devices seems a little hairy (what if the
other machine is already using sda, for example?), and may not be all
that practical to do.
>
> ST> The destination host now relies on the source host for data, so if
> ST> the source crashes, you crash the destination too;
>
> Sure, which a NAS solves, assuming the NAS is stable.
>
> ST> and if you power-cycle, how do you track where in your cluster the
> ST> latest copy of the block device is?
>
> I think that keeping metadata on that and invalidating blocks when you
> pull them off the source host could be done without too much trouble.
> Plus, I'm not talking about multiple-writers, so I think you could
> ignore a lot of the normal locking issues.
>
> ST> A true NAS solution isolates the Xen hosts from these problems.
>
> Absolutely. So what's the benefit of having image files on NFS (as
> you mentioned) if you can use nbd or iSCSI?
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 22:15 ` Jerone Young
@ 2006-07-01 0:36 ` Mark Williamson
2006-07-01 14:22 ` Dan Smith
2006-07-03 14:52 ` Stephen C. Tweedie
1 sibling, 1 reply; 44+ messages in thread
From: Mark Williamson @ 2006-07-01 0:36 UTC (permalink / raw)
To: xen-devel
Cc: Ian Pratt, Jerone Young, Julian Chesterfield, NAHieu,
Andrew Warfield, Dan Smith
> How exactly do you propose to do this? Today, at least, file-backed
> domains seem to be the only real-world way of doing migrations.
> Migrating block devices seems a little hairy (what if the other machine
> is already using sda, for example?), and may not be all that practical
> to do.
Well, it doesn't really matter what the destination dom0 is using as block
devices provided the node name doesn't have to stay the same on the
destination machine - and if you use the hotplug scripts to set up block
devices then it doesn't.
I think being able to demand-fault virtual disks across would be quite cool
(with the copy eventually completing in the background, eliminating the
origin as a point of failure). For smaller or more ad-hoc setups this could
be quite useful, especially if you had a daemon trickle updates across the
network continuously at low bandwidth to minimise the diffs during migration.
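To make the demand-fault idea concrete, here is a minimal sketch of how the destination side might behave. It is an illustration only, not code from blktap or dm-userspace: the class name, the `fetch_block` callable (a stand-in for whatever transport pulls a block from the source host) and the fixed block size are all assumptions, and it only handles whole-block writes.

```python
import threading

BLOCK_SIZE = 4096  # assumed block size for this sketch


class LazyMigratedDisk:
    """Destination-side view of a disk whose blocks still live on the source.

    `fetch_block` is a hypothetical callable (block number -> bytes) standing
    in for the network transport that pulls one block from the source host.
    """

    def __init__(self, num_blocks, fetch_block):
        self.num_blocks = num_blocks
        self.fetch_block = fetch_block
        self.local = {}  # block number -> bytes already held locally
        self.lock = threading.Lock()

    def read(self, block):
        # Demand fault: pull the block across on first access only.
        with self.lock:
            if block not in self.local:
                self.local[block] = self.fetch_block(block)
            return self.local[block]

    def write(self, block, data):
        # A whole-block write makes the source copy stale, so no fetch is needed.
        with self.lock:
            self.local[block] = data

    def copy_remaining(self):
        # Background pass: fetch every block that has not been touched yet.
        for block in range(self.num_blocks):
            with self.lock:
                if block not in self.local:
                    self.local[block] = self.fetch_block(block)

    def is_complete(self):
        # Once this is True the source host is no longer a point of failure.
        with self.lock:
            return len(self.local) == self.num_blocks
```

In a real setup `copy_remaining` would presumably run in a separate thread at throttled bandwidth; the lock here simply serialises it against guest I/O.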
Cheers,
Mark
>
> > ST> The destination host now relies on the source host for data, so if
> > ST> the source crashes, you crash the destination too;
> >
> > Sure, which a NAS solves, assuming the NAS is stable.
> >
> > ST> and if you power-cycle, how do you track where in your cluster the
> > ST> latest copy of the block device is?
> >
> > I think that keeping metadata on that and invalidating blocks when you
> > pull them off the source host could be done without too much trouble.
> > Plus, I'm not talking about multiple-writers, so I think you could
> > ignore a lot of the normal locking issues.
> >
> > ST> A true NAS solution isolates the Xen hosts from these problems.
> >
> > Absolutely. So what's the benefit of having image files on NFS (as
> > you mentioned) if you can use nbd or iSCSI?
--
Dave: Just a question. What use is a unicycle with no seat? And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-01 0:36 ` Mark Williamson
@ 2006-07-01 14:22 ` Dan Smith
2006-07-03 11:00 ` Mark Williamson
0 siblings, 1 reply; 44+ messages in thread
From: Dan Smith @ 2006-07-01 14:22 UTC (permalink / raw)
To: Mark Williamson
Cc: Ian Pratt, Jerone Young, xen-devel, Julian Chesterfield, NAHieu,
Andrew Warfield
MW> Well, it doesn't really matter what the destination dom0 is using
MW> as block devices provided the node name doesn't have to stay the
MW> same on the destination machine - and if you use the hotplug
MW> scripts to set up block devices then it doesn't.
Right, exactly.
MW> I think being able to demand-fault virtual disks across would be
MW> quite cool (with the copy eventually completing in the background,
MW> eliminating the origin as a point of failure). For smaller or
MW> more ad-hoc setups this could be quite useful, especially if you
MW> had a daemon trickle updates across the network continuously at
MW> low bandwidth to minimise the diffs during migration.
This is the exact situation I had in mind. I think it would be
extremely cool to have a peer-to-peer block migration mechanism, which
would allow the convenience of files for migration and the speed of
block devices. You could even have a method for migrating block
images between machines, independent of a migration. Imagine
something like:
lvmcp /dev/vols/foo othermachine:/dev/vols
I think that would be neat. It's rather straightforward too, I
think.
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-01 14:22 ` Dan Smith
@ 2006-07-03 11:00 ` Mark Williamson
0 siblings, 0 replies; 44+ messages in thread
From: Mark Williamson @ 2006-07-03 11:00 UTC (permalink / raw)
To: Dan Smith
Cc: Ian Pratt, Jerone Young, xen-devel, Julian Chesterfield, NAHieu,
Andrew Warfield
> This is the exact situation I had in mind. I think it would be
> extremely cool to have a peer-to-peer block migration mechanism, which
> would allow the convenience of files for migration and the speed of
> block devices. You could even have a method for migrating block
> images between machines, independent of a migration. Imagine
> something like:
>
> lvmcp /dev/vols/foo othermachine:/dev/vols
Yes, that could even be nice and generic to other use cases, which is always a
good sign and a good way of getting extra developers.
> I think that would be neat. It's rather straightforward too, I
> think.
Another thing I've always fancied is the ability to keep a virtual machine's
memory and disk images on two machines in close-sync by continuously
trickling diffs... This would be used in cases (e.g. desktop migration to a
mobile device, emergency server relocation) where you do have warning that a
migration is required but you want really low latency (e.g. before your UPS
runs out, so you can pick up your laptop and run to a meeting, etc, etc).
Cheers,
Mark
--
Dave: Just a question. What use is a unicycle with no seat? And no pedals!
Mark: To answer a question with a question: What use is a skateboard?
Dave: Skateboards have wheels.
Mark: My wheel has a wheel!
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 22:15 ` Jerone Young
2006-07-01 0:36 ` Mark Williamson
@ 2006-07-03 14:52 ` Stephen C. Tweedie
1 sibling, 0 replies; 44+ messages in thread
From: Stephen C. Tweedie @ 2006-07-03 14:52 UTC (permalink / raw)
To: Jerone Young
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield, Dan Smith
Hi,
On Fri, 2006-06-30 at 17:15 -0500, Jerone Young wrote:
> How exactly do you propose to do this? Today, at least, file-backed
> domains seem to be the only real-world way of doing migrations.
> Migrating block devices seems a little hairy (what if the other machine
> is already using sda, for example?), and may not be all that practical
> to do.
The practicality of it is certainly a concern; but for businesses with
SANs already deployed that's less of an issue. The issue of multiple
users exists for files just as much as for devices, though.
--Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-06-30 19:37 ` Stephen C. Tweedie
2006-06-30 20:06 ` Dan Smith
@ 2006-07-03 12:02 ` Harry Butterworth
2006-07-03 14:56 ` Stephen C. Tweedie
1 sibling, 1 reply; 44+ messages in thread
From: Harry Butterworth @ 2006-07-03 12:02 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield, Dan Smith
On Fri, 2006-06-30 at 20:37 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, 2006-06-30 at 07:17 -0700, Dan Smith wrote:
> > SCT> It depends on the environment. To support cold/live migration,
> > SCT> having network-attached storage will be required; and file images
> > SCT> on NFS would be an extremely simple-to-setup way to achieve that.
> >
> > Ah, but block devices can play too. With dm-userspace, we could
> > migrate a domain from one machine to another, faulting the needed
> > blocks from its block devices on-demand, and copying the rest in the
> > background. This would give us a peer-to-peer setup where block
> > devices could slowly move from machine to machine, following its
> > owner. Once your block was accessed (or copied in the background),
> > it's local and fast. A peer-to-peer NAS setup.
>
> Could be useful in places, but it introduces a number of new
> dependencies. The destination host now relies on the source host for
> data, so if the source crashes, you crash the destination too; and if
> you power-cycle, how do you track where in your cluster the latest copy
> of the block device is?
It's easy. You run code to coordinate the mapping inside a
fault-tolerant virtual machine which persists across node failures and
cluster power cycles.
>
> A true NAS solution isolates the Xen hosts from these problems.
>
> --Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-03 12:02 ` Harry Butterworth
@ 2006-07-03 14:56 ` Stephen C. Tweedie
2006-07-03 15:40 ` Harry Butterworth
0 siblings, 1 reply; 44+ messages in thread
From: Stephen C. Tweedie @ 2006-07-03 14:56 UTC (permalink / raw)
To: Harry Butterworth
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield, Dan Smith
Hi,
On Mon, 2006-07-03 at 13:02 +0100, Harry Butterworth wrote:
> > Could be useful in places, but it introduces a number of new
> > dependencies. The destination host now relies on the source host for
> > data, so if the source crashes, you crash the destination too; and if
> > you power-cycle, how do you track where in your cluster the latest copy
> > of the block device is?
>
> It's easy. You run code to coordinate the mapping inside a
> fault-tolerant virtual machine which persists across node failures and
> cluster power cycles.
Right, you just made the point I was making --- you've introduced a
dependency on a new hypothetical fault-tolerant, cluster-aware device
layer. :-)
In principle, with the right software, and configuring your entire
infrastructure from scratch, this sort of device-based mechanism may
work very well.
But today, with my existing storage already set up, the only way I can
easily add Xen migration capabilities to my network, taking advantage of
the existing storage server I have, is to use NFS from that server. I
just don't have any block-level SAN configured. *That* is why NFS is
important --- not because it's necessarily the better choice, but that
it's one of the configurations we can expect users to have already.
Conversely, for users with SANs already, whether running over iSCSI or
FC or whatever, block-level migration will be needed. It's a matter of
being able to use existing solutions rather than mandating a new storage
configuration.
--Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-03 14:56 ` Stephen C. Tweedie
@ 2006-07-03 15:40 ` Harry Butterworth
2006-07-04 19:39 ` Andrew Warfield
0 siblings, 1 reply; 44+ messages in thread
From: Harry Butterworth @ 2006-07-03 15:40 UTC (permalink / raw)
To: Stephen C. Tweedie
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Andrew Warfield, Dan Smith
On Mon, 2006-07-03 at 15:56 +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, 2006-07-03 at 13:02 +0100, Harry Butterworth wrote:
>
> > > Could be useful in places, but it introduces a number of new
> > > dependencies. The destination host now relies on the source host for
> > > data, so if the source crashes, you crash the destination too; and if
> > > you power-cycle, how do you track where in your cluster the latest copy
> > > of the block device is?
> >
> > It's easy. You run code to coordinate the mapping inside a
> > fault-tolerant virtual machine which persists across node failures and
> > cluster power cycles.
>
> Right, you just made the point I was making --- you've introduced
> dependency on a new hypothetical fault-tolerant, cluster-aware device
> layer. :-)
Yes, well I said we were going to need one of these about a year and a
half ago. We should really have had it finished by now ;-P
>
> In principle, with the right software, and configuring your entire
> infrastructure from scratch, this sort of device-based mechanism may
> work very well.
Yes. It does. Here's one we prepared earlier:
http://www-03.ibm.com/press/us/en/pressrelease/19705.wss
> But today, with my existing storage already set up, the only way I can
> easily add Xen migration capabilities to my network, taking advantage of
> the existing storage server I have, is to use NFS from that server. I
> just don't have any block-level SAN configured. *That* is why NFS is
> important --- not because it's necessarily the better choice, but that
> it's one of the configurations we can expect users to have already.
>
> Conversely, for users with SANs already, whether running over iSCSI or
> FC or whatever, block-level migration will be needed. It's a matter of
> being able to use existing solutions rather than mandating a new storage
> configuration.
I agree that it's generally most important to have solutions that work
now. I'm just taking an opportunity to get people thinking about how to
solve the kind of problems exemplified by the block device migration
above; of which there are quite a few other examples in Xen.
>
> --Stephen
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-03 15:40 ` Harry Butterworth
@ 2006-07-04 19:39 ` Andrew Warfield
2006-07-05 0:25 ` Dan Smith
2006-07-05 1:40 ` Harry Butterworth
0 siblings, 2 replies; 44+ messages in thread
From: Andrew Warfield @ 2006-07-04 19:39 UTC (permalink / raw)
To: Harry Butterworth
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Dan Smith
(Reordering quotes from these last two replies:)
From Stephen:
> > Conversely, for users with SANs already, whether running over iSCSI or
> > FC or whatever, block-level migration will be needed. It's a matter of
> > being able to use existing solutions rather than mandating a new storage
> > configuration.
If you have location transparency between the VM and the storage then
NFS and SANs should both work just fine without block migration.
Aside from some minor reconfig in dom0 as part of the movement, I
don't see why you think it's going to be needed here -- I've done this
with both GNBD and iSCSI just fine. That is, at least, if I'm reading
"block-level" migration to mean "copying the blocks over to the new
physical host", which is how I took Dan to mean it initially.
Now, in situations where the disk is fate-sharing with the CPU that
the VM is running on (e.g. you are using a local disk and want to
migrate VMs to turn the physical machine off for service), then it
seems like some form of block migration is obviously required.
Something along the lines of DRBD would seem to do a good job of
mirroring the disk to a second location in advance of migrating.
I don't think that I see the immediate benefit of the lazy (migrate
and fault blocks across on demand) block migration. It doubles your
exposure to failure (at least) and adds overhead. The only possible
example I can think of is to very temporarily offload a VM that's gone
heavily CPU bound onto an unloaded host. Is there a more obviously
useful situation that I'm missing?
> > In principle, with the right software, and configuring your entire
> > infrastructure from scratch, this sort of device-based mechanism may
> > work very well.
>
> Yes. It does. Here's one we prepared earlier:
> http://www-03.ibm.com/press/us/en/pressrelease/19705.wss
I rather doubt that anyone who happens to have purchased SVC as an
image store is terribly concerned about the ability to lazily copy VM
images from one local disk to another.
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-04 19:39 ` Andrew Warfield
@ 2006-07-05 0:25 ` Dan Smith
2006-07-05 0:48 ` Andrew Warfield
2006-07-05 1:40 ` Harry Butterworth
1 sibling, 1 reply; 44+ messages in thread
From: Dan Smith @ 2006-07-05 0:25 UTC (permalink / raw)
To: Andrew Warfield
Cc: Ian Pratt, Harry Butterworth, Julian Chesterfield, NAHieu,
xen-devel@lists.xensource.com
AW> I don't think that I see the immediate benefit of the lazy
AW> (migrate and fault blocks across on demand) block migration. It
AW> doubles your exposure to failure (at least) and adds overhead.
AW> The only possible example I can think of is to very temporarily
AW> offload a VM that's gone heavily CPU bound onto an unloaded host.
AW> Is there a more obviously useful situation that I'm missing?
I think the immediate benefit is mostly as a "built-in" feature to
allow migration of VMs easily between machines that do not share
access to a centralized infrastructure. Right now, if you want to do
that, you have to migrate the entire block device or file before you
can start the domain on the other side. The lazy migration allows you
to get the domain started immediately. It's probably not insanely
useful in an enterprise environment, but it would be a nice feature
for Xen to have, and I think it's possible that more enterprise
functionality could arise from developing the foundation. Even if you
had a centralized block server, you could still benefit from the
abilities, by caching blocks locally in a local block device, such as
a hard disk. The same infrastructure that provides the P2P lazy-copy
migration could be used to provide local caching, and probably more
interesting things.
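As a rough illustration of the local-caching idea, the sketch below puts a small LRU read cache with write-through in front of a central block server. Nothing here comes from the patch under discussion: `backend_read` and `backend_write` are hypothetical callables standing in for the centralized server, and a real cache would live on a local disk rather than in memory.

```python
from collections import OrderedDict


class CachedBlockDevice:
    """LRU block cache in front of a (hypothetical) central block server."""

    def __init__(self, backend_read, backend_write, capacity):
        self.backend_read = backend_read    # block number -> bytes
        self.backend_write = backend_write  # (block number, bytes) -> None
        self.capacity = capacity            # maximum number of cached blocks
        self.cache = OrderedDict()          # oldest entry first
        self.hits = 0

    def read(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)   # mark as most recently used
            self.hits += 1
            return self.cache[block]
        data = self.backend_read(block)     # miss: go to the server
        self._store(block, data)
        return data

    def write(self, block, data):
        # Write-through: the server always holds the authoritative copy,
        # so losing the local cache never loses data.
        self.backend_write(block, data)
        self._store(block, data)

    def _store(self, block, data):
        self.cache[block] = data
        self.cache.move_to_end(block)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used
```

Write-through is chosen here because it keeps the server authoritative, which matches the "using, but not depending on, local disk" property; a write-back cache would be faster but would reintroduce the local disk as a point of failure.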
I guess my initial comment was: I would think real enterprise people
would use iSCSI and a real SAN to provide access, instead of files on
NFS. In that case, perhaps we can give more flexibility than the NFS
solution, with better performance.
--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-05 0:25 ` Dan Smith
@ 2006-07-05 0:48 ` Andrew Warfield
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Warfield @ 2006-07-05 0:48 UTC (permalink / raw)
To: Dan Smith
Cc: Ian Pratt, Harry Butterworth, Julian Chesterfield, NAHieu,
xen-devel@lists.xensource.com
> Even if you
> had a centralized block server, you could still benefit from the
> abilities, by caching blocks locally in a local block device, such as
> a hard disk. The same infrastructure that provides the P2P lazy-copy
> migration could be used to provide local caching, and probably more
> interesting things.
Sure, local caching certainly makes sense here and I think there's
plenty of room to demonstrate benefit with using, but not depending
on, local disk.
> I guess my initial comment was: I would think real enterprise people
> would use iSCSI and a real SAN to provide access, instead of files on
> NFS. In that case, perhaps we can give more flexibility than the NFS
> solution, with better performance.
The concern that I have heard to motivate NFS is that VMware (and to a
lesser degree Virtual Server) have effectively trained administrators
to expect to manage VMs as image files (with vmdk/vhd). So people
understand how to configure NFS, and they understand how to
backup/snapshot/dup images using Unix 'cp'. It's a largely
non-technical concern, and I agree that you could do cunning FS hacks
to achieve the same sort of interface to LUNs or LVM volumes. Still,
a lot of enterprise admins seem to be very attached to NFS and an
FS-level interface to their images, and already have a lot of
home-baked goods to interact with them that way. To punctuate this
(and somebody please correct me if this is inaccurate...) I think that
VMware have only just started supporting iSCSI in the recent release
of esx/infrastructure -- so across the board for enterprise installs
this is all reasonably new ground for existing users.
a.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] Blktap: Userspace file-based image support.(RFC)
2006-07-04 19:39 ` Andrew Warfield
2006-07-05 0:25 ` Dan Smith
@ 2006-07-05 1:40 ` Harry Butterworth
1 sibling, 0 replies; 44+ messages in thread
From: Harry Butterworth @ 2006-07-05 1:40 UTC (permalink / raw)
To: Andrew Warfield
Cc: Ian Pratt, xen-devel@lists.xensource.com, Julian Chesterfield,
NAHieu, Dan Smith
On Tue, 2006-07-04 at 12:39 -0700, Andrew Warfield wrote:
> > > In principle, with the right software, and configuring your entire
> > > infrastructure from scratch, this sort of device-based mechanism may
> > > work very well.
> >
> > Yes. It does. Here's one we prepared earlier:
> > http://www-03.ibm.com/press/us/en/pressrelease/19705.wss
>
> I rather doubt that anyone who happens to have purchased SVC as an
> image store is terribly concerned about the ability to lazily copy VM
> images from one local disk to another.
Stephen was talking about a hypothetical cluster-aware device
infrastructure, and I was pointing out that cluster-aware device
infrastructures were already a solved problem and only hypothetical in
the sense that there isn't an open source implementation of one yet.
I was also pointing out that the technique used to create the
cluster-aware device infrastructure for SVC is publicly written up
(amongst other things) here:
http://www.research.ibm.com/journal/sj/422/glider.pdf
and better described in its purest form here:
http://portal.acm.org/citation.cfm?id=279227.279229
The same technique can conveniently be used to solve almost all the
difficult clustering problems in clustered Xen deployments, of which
there will be many. That includes the problem of making lazy migrations
between local disks on different physical machines sufficiently robust
for an enterprise-class customer to consider using the feature, should
we choose to implement it.
Harry.
^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <C0BD844E.5C4D%julian@xensource.com>]
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
[not found] <C0BD844E.5C4D%julian@xensource.com>
@ 2006-06-20 13:44 ` Julian Chesterfield
0 siblings, 0 replies; 44+ messages in thread
From: Julian Chesterfield @ 2006-06-20 13:44 UTC (permalink / raw)
To: aliguori; +Cc: xen-devel
> On 19/6/06 10:56 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>
>> Julian Chesterfield wrote:
>>>>
>>>> On 19/6/06 8:15 pm, "Anthony Liguori" <aliguori@us.ibm.com> wrote:
>>>>
>>>>> Couple general comments on the code:
>>>>>
>>>>> Please don't introduce more (ab)uses of /proc. Sure it's just for
>>>>> debugging but there's no reason to not make that sysfs.
>>>>>
>>>>> I'm not an expert here, but the nopage handlers that I've seen
>>>>> return
>>>>> NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.
>>>>>
>>>>> I think it's better to use C99 initialization than GCC:
>>>>>
>>>>> owner: ..., => .owner = ...,
>>>>>
>>>>> Some of the indenting is a bit off from Linux CodingStyle. Stuff
>>>>> like
>>>>> if( => if ( and some random spaces after an (.
>>>>>
>>>>> There's some code commented out with C++ comments too.
>>>>>
>>>>> What's the significance of /**BLKTAP**/ and /**TAPEND**/?
>>>>>
>>>>> I'm a little surprised to see these conversion tools too.
>>>>> Wouldn't it
>>>>> be easier to just add some parameters to qemu-img?
>>>
>>> Thanks for the comments anthony. When we initially played with qcow
>>> images it was easier to knock-up our own frontend to the plugins for
>>> converting between the different image types and testing features
>>> like
>>> image sparseness. We added an optimisation feature in the xen qcow
>>> plugin which would allocate full extents for non backing file based
>>> images as well as the asynchronous callback architecture to enable
>>> request batching for AIO.
>>>
>>> We could certainly adapt qemu-img to use these and other features.
>>> Not
>>> sure what the best approach for keeping the toolsets in synch between
>>> the 2 projects would be though.
>>
>> It may be worth just bringing up the changes on qemu-devel. I know
>> why
>> you'd want to change the cluster size (it's a pain to work with
>> clusters
>> < block size). I saw another comment about making metadata more
>> coarse. Can you clarify the reasons for that?
We've been thinking about an enhancement to the qcow driver to use
smarter readahead on the request ring in order to speculatively limit
the number of metadata writes where request batching is used. This is
an advantage of having access to the full frontend request queue which
enables the userspace agent to make smart decisions regarding caching
and safe but minimal metadata writes.
(Not sure which comment you'd read, but hope this may answer it!)
- Julian
^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <C0BDB8FE.5C5D%julian@xensource.com>]
* Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
[not found] <C0BDB8FE.5C5D%julian@xensource.com>
@ 2006-06-20 13:57 ` Julian Chesterfield
0 siblings, 0 replies; 44+ messages in thread
From: Julian Chesterfield @ 2006-06-20 13:57 UTC (permalink / raw)
To: danms; +Cc: xen-devel
>
> Another question I have is this: doesn't the dependence on libaio
> limit you to certain filesystems? For example, the page for libaio
> doesn't mention reiserfs as supported. Does that mean that SLES
> users won't be able to use ublkback?
Dan, I've just tested blktap with a reiserfs base filesystem. There
were no errors opening files with O_DIRECT, and the performance is
proportionally similar to ext3 (average block reads close to native,
average block writes above 80% under bonnie++). I'll explore further,
however it seems that O_DIRECT is supported.
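For anyone wanting to repeat the first part of that check on another filesystem, a small probe along these lines should do. It is a sketch rather than anything shipped with blktap: the function name is made up for illustration, and it only tests whether an O_DIRECT open succeeds, which says nothing about libaio behaviour or performance on that filesystem.

```python
import os
import tempfile


def supports_o_direct(directory):
    """Return True if a file created in `directory` opens with O_DIRECT."""
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:
        # The platform does not define O_DIRECT at all (e.g. non-Linux).
        return False
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        direct_fd = os.open(path, os.O_RDWR | flag)
    except OSError:
        # Typically EINVAL when the filesystem rejects direct I/O.
        return False
    else:
        os.close(direct_fd)
        return True
    finally:
        os.unlink(path)  # always remove the scratch file
```

Calling `supports_o_direct("/path/on/reiserfs")` with a directory on the filesystem of interest gives a quick yes/no before trying the tap disks there; the path shown is a placeholder.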
Thanks,
Julian
>
> Thanks for posting your code Andrew!
>
> --
> Dan Smith
> IBM Linux Technology Center
> Open Hypervisor Team
> email: danms@us.ibm.com
^ permalink raw reply [flat|nested] 44+ messages in thread
Thread overview: 44+ messages
2006-06-19 16:19 [PATCH] Blktap: Userspace file-based image support. (RFC) Andrew Warfield
2006-06-19 16:51 ` NAHieu
2006-06-19 17:22 ` Andrew Warfield
2006-06-19 18:41 ` NAHieu
2006-06-19 21:07 ` Andrew Warfield
2006-06-19 21:16 ` Dan Smith
2006-06-19 18:55 ` Anthony Liguori
2006-06-19 19:22 ` Andrew Warfield
2006-06-19 19:26 ` Andrew Warfield
2006-06-19 19:51 ` Anthony Liguori
2006-06-19 19:15 ` Anthony Liguori
2006-06-19 19:31 ` Andrew Warfield
2006-06-29 3:35 ` Rusty Russell
2006-06-29 5:24 ` Andrew Warfield
2006-06-29 6:31 ` Rusty Russell
2006-06-29 14:34 ` Andrew Warfield
2006-06-30 13:35 ` Stephen C. Tweedie
2006-06-30 14:17 ` Julian Chesterfield
2006-06-30 18:41 ` Jeff Moyer
2006-06-29 11:49 ` Anthony Liguori
2006-06-29 12:26 ` Laurent Vivier
[not found] <C0BCD26E.5C31%julian@xensource.com>
2006-06-19 21:42 ` Julian Chesterfield
2006-06-19 21:56 ` Anthony Liguori
-- strict thread matches above, loose matches on Subject: below --
2006-06-20 11:07 [PATCH] Blktap: Userspace file-based image support.(RFC) Ian Pratt
2006-06-20 21:10 ` Dan Smith
2006-06-21 14:45 ` Anthony Liguori
2006-06-30 13:41 ` Stephen C. Tweedie
2006-06-30 14:17 ` Dan Smith
2006-06-30 19:37 ` Stephen C. Tweedie
2006-06-30 20:06 ` Dan Smith
2006-06-30 22:15 ` Jerone Young
2006-07-01 0:36 ` Mark Williamson
2006-07-01 14:22 ` Dan Smith
2006-07-03 11:00 ` Mark Williamson
2006-07-03 14:52 ` Stephen C. Tweedie
2006-07-03 12:02 ` Harry Butterworth
2006-07-03 14:56 ` Stephen C. Tweedie
2006-07-03 15:40 ` Harry Butterworth
2006-07-04 19:39 ` Andrew Warfield
2006-07-05 0:25 ` Dan Smith
2006-07-05 0:48 ` Andrew Warfield
2006-07-05 1:40 ` Harry Butterworth
[not found] <C0BD844E.5C4D%julian@xensource.com>
2006-06-20 13:44 ` [PATCH] Blktap: Userspace file-based image support. (RFC) Julian Chesterfield
[not found] <C0BDB8FE.5C5D%julian@xensource.com>
2006-06-20 13:57 ` Julian Chesterfield