* RE: [PATCH] Blktap: Userspace file-based image support.(RFC)
@ 2006-06-20 11:07 Ian Pratt
  2006-06-20 21:10 ` Dan Smith
  0 siblings, 1 reply; 44+ messages in thread
From: Ian Pratt @ 2006-06-20 11:07 UTC (permalink / raw)
  To: Dan Smith, Andrew Warfield; +Cc: NAHieu, Xen Developers, Julian Chesterfield

> AW> This should be fixable though.  I'm also not sure how carefully
> AW> dm-u watches block completion responses to ensure safety of
> AW> metadata updates relative to data writes.  This too should be
> AW> fixable -- i just don't know if the user-level tools can currently
> AW> request completion notifications on requests that they've
> AW> processed.
> 
> So, right now, we're a little optimistic about metadata writing.  It
> will be relatively easy to hijack the callback routine for the disk
> request (a technique which is heavily used in the rest of the block
> layer) to get a completion trigger.  We can then notify userspace for
> the metadata write and then trigger the original callback routine for
> completion.

Yep, dm-userspace is certainly going to need to have a way of
intercepting IO completions and then choosing when it's actually going
to propagate the completion to the backend. That's quite a big change to
the current code (incidentally, the dm-snap code is pretty shocking in
this respect too).
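
For illustration only, the interception described above can be sketched in
a few lines of Python. This is not dm-userspace code: the names
intercept_completion, backend_done and update_metadata are hypothetical, and
the real mechanism would live in the kernel block layer. The sketch only
shows the ordering: the original completion is held back until the metadata
write has finished.

```python
from typing import Callable, List

Completion = Callable[[int], None]   # called with a status code, 0 = success


def intercept_completion(original: Completion,
                         write_metadata: Callable[[], int]) -> Completion:
    """Return a callback that delays `original` until metadata is written."""
    def wrapped(status: int) -> None:
        if status == 0:
            # The data write succeeded: write the metadata before anyone
            # upstream is told that the request is complete.
            status = write_metadata()
        # Propagate the (possibly updated) status to the original callback.
        original(status)
    return wrapped


events: List[str] = []


def backend_done(status: int) -> None:
    # Stand-in for the backend's original completion routine.
    events.append(f"complete:{status}")


def update_metadata() -> int:
    # Stand-in for the userspace metadata write; returns a status code.
    events.append("metadata")
    return 0


callback = intercept_completion(backend_done, update_metadata)
callback(0)   # simulated data-write completion
```

After the call, events records the metadata write before the propagated
completion.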

> AW> A benefit to the dm-user patch is that it is more of a linux
> AW> approach than a xen+linux approach.  Dm-user will be generally
> AW> useful in the linux tree
> 
> Right, this is a huge advantage, I think.  Being able to mount images
> as if they were disks will be quite helpful.  Another benefit is the
> ability to easily convert between formats.  Converting a vmdk to a
> qcow is as easy as mounting both and doing a "cp -R" between them.

I think the blktap code should definitely export a kernel device at the
top so that the same property holds. Should be easy to add.

> AW> which has some bad failure characteristics which can result in
> AW> both data being acknowledged as written even though it hasn't
> AW> been, and the OOM killer going insane.  I think some fixes to loop
> AW> probably need to be applied in the near future given how much
> AW> people are generally depending on the code with VMs.
> 
> Can you elaborate about what specifically is wrong with the loop
> driver?

It doesn't bypass the buffer cache (so all bets are off for data
integrity) and can end up consuming all of dom0 memory with dirty
buffers -- just create a few loop devices and do a few parallel dd's to
them and watch the oomkiller go on the rampage. It's even worse if the
filesystem the file lives on is slow e.g. NFS.

> AW> Julian and I have talked about extending the tap driver to combine
> AW> it with blkback and allow block address translation without access
> AW> to request contents.
> 
> Since the kernel already has a block address translation solution
> (i.e. device-mapper), is there a benefit to adding another
> xen-specific one?

I think blktap and dm-userspace are quite complementary, so I don't see
a problem with having them both in the tree. Right now, blktap looks to
be the more mature solution, but dm-userspace could catch up. Blktap
will obviously still be preferable when it's necessary to actually touch
the data.

Ian

* [PATCH] Blktap: Userspace file-based image support. (RFC)
@ 2006-06-19 16:19 Andrew Warfield
  2006-06-19 16:51 ` NAHieu
                   ` (3 more replies)
  0 siblings, 4 replies; 44+ messages in thread
From: Andrew Warfield @ 2006-06-19 16:19 UTC (permalink / raw)
  To: Xen Developers; +Cc: Julian Chesterfield

Attached to this email is a patch containing the (new and improved)
blktap Linux driver and associated userspace tools for Xen.  In
addition to being more flavourful, containing half the fat, and
removing stains twice as well as the old driver, this stuff adds a
userspace block backend and lets you use raw (without loopback), qcow,
and vmdk-based image files for your domUs.  There's also a fun little
driver that provides a shared-memory block device which, in
combination with OCFS2, represents a cheap-and-cheerful fast shared
filesystem between multiple domUs.

This code has been (somewhat lackadaisically) developed over the past
few years at Cambridge and has recently enjoyed massive improvements
thanks to the considerable efforts of Julian Chesterfield.

The code "works for us" and has been tested on a grand total of about
three machines.  We would love to have feedback from a broader
audience, in terms of both trying out the tools and inspecting the code.
We'll plan to release new patches at about 1-week intervals based on
comments.

Performance is quite good, and we intend to focus on this a bit more
over the next few weeks, releasing updated patches as they are
available.  Bonnie results this morning are as follows (64-bit results
compare against linux blkback+loopback file, Julian can follow up with
loopback results for 32-bit later if anyone's interested):

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
64-bit:
xen0     4096 40115 93.4 41067 12.7 22757  1.2 32532 56.7 53724  0.4 121.4  0.0
img-sp   4096 20291 86.0 38091 18.1 19939  8.2 30854 69.0 47779  4.2  95.3  0.4
loop-sp  4096 33421 77.6 33663 13.1 18546  5.1 28606 59.2 46659  6.0  85.2  0.1

32-bit:
xen0     1024 33857 94.0 45804 9.0  23269  0.0 25825 52.0 55628  0   185.0  0.0
img-sp   1448 32743 92.0 40703 8.0  23281  0.0 31139 75.0 56585  0   208.1  0.0

The patch is against cset 0426:840f33e54054 -- but is unlikely to
conflict with anything recent.  You'll need libaio and libaio-devel on
your build machine for the tools to compile.


(Blktap readme follows.)

Thanks!
a.

---


Blktap Userspace Tools + Library
================================

Andrew Warfield and Julian Chesterfield
16th June 2006

{firstname.lastname}@cl.cam.ac.uk

The blktap userspace toolkit provides a user-level disk I/O
interface. The blktap mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries.  Using these tools, blktap allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well.  Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.
  O_DIRECT and libaio allow high-performance implementation of even
  sparse image formats such as QCoW, while still preserving the safe
  ordering of metadata and data writes to ensure data integrity.
  (As opposed to, for instance, the loopback driver and LVM snapshots,
  both of which have very dangerous failure cases.)

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver.  (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

- All block data is forwarded in a zero-copy fashion, allowing for
  low-overhead userspace implementations.
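
The safe ordering of metadata and data writes mentioned in the first point
can be sketched as follows.  This is a toy sparse image, not the blktap qcow
driver: the layout (an index in the first block, data blocks appended after
it) and the names write_block and read_block are assumptions made for
illustration, and plain fsync() stands in for the O_DIRECT + libaio path.
The point is only that the data block is made stable before the index entry
that refers to it.

```python
import os
import struct
import tempfile

BLOCK = 4096          # hypothetical data block size
INDEX_ENTRY = 8       # hypothetical: one 64-bit file offset per virtual block
INDEX_BLOCKS = 16     # number of virtual blocks in this toy image
DATA_START = BLOCK    # the index occupies the first block, data follows


def write_block(fd: int, vblock: int, payload: bytes) -> int:
    """Allocate a data block for `vblock`, ordering data before metadata."""
    assert len(payload) == BLOCK and 0 <= vblock < INDEX_BLOCKS
    offset = max(os.fstat(fd).st_size, DATA_START)   # append at end of file
    os.pwrite(fd, payload, offset)
    os.fsync(fd)      # data must be stable before the index points at it
    os.pwrite(fd, struct.pack("<Q", offset), vblock * INDEX_ENTRY)
    os.fsync(fd)      # only now may the write be acknowledged
    return offset


def read_block(fd: int, vblock: int) -> bytes:
    """Follow the index; unallocated blocks read back as zeros."""
    raw = os.pread(fd, INDEX_ENTRY, vblock * INDEX_ENTRY)
    (offset,) = struct.unpack("<Q", raw)
    return bytes(BLOCK) if offset == 0 else os.pread(fd, BLOCK, offset)


fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, DATA_START)                 # empty index, no data yet
    first = write_block(fd, 3, b"A" * BLOCK)
    second = write_block(fd, 0, b"B" * BLOCK)
    data3 = read_block(fd, 3)
    data1 = read_block(fd, 1)                    # never written
finally:
    os.close(fd)
    os.unlink(path)
```

If the process dies between the two fsync() calls, the index still describes
the old state of the disk, so an acknowledged write is never lost and an
unacknowledged one is never half-visible.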

How it works (in one paragraph):

Working in conjunction with the kernel blktap driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired.  The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blockback
code.  We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.
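
To give a feel for what an asynchronous virtual disk interface of this kind
looks like, here is a minimal Python sketch.  It is not the interface shipped
in the patch: the class RamDisk and the methods queue_read, queue_write and
submit are hypothetical names, and the backend is a plain in-memory buffer.
It shows the queue-then-submit pattern with completion callbacks that
batched, asynchronous dispatch relies on.

```python
from typing import Callable, Dict, List, Tuple

SECTOR = 512
Callback = Callable[[int, bytes], None]   # (request id, data read or b"")
Request = Tuple[str, int, int, int, bytes, Callback]


class RamDisk:
    """Toy backend: requests are queued first, then completed as a batch."""

    def __init__(self, sectors: int) -> None:
        self.store = bytearray(sectors * SECTOR)
        self.pending: List[Request] = []

    def queue_read(self, sector: int, count: int,
                   req_id: int, done: Callback) -> None:
        self.pending.append(("read", req_id, sector, count, b"", done))

    def queue_write(self, sector: int, data: bytes,
                    req_id: int, done: Callback) -> None:
        self.pending.append(("write", req_id, sector, 0, bytes(data), done))

    def submit(self) -> int:
        """Dispatch every queued request and fire its completion callback."""
        batch, self.pending = self.pending, []
        for op, req_id, sector, count, data, done in batch:
            start = sector * SECTOR
            if op == "write":
                self.store[start:start + len(data)] = data
                done(req_id, b"")
            else:
                end = start + count * SECTOR
                done(req_id, bytes(self.store[start:end]))
        return len(batch)


completed: Dict[int, bytes] = {}
disk = RamDisk(sectors=8)
disk.queue_write(2, b"x" * SECTOR, req_id=1, done=completed.__setitem__)
disk.queue_read(2, 1, req_id=2, done=completed.__setitem__)
dispatched = disk.submit()    # both requests complete in one batch
```

A new disk format would, under this assumption, only need to supply its own
versions of the three methods.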


As of June 2006 the currently supported disk formats are:

 - Raw Images (both on partitions and in image files)
 - File-backed Qcow disks (sparse qcow overlay on a raw image/partition).
 - Standalone sparse Qcow disks (sparse disks, not backed by a parent image).
 - Fast shareable RAM disk between VMs (requires some form of cluster-based
   filesystem support e.g. OCFS2 in the guest kernel)
 - Some VMDK images - your mileage may vary

Raw and QCow images have asynchronous backends and so should perform
fairly well.  VMDK is based directly on the qemu vmdk driver, which is
synchronous (a.k.a. slow).

The qcow backends support existing qcow disks.  There is also a set
of tools to generate and convert qcow images.  With these tools (and
driver support), we maintain the qcow file format but adjust
parameters for higher performance with Xen -- using a larger segment
size (4096 instead of 512) and more coarsely allocating metadata
regions.  We are continuing to improve this work and expect qcow
performance to improve a great deal over the next few weeks.
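
As a rough illustration of why the larger segment size helps, the arithmetic
below counts how many segments a disk is divided into at each size.  It
assumes, for simplicity, one mapping entry per segment and a 4 GiB disk; both
are assumptions for the example, not a description of the actual qcow table
layout.

```python
def mapping_entries(disk_bytes: int, segment_bytes: int) -> int:
    """Number of segments (one mapping entry each) needed to cover a disk."""
    return -(-disk_bytes // segment_bytes)   # ceiling division


disk = 4 * 1024**3                     # 4 GiB, chosen only for illustration
small = mapping_entries(disk, 512)     # 8388608 entries
large = mapping_entries(disk, 4096)    # 1048576 entries
ratio = small // large                 # 8
```

Under these assumptions, moving from 512 to 4096 bytes means eight times
fewer entries to allocate, look up and keep consistent.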

Build and Installation Instructions
===================================

You will need libaio >= 0.3.104 on your target system to build the
tools (if you are installing RPMs, this means libaio and
libaio-devel).

Make sure to configure the blktap backend driver in your dom0 kernel.  It
will cooperate fine with the existing backend driver, so you can
experiment with tap disks without breaking existing VM configs.

To build the tools separately, "make && make install" in
tools/blktap_user.


Using the Tools
===============

Prepare the image for booting. For qcow files use the qcow utilities
installed earlier. e.g. qcow-create generates a blank standalone image
or a file-backed CoW image. img2qcow takes an existing image or
partition and creates a sparse, standalone qcow-based file.

Start the userspace disk agent either on system boot (e.g. via an init
script) or manually by running 'blktapctrl'.

Customise the VM config file to use the 'tap' handler, followed by the
driver type. e.g. for a raw image such as a file or partition:

disk = ['tap:aio:<FILENAME>,sda1,w']

e.g. for a qcow image:

disk = ['tap:qcow:<FILENAME>,sda1,w']
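
The structure of these disk strings can be made explicit with a short Python
sketch.  This is not the parser used by the Xen tools: parse_tap_disk and
TapDisk are hypothetical names, and the image path in the example is made up.
It simply splits a string of the form shown above into handler, driver type,
file name, guest device and access mode.

```python
from typing import NamedTuple


class TapDisk(NamedTuple):
    driver: str   # e.g. "aio" for raw images, "qcow" for qcow images
    path: str     # image file or partition
    device: str   # device name seen by the guest
    mode: str     # "w" for writable, as in the examples above


def parse_tap_disk(spec: str) -> TapDisk:
    """Split 'tap:<driver>:<FILENAME>,<device>,<mode>' into its fields."""
    target, device, mode = spec.split(",")
    handler, driver, path = target.split(":", 2)
    if handler != "tap":
        raise ValueError(f"not a tap disk: {spec!r}")
    return TapDisk(driver, path, device, mode)


# The path below is a made-up example, not a default location.
disk = parse_tap_disk("tap:qcow:/var/images/guest.qcow,sda1,w")
```

Here disk.driver is "qcow", disk.device is "sda1" and disk.mode is "w".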

[-- Attachment #2: blktap.patch.gz --]
[-- Type: application/x-gzip, Size: 82205 bytes --]

