From: Anthony Liguori
Subject: Re: [PATCH] Blktap: Userspace file-based image support. (RFC)
Date: Mon, 19 Jun 2006 14:15:10 -0500
Message-ID: <4496F7BE.2020108@us.ibm.com>
To: Andrew Warfield
Cc: Xen Developers, Julian Chesterfield
List-Id: xen-devel@lists.xenproject.org

Couple general comments on the code:

Please don't introduce more (ab)uses of /proc.  Sure it's just for
debugging, but there's no reason not to make that sysfs.

I'm not an expert here, but the nopage handlers that I've seen return
NOPAGE_SIGBUS instead of manually causing a SIGBUS on current.

I think it's better to use C99 initialization than the GCC form:

    owner: ...,  =>  .owner = ...,

Some of the indenting is a bit off from Linux CodingStyle.  Stuff like
"if(" => "if (" and some random spaces after a "(".  There's some code
commented out with C++ comments too.

What's the significance of /**BLKTAP**/ and /**TAPEND**/?

I'm a little surprised to see these conversion tools too.  Wouldn't it
be easier to just add some parameters to qemu-img?

Pretty interesting stuff, thanks for posting.

Regards,

Anthony Liguori

Andrew Warfield wrote:
> Attached to this email is a patch containing the (new and improved)
> blktap Linux driver and associated userspace tools for Xen.  In
> addition to being more flavourful, containing half the fat, and
> removing stains twice as well as the old driver, this stuff adds a
> userspace block backend and lets you use raw (without loopback), qcow,
> and vmdk-based image files for your domUs.
> There's also a fun little
> driver that provides a shared-memory block device which, in
> combination with OCFS2, represents a cheap-and-cheerful fast shared
> filesystem between multiple domUs.
>
> This code has been (somewhat lackadaisically) developed over the past
> few years at Cambridge and has recently enjoyed massive improvements
> thanks to the considerable efforts of Julian Chesterfield.
>
> The code "works for us" and has been tested on a grand total of about
> three machines.  We would love to have feedback from a broader
> audience, in terms of both trying out the tools and inspecting the code.
> We plan to release new patches at about 1-week intervals based on
> comments.
>
> Performance is quite good, and we intend to focus on this a bit more
> over the next few weeks, releasing updated patches as they are
> available.  Bonnie results this morning are as follows (64-bit results
> compare against linux blkback+loopback file, Julian can follow up with
> loopback results for 32-bit later if anyone's interested):
>
>               -------Sequential Output-------- ---Sequential Input-- --Random--
>               -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
> 64-bit:
> xen0     4096 40115 93.4 41067 12.7 22757  1.2 32532 56.7 53724  0.4 121.4  0.0
> img-sp   4096 20291 86.0 38091 18.1 19939  8.2 30854 69.0 47779  4.2  95.3  0.4
> loop-sp  4096 33421 77.6 33663 13.1 18546  5.1 28606 59.2 46659  6.0  85.2  0.1
>
> 32-bit:
> xen0     1024 33857 94.0 45804  9.0 23269  0.0 25825 52.0 55628    0 185.0  0.0
> img-sp   1448 32743 92.0 40703  8.0 23281  0.0 31139 75.0 56585    0 208.1  0.0
>
> The patch is against cset 0426:840f33e54054 -- but is unlikely to
> conflict with anything recent.  You'll need libaio and libaio-devel on
> your build machine for the tools to compile.
>
>
> (Blktap readme follows.)
>
> Thanks!
> a.
>
> ---
>
>
> Blktap Userspace Tools + Library
> ================================
>
> Andrew Warfield and Julian Chesterfield
> 16th June 2006
>
> {firstname.lastname}@cl.cam.ac.uk
>
> The blktap userspace toolkit provides a user-level disk I/O
> interface.  The blktap mechanism involves a kernel driver that acts
> similarly to the existing Xen/Linux blkback driver, and a set of
> associated user-level libraries.  Using these tools, blktap allows
> virtual block devices presented to VMs to be implemented in userspace
> and to be backed by raw partitions, files, network, etc.
>
> The key benefit of blktap is that it makes it easy and fast to write
> arbitrary block backends, and that these user-level backends actually
> perform very well.  Specifically:
>
> - Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
>   formats and other compression features can be easily implemented.
>   O_DIRECT and libaio allow high-performance implementation of even
>   sparse image formats such as QCoW, while still preserving the safe
>   ordering of metadata and data writes to ensure data integrity.
>   (As opposed to, for instance, the loopback driver and LVM snaps,
>   both of which have very dangerous failure cases.)
>
> - Accessing file-based images from userspace avoids problems related
>   to flushing dirty pages which are present in the Linux loopback
>   driver.  (Specifically, doing a large number of writes to an
>   NFS-backed image doesn't result in the OOM killer going berserk.)
>
> - Per-disk handler processes enable easier userspace policing of block
>   resources, and process-granularity QoS techniques (disk scheduling
>   and related tools) may be trivially applied to block devices.
>
> - It's very easy to take advantage of userspace facilities such as
>   networking libraries, compression utilities, peer-to-peer
>   file-sharing systems and so on to build more complex block backends.
>
> - Crashes are contained -- incremental development/debugging is very
>   fast.
>
> - All block data is forwarded in a zero-copy fashion, allowing for
>   low-overhead userspace implementations.
>
> How it works (in one paragraph):
>
> Working in conjunction with the kernel blktap driver, all disk I/O
> requests from VMs are passed to the userspace daemon (using a shared
> memory interface) through a character device.  Each active disk is
> mapped to an individual device node, allowing per-disk processes to
> implement individual block devices where desired.  The userspace
> drivers are implemented using asynchronous (Linux libaio),
> O_DIRECT-based calls to preserve the unbuffered, batched and
> asynchronous request dispatch achieved with the existing blkback
> code.  We provide a simple, asynchronous virtual disk interface that
> makes it quite easy to add new disk implementations.
>
>
> As of June 2006 the currently supported disk formats are:
>
> - Raw Images (both on partitions and in image files)
> - File-backed Qcow disks (sparse qcow overlay on a raw image/partition).
> - Standalone sparse Qcow disks (sparse disks, not backed by a parent
>   image).
> - Fast shareable RAM disk between VMs (requires some form of
>   cluster-based filesystem support, e.g. OCFS2, in the guest kernel)
> - Some VMDK images - your mileage may vary
>
> Raw and QCow images have asynchronous backends and so should perform
> fairly well.  VMDK is based directly on the qemu vmdk driver, which is
> synchronous (a.k.a. slow).
>
> The qcow backends support existing qcow disks.  There is also a set
> of tools to generate and convert qcow images.  With these tools (and
> driver support), we maintain the qcow file format but adjust
> parameters for higher performance with Xen -- using a larger segment
> size (4096 instead of 512) and more coarsely allocating metadata
> regions.  We are continuing to improve this work and expect qcow
> performance to improve a great deal over the next few weeks.
>
> Build and Installation Instructions
> ===================================
>
> You will need libaio >= 0.3.104 on your target system to build the
> tools (if you are installing RPMs, this means libaio and
> libaio-devel).
>
> Make sure to configure the blktap backend driver in your dom0 kernel.
> It will cooperate fine with the existing backend driver, so you can
> experiment with tap disks without breaking existing VM configs.
>
> To build the tools separately, "make && make install" in
> tools/blktap_user.
>
>
> Using the Tools
> ===============
>
> Prepare the image for booting.  For qcow files use the qcow utilities
> installed earlier, e.g. qcow-create generates a blank standalone image
> or a file-backed CoW image.  img2qcow takes an existing image or
> partition and creates a sparse, standalone qcow-based file.
>
> Start the userspace disk agent either on system boot (e.g. via an init
> script) or manually by running 'blktapctrl'.
>
> Customise the VM config file to use the 'tap' handler, followed by the
> driver type, e.g. for a raw image such as a file or partition:
>
>     disk = ['tap:aio:,sda1,w']
>
> e.g. for a qcow image:
>
>     disk = ['tap:qcow:,sda1,w']
> ------------------------------------------------------------------------
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel