From: Anthony Liguori <aliguori@us.ibm.com>
To: qemu-devel <qemu-devel@nongnu.org>
Cc: Kevin Wolf <kwolf@redhat.com>, Chunqiang Tang <ctang@us.ibm.com>,
	Eric Van Hensbergen <ericvh@gmail.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Subject: [Qemu-devel] Moving beyond image files
Date: Mon, 21 Mar 2011 10:05:20 -0500
Message-ID: <4D876930.6000106@us.ibm.com>

We've been evaluating block migration in a real environment to try to 
understand what the overhead of it is compared to normal migration.  The 
results so far are pretty disappointing.  The speed of local disks ends 
up becoming a big bottleneck even before the network does.

This has got me thinking about what we could do to avoid local I/O via 
deduplication and other techniques.  This has led me to wonder if it's 
time to move beyond simple image files into something a bit more 
sophisticated.

Ideally, I'd want a full Content Addressable Storage database like Venti 
but there are lots of performance concerns with something like that.

I've been thinking about a middle ground and am looking for some 
feedback.  Here's my current thinking:

1) All block I/O goes through a daemon.  There may be more than one 
daemon to support multi-tenancy.

2) The daemon maintains metadata for each image that includes an extent 
mapping and then a cluster allocation bitmap within each extent 
(similar to FVD).

At this point, it's basically sparse raw but through a single daemon.
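
To make the metadata in (2) a bit more concrete, here is a rough Python
sketch of an extent map with a per-extent cluster allocation bitmap.  It
is only an illustration: the cluster size, the extent size, and every
class and method name are assumptions, not part of any existing format.

```python
CLUSTER_SIZE = 64 * 1024        # assumed cluster size (64 KiB)
CLUSTERS_PER_EXTENT = 1024      # assumed extent size (64 MiB)


class Extent:
    """One extent: a bitmap with one bit per cluster (hypothetical)."""

    def __init__(self):
        self.bitmap = bytearray(CLUSTERS_PER_EXTENT // 8)

    def is_allocated(self, idx):
        return bool(self.bitmap[idx // 8] & (1 << (idx % 8)))

    def mark_allocated(self, idx):
        self.bitmap[idx // 8] |= 1 << (idx % 8)


class ImageMetadata:
    """Sparse extent map: extents exist only once something is written."""

    def __init__(self):
        self.extents = {}       # extent number -> Extent

    def _locate(self, offset):
        # Translate a byte offset into (extent number, cluster index).
        cluster = offset // CLUSTER_SIZE
        return divmod(cluster, CLUSTERS_PER_EXTENT)

    def mark_written(self, offset):
        ext_no, idx = self._locate(offset)
        self.extents.setdefault(ext_no, Extent()).mark_allocated(idx)

    def is_allocated(self, offset):
        ext_no, idx = self._locate(offset)
        ext = self.extents.get(ext_no)
        return ext is not None and ext.is_allocated(idx)
```

A read of an unallocated cluster would simply return zeroes, which is
what makes this behave like sparse raw.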

3) All writes result in a sha1 being calculated before the write is 
completed.  The daemon maintains a mapping of sha1s -> clusters.  A 
single sha1 may map to many clusters.  The sha1 mapping can be made 
eventually consistent using a journal or even a dirty bitmap.  It can be 
partially rebuilt easily.
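
One possible shape for the sha1 mapping in (3), using the journal
variant: the hash is computed on the write path, but the mapping itself
is only updated when the journal is replayed.  This is a sketch under
assumptions; `HashIndex`, `record_write` and `replay` are made-up names,
and the journal here lives in memory rather than on disk.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """sha1 -> clusters mapping, updated lazily from a journal."""

    def __init__(self):
        self.by_hash = defaultdict(set)   # sha1 hex digest -> cluster numbers
        self.by_cluster = {}              # cluster number -> sha1 hex digest
        self.journal = []                 # pending (cluster, digest) entries

    def record_write(self, cluster, data):
        # Write path: hash the data and append to the journal only.
        digest = hashlib.sha1(data).hexdigest()
        self.journal.append((cluster, digest))
        return digest

    def replay(self):
        # Background path: bring the mapping up to date with the journal.
        for cluster, digest in self.journal:
            old = self.by_cluster.get(cluster)
            if old == digest:
                continue
            if old is not None:
                self.by_hash[old].discard(cluster)
                if not self.by_hash[old]:
                    del self.by_hash[old]
            self.by_hash[digest].add(cluster)
            self.by_cluster[cluster] = digest
        self.journal.clear()
```

Between replays the mapping can be stale, which is the eventual
consistency mentioned above.  A dirty bitmap variant would instead record
only which clusters changed and rehash them later.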

I think this is where v1 stops.  With just this level of functionality, 
I think we have some very interesting properties:

a) Performance should be pretty close to raw

b) Without doing any (significant) disk I/O, we know exactly what data 
an image is composed of.  This means we can do rsync-style image 
streaming that potentially uses much less network I/O and much less 
disk I/O.
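
As a rough illustration of (b), the sender could compare per-cluster
hashes against what the destination already holds and only ship the
clusters that are missing.  The function below is a hypothetical sketch:
`plan_transfer` and its arguments are invented for this example, and it
ignores how the hash lists would actually be exchanged.

```python
def plan_transfer(source_hashes, dest_hashes):
    """Split source clusters into those to send and those to reuse.

    source_hashes: list of sha1 digests, one per cluster, with None for
                   unallocated clusters.
    dest_hashes:   set of sha1 digests already stored on the destination.
    """
    send, reuse = [], []
    sent = set()    # digests that will be on the destination after sending
    for cluster, digest in enumerate(source_hashes):
        if digest is None:
            continue                    # unallocated: nothing to transfer
        if digest in dest_hashes or digest in sent:
            reuse.append(cluster)       # data is (or will be) available locally
        else:
            send.append(cluster)        # must go over the network
            sent.add(digest)
    return send, reuse
```

Only the hash lists need to be read to build the plan, so no image data
is touched until the transfer itself starts.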

In a v2, I think you can add some interesting features that take 
advantage of the hashing.  For instance:

4) If you run out of disk space, you can look for a hash with a 
refcount > 1, and split off a reference making it copy-on-write.  Then 
you can treat the remaining references as free list entries.

5) Copy-on-write references potentially become very interesting for 
image streaming because you can avoid any I/O for blocks that are 
already stored locally.
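
One way to read (4), sketched in Python: for each hash with more than
one cluster, keep a single copy, point the other clusters at it
copy-on-write, and hand their storage to the free list.  All names here
(`reclaim_duplicates`, `cow_map`, `free_list`) are assumptions made for
the example, and real code would also have to break the sharing again on
the next write to a remapped cluster.

```python
def reclaim_duplicates(by_hash, cow_map, free_list):
    """Turn duplicate clusters into copy-on-write references.

    by_hash:   dict of sha1 digest -> set of cluster numbers with that data
    cow_map:   dict of cluster -> cluster that now backs it (updated in place)
    free_list: list of clusters whose storage can be reused (updated in place)
    Returns the number of clusters freed.
    """
    freed = 0
    for digest, clusters in list(by_hash.items()):
        if len(clusters) < 2:
            continue                    # refcount of 1: nothing to reclaim
        keep, *dupes = sorted(clusters)
        for cluster in dupes:
            cow_map[cluster] = keep     # reads are redirected to the kept copy
            free_list.append(cluster)   # its own storage is now free
            freed += 1
        by_hash[digest] = {keep}
    return freed
```

The same `cow_map` idea would cover (5): a streamed-in block whose hash
is already present locally becomes a reference instead of a copy.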

This is not fully baked yet but I thought I'd at least throw it out 
there as a topic for discussion.  I think we've focused almost entirely 
on single images so I think it's worth thinking a little about different 
storage models.

Regards,

Anthony Liguori
