* [Qemu-devel] Moving beyond image files
@ 2011-03-21 15:05 Anthony Liguori
2011-03-21 15:16 ` Alexander Graf
2011-03-21 21:35 ` Stefan Hajnoczi
0 siblings, 2 replies; 4+ messages in thread
From: Anthony Liguori @ 2011-03-21 15:05 UTC (permalink / raw)
To: qemu-devel
Cc: Kevin Wolf, Chunqiang Tang, Eric Van Hensbergen, Stefan Hajnoczi
We've been evaluating block migration in a real environment to try to
understand what the overhead of it is compared to normal migration. The
results so far are pretty disappointing. The speed of local disks ends
up becoming a big bottleneck even before the network does.
This has got me thinking about what we could do to avoid local I/O via
deduplication and other techniques. This has led me to wonder if it's
time to move beyond simple image files into something a bit more
sophisticated.
Ideally, I'd want a full Content Addressable Storage database like Venti
but there are lots of performance concerns with something like that.
I've been thinking about a middle ground and am looking for some
feedback. Here's my current thinking:
1) All block I/O goes through a daemon. There may be more than one
daemon to support multi-tenancy.
2) The daemon maintains metadata for each image that includes an extent
mapping and then a clustered allocated bitmap within each extent
(similar to FVD).
At this point, it's basically sparse raw but through a single daemon.
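To make point 2 concrete, here is a minimal sketch (in Python, with invented names and sizes; nothing here is actual QEMU or FVD code) of per-image metadata holding an extent mapping plus a per-extent cluster allocation bitmap:

```python
# Hypothetical sketch of the point-2 metadata: an extent map, and within
# each extent a bitmap with one bit per cluster. Granularities are made up.

CLUSTER_SIZE = 64 * 1024          # assumed cluster granularity
CLUSTERS_PER_EXTENT = 1024        # assumed extent granularity

class Extent:
    def __init__(self, file_offset):
        self.file_offset = file_offset  # where this extent's data lives on disk
        self.alloc_bitmap = 0           # one bit per cluster in the extent

class ImageMetadata:
    def __init__(self):
        self.extents = {}               # extent index -> Extent

    def mark_allocated(self, cluster_index):
        ext_idx, bit = divmod(cluster_index, CLUSTERS_PER_EXTENT)
        ext = self.extents.setdefault(
            ext_idx, Extent(ext_idx * CLUSTERS_PER_EXTENT * CLUSTER_SIZE))
        ext.alloc_bitmap |= 1 << bit

    def is_allocated(self, cluster_index):
        ext_idx, bit = divmod(cluster_index, CLUSTERS_PER_EXTENT)
        ext = self.extents.get(ext_idx)
        return bool(ext and ext.alloc_bitmap & (1 << bit))
```

Unallocated extents take no metadata at all, which is what makes this behave like sparse raw.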
3) All writes result in a sha1 being calculated before the write is
completed. The daemon maintains a mapping of sha1s -> clusters. A
single sha1 may map to many clusters. The sha1 mapping can be made
eventually consistent using a journal or even a dirty bitmap. It can be
partially rebuilt easily.
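A toy sketch of that write path, assuming cluster-granularity writes (class and field names are invented for illustration, not a real implementation):

```python
import hashlib
from collections import defaultdict

class BlockDaemon:
    """Toy model of point 3: the sha1 of each cluster's content is
    computed before the write completes, and the daemon keeps a
    sha1 -> clusters mapping."""

    def __init__(self):
        self.hash_to_clusters = defaultdict(set)  # one sha1 may map to many clusters
        self.clusters = {}                        # stand-in for the backing file

    def write_cluster(self, index, data):
        digest = hashlib.sha1(data).hexdigest()   # calculated before completion
        self.clusters[index] = bytes(data)
        self.hash_to_clusters[digest].add(index)
        # A real daemon would also retire the cluster's old mapping, but
        # per the mail the mapping is only eventually consistent, so stale
        # entries can be cleaned up lazily or rebuilt from a journal.
        return digest
```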
I think this is where v1 stops. With just this level of functionality,
I think we have some very interesting properties:
a) Performance should be pretty close to raw
b) Without doing any (significant) disk I/O, we know exactly what data
an image is composed of. This means we can do rsync-style image
streaming that uses potentially much less network I/O and potentially
much less disk I/O.
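The streaming in (b) could then reduce to comparing hash lists rather than reading data. A hypothetical sketch:

```python
def clusters_to_transfer(source_manifest, local_hashes):
    """source_manifest: list of (cluster_index, sha1) describing the
    source image. local_hashes: set of sha1s already stored locally.
    Only content we do not already hold needs to cross the network;
    everything else can be satisfied from local storage, rsync-style."""
    return [idx for idx, digest in source_manifest
            if digest not in local_hashes]
```

The manifest itself comes straight from the daemon's sha1 mapping, so building it needs no significant disk I/O.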
In a v2, I think you can add some interesting features that take
advantage of the hashing. For instance:
4) If you run out of disk space, you can look for a hash with a
refcount > 1 and split off a reference, making it copy-on-write. Then
you can treat the remaining references as free list entries.
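As a sketch of point 4 (hypothetical, assuming the refcount is simply the number of clusters a hash currently maps to):

```python
def split_duplicates(hash_to_clusters):
    """For any hash with refcount > 1, keep one cluster as the canonical
    (now copy-on-write) copy and return the rest as free-list entries."""
    free_list = []
    cow = {}  # sha1 -> the single cluster kept for that content
    for digest, clusters in hash_to_clusters.items():
        ordered = sorted(clusters)
        cow[digest] = ordered[0]       # canonical copy, marked COW
        free_list.extend(ordered[1:])  # duplicate clusters can be reclaimed
    return cow, free_list
```

Subsequent writes to a COW reference would need to allocate a fresh cluster before modifying it, which is the usual copy-on-write cost.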
5) Copy-on-write references potentially become very interesting for
image streaming because you can avoid any I/O for blocks that are
already stored locally.
This is not fully baked yet but I thought I'd at least throw it out
there as a topic for discussion. I think we've focused almost entirely
on single images so I think it's worth thinking a little about different
storage models.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Moving beyond image files
From: Alexander Graf @ 2011-03-21 15:16 UTC (permalink / raw)
To: Anthony Liguori
Cc: Kevin Wolf, Chunqiang Tang, Eric Van Hensbergen, qemu-devel,
Stefan Hajnoczi
On 21.03.2011, at 16:05, Anthony Liguori wrote:
> We've been evaluating block migration in a real environment to try to understand what the overhead of it is compared to normal migration. The results so far are pretty disappointing. The speed of local disks ends up becoming a big bottleneck even before the network does.
>
> This has got me thinking about what we could do to avoid local I/O via deduplication and other techniques. This has led me to wonder if it's time to move beyond simple image files into something a bit more sophisticated.
>
> Ideally, I'd want a full Content Addressable Storage database like Venti but there are lots of performance concerns with something like that.
>
> I've been thinking about a middle ground and am looking for some feedback. Here's my current thinking:
>
> 1) All block I/O goes through a daemon. There may be more than one daemon to support multi-tenancy.
>
> 2) The daemon maintains metadata for each image that includes an extent mapping and then a clustered allocated bitmap within each extent (similar to FVD).
>
> At this point, it's basically sparse raw but through a single daemon.
>
> 3) All writes result in a sha1 being calculated before the write is completed. The daemon maintains a mapping of sha1s -> clusters. A single sha1 may map to many clusters. The sha1 mapping can be made eventually consistent using a journal or even a dirty bitmap. It can be partially rebuilt easily.
>
> I think this is where v1 stops. With just this level of functionality, I think we have some very interesting properties:
>
> a) Performance should be pretty close to raw
>
> b) Without doing any (significant) disk I/O, we know exactly what data an image is composed of. This means we can do an rsync style image streaming that uses potentially much less network I/O and potentially much less disk I/O.
>
> In a v2, I think you can add some interesting features that take advantage of the hashing. For instance:
>
> 4) If you run out of disk space, you can look for a hash with a refcount > 1 and split off a reference, making it copy-on-write. Then you can treat the remaining references as free list entries.
>
> 5) Copy-on-write references potentially become very interesting for image streaming because you can avoid any I/O for blocks that are already stored locally.
>
> This is not fully baked yet but I thought I'd at least throw it out there as a topic for discussion. I think we've focused almost entirely on single images so I think it's worth thinking a little about different storage models.
Wouldn't it make sense to have your file system be that daemon and add an interface to it so you can receive the sha1 sums (that you need for dedup anyways) to calculate rsync style diffs?
That way you'd also speed up 2 other use cases:
a) normal raw storage - no need to implement new protocols, file formats, etc
b) real rsync on real data that is not vm images
Alex
* Re: [Qemu-devel] Moving beyond image files
From: Anthony Liguori @ 2011-03-21 16:04 UTC (permalink / raw)
To: Alexander Graf
Cc: Kevin Wolf, Chunqiang Tang, Eric Van Hensbergen, qemu-devel,
Stefan Hajnoczi
On 03/21/2011 10:16 AM, Alexander Graf wrote:
> On 21.03.2011, at 16:05, Anthony Liguori wrote:
>
>>
>> 5) Copy-on-write references potentially become very interesting for image streaming because you can avoid any I/O for blocks that are already stored locally.
>>
>> This is not fully baked yet but I thought I'd at least throw it out there as a topic for discussion. I think we've focused almost entirely on single images so I think it's worth thinking a little about different storage models.
> Wouldn't it make sense to have your file system be that daemon
I see that as purely an implementation detail.
Regards,
Anthony Liguori
> and add an interface to it so you can receive the sha1 sums (that you need for dedup anyways) to calculate rsync style diffs?
>
> That way you'd also speed up 2 other use cases:
>
> a) normal raw storage - no need to implement new protocols, file formats, etc
> b) real rsync on real data that is not vm images
>
>
> Alex
>
>
* Re: [Qemu-devel] Moving beyond image files
From: Stefan Hajnoczi @ 2011-03-21 21:35 UTC (permalink / raw)
To: Anthony Liguori
Cc: Kevin Wolf, Chunqiang Tang, Eric Van Hensbergen, qemu-devel,
Stefan Hajnoczi
On Mon, Mar 21, 2011 at 3:05 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> 2) The daemon maintains metadata for each image that includes an extent
> mapping and then a clustered allocated bitmap within each extent (similar to
> FVD).
s/clustered allocated bitmap/cluster allocation bitmap/ ?
> 3) All writes result in a sha1 being calculated before the write is
> completed. The daemon maintains a mapping of sha1's -> clusters. A single
> sha1 may map to many clusters. The sha1 mapping can be made eventually
> consistent using a journal or even dirty bitmap. It can be partially
> rebuilt easily.
Can you explain this in more detail? A write to a single sector of a
cluster causes what to happen? Why is the hash calculated before
acking the write and not queued in the background if the hash mapping
is only eventually consistent?
For v3:
1. Snapshots.
2. You can connect remote daemons for read-only master images. If an
image is backed by a remote image, reads to unallocated clusters are
sent to the remote. This also allows a master image daemon to
keep refcounts of how many instances are currently based off an image;
once instances copy/stream that data they can drop their reference, and
storage administrators know when the master image is safe for deletion.
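A hypothetical sketch of that v3 read path: reads to locally unallocated clusters fall through to the remote master daemon, which also counts attached instances (all names invented for illustration):

```python
class MasterDaemon:
    """Read-only master image; tracks how many instances are based off it."""
    def __init__(self, clusters):
        self.clusters = clusters   # cluster index -> content
        self.instances = 0

    def attach(self):
        self.instances += 1

    def detach(self):              # called once an instance has streamed everything
        self.instances -= 1

    def safe_to_delete(self):
        return self.instances == 0

    def read_cluster(self, idx):
        return self.clusters.get(idx, b"\0")

class InstanceDaemon:
    def __init__(self, master):
        self.master = master
        self.local = {}            # locally allocated clusters
        master.attach()

    def read_cluster(self, idx):
        if idx in self.local:      # allocated locally: no remote I/O
            return self.local[idx]
        return self.master.read_cluster(idx)  # unallocated: ask the master
```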
Stefan