qemu-devel.nongnu.org archive mirror
* [Qemu-devel] QCOW2 deduplication design
@ 2013-01-09 15:24 Benoît Canet
  2013-01-09 16:16 ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Benoît Canet @ 2013-01-09 15:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, stefanha

Hello,

Here is a mail to open a discussion on QCOW2 deduplication design and
performance.

The current deduplication strategy is RAM based.
One of the goals of the project is to plan and implement an alternative way to do
the lookups from disk for bigger images.


In the first section I will enumerate the disk overheads of the RAM-based lookup
strategy, and in the second section the additional costs of doing lookups in a
disk-based prefix b-tree.

Comments and suggestions are welcome.

I) RAM based lookups overhead

The qcow2 read path is not modified by the deduplication patchset.

Each cluster written gets its hash computed.

Two GTrees are used to give access to the hashes: one indexed by hash and
the other indexed by physical offset.
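
For readers who want to picture the in-RAM structures, here is a minimal,
self-contained sketch of what the two indexes could look like (illustrative
names and an assumed 32-byte hash; this is not the patchset code):

/* Two GLib GTrees giving access to the same hash node: one keyed by the
 * hash bytes, one by the physical cluster offset.  Illustrative only. */
#include <glib.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32                      /* assumed hash size */

typedef struct {
    uint8_t  hash[HASH_LEN];
    uint64_t phys_offset;                /* cluster offset in the image file */
} DedupNode;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_LEN);
}

static gint offset_cmp(gconstpointer a, gconstpointer b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return x < y ? -1 : x > y ? 1 : 0;
}

int main(void)
{
    GTree *by_hash   = g_tree_new(hash_cmp);
    GTree *by_offset = g_tree_new(offset_cmp);

    DedupNode n = { .phys_offset = 0x10000 };
    memset(n.hash, 0xab, HASH_LEN);      /* stand-in for a real hash */

    /* The same node is reachable by either key. */
    g_tree_insert(by_hash, n.hash, &n);
    g_tree_insert(by_offset, &n.phys_offset, &n);

    return g_tree_lookup(by_offset, &n.phys_offset) == &n ? 0 : 1;
}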

I.0) unaligned write

When a write is unaligned or smaller than a 4KB cluster, the deduplication code
issues one or two reads to get the missing data required to build a 4KB*n linear
buffer.
The deduplication metrics code shows that this situation does not happen with
virtio and ext3 as a guest partition.
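
As an illustration of the read-modify-write bounds this implies (my reading of
the description above, not code from the patchset):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CLUSTER 4096u

int main(void)
{
    uint64_t off = 5120, len = 1024;     /* a misaligned 1KB guest write */

    /* Round the request out to cluster boundaries... */
    uint64_t start = off & ~(uint64_t)(CLUSTER - 1);                       /* 4096 */
    uint64_t end   = (off + len + CLUSTER - 1) & ~(uint64_t)(CLUSTER - 1); /* 8192 */

    /* ...and read the missing head and tail to build the linear buffer. */
    uint64_t head = off - start;         /* 1024 bytes read before the data */
    uint64_t tail = end - (off + len);   /* 2048 bytes read after it        */

    printf("buffer %" PRIu64 " bytes, head read %" PRIu64 ", tail read %" PRIu64 "\n",
           end - start, head, tail);
    return 0;
}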

I.1) First write overhead

The hash is computed.

The cluster is not a duplicate, so the hash is stored in a linked list.

After that the writev call gets a new 64KB L2 dedup hash block corresponding to
the physical sector of the written cluster.
(This can be an allocating write, requiring the offset of the new block to be
written in the dedup table and flushed.)

The hash is written in the L2 dedup hash block and flushed later by the
dedup_block_cache.
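
To make the "which hash block covers this cluster" step concrete, a small
sketch of the indexing arithmetic, assuming 4KB clusters, 64KB hash blocks and
a hypothetical 32-byte on-disk hash entry (the real entry size may differ):

#include <stdint.h>
#include <stdio.h>

#define CLUSTER_SIZE   4096u
#define HASH_BLOCK_SZ  65536u
#define HASH_ENTRY_SZ  32u               /* assumed, not taken from the patchset */

int main(void)
{
    uint64_t phys_offset = 0x12345000;   /* some physical cluster offset */

    uint64_t cluster_index   = phys_offset / CLUSTER_SIZE;
    uint64_t entries_per_blk = HASH_BLOCK_SZ / HASH_ENTRY_SZ;   /* 2048 */

    uint64_t hash_block_index = cluster_index / entries_per_blk;
    uint64_t index_in_block   = cluster_index % entries_per_blk;

    /* With these numbers one 64KB hash block covers 2048 * 4KB = 8MB of data. */
    printf("hash block %llu, entry %llu\n",
           (unsigned long long)hash_block_index,
           (unsigned long long)index_in_block);
    return 0;
}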

I.2) Same cluster rewrite at the same place

The hash is computed.

qcow2_get_cluster_offset is called and the result is used to check that it is a
rewrite.

The cluster is counted as duplicated and is not rewritten on disk.

I.3) First duplicated cluster write

The hash is computed.

qcow2_get_cluster_offset is called and we see that we are not rewriting the same
cluster at the same place.

I.3.a) The L2 entry of the first cluster written with this hash is overwritten
to remove the QCOW_OFLAG_COPIED flag.

I.3.b) The dedup hash block of the hash is overwritten to remember at the next
startup that QCOW_OFLAG_COPIED has been cleared.

A new L2 entry is created for this logical sector pointing to the physical
cluster. (potential allocating write)

The refcount of the physical cluster is updated.
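
For I.3.a, the in-memory part is just clearing the top bit of the 64-bit L2
entry (QCOW_OFLAG_COPIED is the top bit in qcow2); a tiny sketch that leaves
out the big-endian conversion and the actual metadata write-back:

#include <stdint.h>
#include <stdio.h>

#define QCOW_OFLAG_COPIED (1ULL << 63)

int main(void)
{
    /* Host-endian L2 entry: cluster at offset 0x50000, COPIED set because
     * its refcount was exactly 1 before the deduplication hit. */
    uint64_t l2_entry = 0x50000ULL | QCOW_OFLAG_COPIED;

    /* I.3.a: the cluster is now referenced twice, so COPIED must go away. */
    l2_entry &= ~QCOW_OFLAG_COPIED;

    printf("L2 entry is now 0x%llx\n", (unsigned long long)l2_entry);
    return 0;
}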

I.4) Duplicated clusters further writes

Same as I.3 without I.3.a and I.3.b.

I.5) cluster removal
When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
refcount.
When the refcount reaches zero, the L2 hash block of the stale cluster
is written to clear the hash.
This happens often and requires the second GTree to find the hash by its
physical sector number.

I.6) max refcount reached
The L2 hash block of the cluster is written in order to remember at the next
startup that it must not be used anymore for deduplication. The hash is dropped
from the GTrees.


II) Disk based lookup additional overhead

My initial idea is to replace the RAM-based GTree indexed by hash
with a prefix b-tree stored on disk.
ict.pue.udlap.mx/people/carlos/is215/papers/p11-bayer.pdf
As hashes are mostly random, the prefix tree should work well.

An additional data structure would be needed to do the lookups by
physical offsets.
This additional data structure is clearly a bottleneck in this design.

Each lookup will cost O(log n) disk I/Os.
Insertion and deletion can cost twice as many I/Os (leaves and nodes need to be
split and merged).
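
To put rough numbers on this (a back-of-the-envelope estimate on my side, not a
measurement): a 1TB image of 4KB clusters holds 2^28 (about 268 million) hashes;
with, say, 256 entries per b-tree node the tree is only ceil(log256(2^28)) = 4
levels deep, so a cold lookup is on the order of 4 random reads, and an
insertion that triggers splits can roughly double that.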

II.1) First write overhead

An O(log n) lookup in the b-tree to find the hash.

Disk lookups by physical sector to remove the stale hash node.
(Which structure to use for this?)

When the hash is written, the leaf of the failed lookup can be reused.
If the leaf is full, splitting the leaf and restructuring the tree must be done
in O(log n).

Update of a not-yet-defined way to look up b-tree leaves by their offset on disk.
(potential allocating write)

II.2) Same cluster rewrite at the same place
lookup in the b-tree to find the hash

II.3) First duplicated cluster write.
lookup in the b-tree to find the hash

The leaf of the b-tree must be overwritten to remember that we have cleared
QCOW_OFLAG_COPIED.

II.4) Duplicated cluster further writes
lookup in the b-tree to find the hash

II.5) cluster removal
A query to find the hash b-tree leaf by sector must be done on a not-yet-defined
data structure.

The hash must be removed from the b-tree leaf.

Two methods could be used to bypass the need for this additional data structure:
-Read the cluster, compute its hash and use the hash for the removal.
-Store hash and cluster forever while setting their refcount to a special value
meaning "reserved for future reuse".

II.6) max refcount reached

The RAM implementation just drops the hash corresponding to the written cluster
and marks on disk that this hash must not be reloaded. A duplicate hash is
then created.

I have not yet found how to handle this with a b-tree, as it is not supposed
to contain duplicate keys.

Regards

Benoît


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 15:24 [Qemu-devel] QCOW2 deduplication design Benoît Canet
@ 2013-01-09 16:16 ` Stefan Hajnoczi
  2013-01-09 16:32   ` Eric Blake
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2013-01-09 16:16 UTC (permalink / raw)
  To: "efcount."; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On Wed, Jan 9, 2013 at 4:24 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
> Here is a mail to open a discussion on QCOW2 deduplication design and
> performance.
>
> The current deduplication strategy is RAM based.
> One of the goals of the project is to plan and implement an alternative way to do
> the lookups from disk for bigger images.
>
>
> In the first section I will enumerate the disk overheads of the RAM-based lookup
> strategy, and in the second section the additional costs of doing lookups in a
> disk-based prefix b-tree.
>
> Comments and suggestions are welcome.
>
> I) RAM based lookups overhead
>
> The qcow2 read path is not modified by the deduplication patchset.
>
> Each cluster written gets its hash computed.
>
> Two GTrees are used to give access to the hashes: one indexed by hash and
> the other indexed by physical offset.

What is the GTree indexed by physical offset used for?

> I.0) unaligned write
>
> When a write is unaligned or smaller than a 4KB cluster, the deduplication code
> issues one or two reads to get the missing data required to build a 4KB*n linear
> buffer.
> The deduplication metrics code shows that this situation does not happen with
> virtio and ext3 as a guest partition.

If the application uses O_DIRECT inside the guest you may see <4 KB
requests even on ext3 guest file systems.  But in the buffered I/O
case the file system will use 4 KB blocks or similar.

>
> I.1) First write overhead
>
> The hash is computed.
>
> The cluster is not a duplicate, so the hash is stored in a linked list.
>
> After that the writev call gets a new 64KB L2 dedup hash block corresponding to
> the physical sector of the written cluster.
> (This can be an allocating write, requiring the offset of the new block to be
> written in the dedup table and flushed.)
>
> The hash is written in the L2 dedup hash block and flushed later by the
> dedup_block_cache.
>
> I.2) Same cluster rewrite at the same place
>
> The hash is computed
>
> qcow2_get_cluster_offset is called and the result is used to check that it is a
> rewrite.
>
> The cluster is counted as duplicated and is not rewritten on disk.

This case is when identical data is rewritten in place?  No writes are
required - this is the scenario where online dedup is faster than
non-dedup because we avoid I/O entirely.

>
> I.3) First duplicated cluster write
>
> The hash is computed
>
> qcow2_get_cluster_offset is called and we see that we are not rewriting the same
> cluster at the same place
>
> I.3.a) The L2 entry of the first cluster written with this hash is overwritten
> to remove the QCOW_OFLAG_COPIED flag.
>
> I.3.b) the dedup hash block of the hash is overwritten to remember at the next
> startup that QCOW_OFLAG_COPIED has been cleared.
>
> A new L2 entry is created for this logical sector pointing to the physical
> cluster. (potential allocating write)
>
> the refcount of the physical cluster is updated
>
> I.4) Duplicated clusters further writes
>
> Same as I.3 without I.3.a and I.3.b.
>
> I.5) cluster removal
> When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
> refcount.
> When the refcount reaches zero, the L2 hash block of the stale cluster
> is written to clear the hash.
> This happens often and requires the second GTree to find the hash by its
> physical sector number.

This happens often?  I'm surprised.  Thought this only happens when
you delete snapshots or resize the image file?  Maybe I misunderstood
this case.

> I.6) max refcount reached
> The L2 hash block of the cluster is written in order to remember at next startup
> that it must not be used anymore for deduplication. The hash is dropped from the
> gtrees.

Interesting case.  This means you can no longer take snapshots
containing this cluster because we cannot track references :(.

Worst case: guest fills the disk with the same 4 KB data (e.g.
zeroes).  There is only a single data cluster but the refcount is
maxed out.  Now it is not possible to take a snapshot.

Stefan


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 16:16 ` Stefan Hajnoczi
@ 2013-01-09 16:32   ` Eric Blake
  2013-01-10  6:59     ` Stefan Hajnoczi
  2013-01-09 16:40   ` Benoît Canet
  2013-01-09 20:57   ` Benoît Canet
  2 siblings, 1 reply; 9+ messages in thread
From: Eric Blake @ 2013-01-09 16:32 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel,
	"efcount."


On 01/09/2013 09:16 AM, Stefan Hajnoczi wrote:

>> I.6) max refcount reached
>> The L2 hash block of the cluster is written in order to remember at next startup
>> that it must not be used anymore for deduplication. The hash is dropped from the
>> gtrees.
> 
> Interesting case.  This means you can no longer take snapshots
> containing this cluster because we cannot track references :(.
> 
> Worst case: guest fills the disk with the same 4 KB data (e.g.
> zeroes).  There is only a single data cluster but the refcount is
> maxed out.  Now it is not possible to take a snapshot.

Except that a sector of all zeroes should be represented as a hole,
rather than occupying a cluster (that is, since all zeros is the most
likely value for very common data, and also has a special case
representation that doesn't take any clusters at all, we should take
advantage of that).  But your point remains for a disk that reuses
another common 4KB cluster more than the refcount allows (for example,
how many copies of COPYING do you have on your disk?).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 16:16 ` Stefan Hajnoczi
  2013-01-09 16:32   ` Eric Blake
@ 2013-01-09 16:40   ` Benoît Canet
  2013-01-10  8:16     ` Stefan Hajnoczi
  2013-01-09 20:57   ` Benoît Canet
  2 siblings, 1 reply; 9+ messages in thread
From: Benoît Canet @ 2013-01-09 16:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

> 
> What is the GTree indexed by physical offset used for?

It's used for two things: deletion and loading of the hashes.

-Deletion is a hook in the refcount code that triggers when zero is reached.
 The only information the code gets is the physical offset of the cluster about
to be discarded. The hash must be cleared on disk, so a lookup by offset is done
(see the sketch below).
Another way would be to read the deleted cluster, compute its hash and use the
result to delete the hash from disk. That seems a heavy procedure.

-When the hashes are loaded at startup, another cluster written at the same
physical place can create another hash superseding the first one.
The by-offset tree is used in this case to keep the most recent hash for a given
cluster in memory.
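
A minimal sketch of that deletion-by-offset path (illustrative names and an
assumed 32-byte hash; not code from the patchset):

#include <glib.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32

typedef struct {
    uint8_t  hash[HASH_LEN];
    uint64_t phys_offset;
} DedupNode;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, HASH_LEN);
}

static gint offset_cmp(gconstpointer a, gconstpointer b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return x < y ? -1 : x > y ? 1 : 0;
}

/* Called when a cluster's refcount drops to zero; we only know its offset. */
static void dedup_forget_cluster(GTree *by_hash, GTree *by_offset,
                                 uint64_t phys_offset)
{
    DedupNode *node = g_tree_lookup(by_offset, &phys_offset);
    if (!node) {
        return;                          /* cluster was never hashed */
    }
    /* Here the on-disk hash block would also be updated to clear the hash. */
    g_tree_remove(by_hash, node->hash);
    g_tree_remove(by_offset, &node->phys_offset);
    g_free(node);
}

int main(void)
{
    GTree *by_hash   = g_tree_new(hash_cmp);
    GTree *by_offset = g_tree_new(offset_cmp);

    DedupNode *n = g_new0(DedupNode, 1);
    n->phys_offset = 0x10000;
    memset(n->hash, 0xab, HASH_LEN);
    g_tree_insert(by_hash, n->hash, n);
    g_tree_insert(by_offset, &n->phys_offset, n);

    dedup_forget_cluster(by_hash, by_offset, 0x10000);
    return g_tree_lookup(by_offset, &(uint64_t){0x10000}) == NULL ? 0 : 1;
}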

> > When a write is unaligned or smaller than a 4KB cluster, the deduplication code
> > issues one or two reads to get the missing data required to build a 4KB*n linear
> > buffer.
> > The deduplication metrics code shows that this situation does not happen with
> > virtio and ext3 as a guest partition.
> 
> If the application uses O_DIRECT inside the guest you may see <4 KB
> requests even on ext3 guest file systems.  But in the buffered I/O
> case the file system will use 4 KB blocks or similar.

This means we can expect bad performance with some kinds of workloads.

> > The cluster is counted as duplicated and is not rewritten on disk.
> 
> This case is when identical data is rewritten in place?  No writes are
> required - this is the scenario where online dedup is faster than
> non-dedup because we avoid I/O entirely.

Yes, but experiments show that dedup is always faster. It goes exactly
at the storage's native speed.

> > I.5) cluster removal
> > When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
> > refcount.
> > When the refcount reaches zero, the L2 hash block of the stale cluster
> > is written to clear the hash.
> > This happens often and requires the second GTree to find the hash by its
> > physical sector number.
> 
> This happens often?  I'm surprised.  Thought this only happens when
> you delete snapshots or resize the image file?  Maybe I misunderstood
> this case.

Yes, the preliminary metrics code shows that cluster removal happens often.
Maybe some recurrent filesystem structure is written to disk first and
overwritten (inode skeleton, or journal zeroing).

> > I.6) max refcount reached
> > The L2 hash block of the cluster is written in order to remember at next startup
> > that it must not be used anymore for deduplication. The hash is dropped from the
> > gtrees.
> 
> Interesting case.  This means you can no longer take snapshots
> containing this cluster because we cannot track references :(.
> 
> Worst case: guest fills the disk with the same 4 KB data (e.g.
> zeroes).  There is only a single data cluster but the refcount is
> maxed out.  Now it is not possible to take a snapshot.

Maybe I could just lower the dedup max refcount leaving room for snapshots.
It would need a way to differentiate the snapshot case in the hook code path.

Regards

Benoît


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 16:16 ` Stefan Hajnoczi
  2013-01-09 16:32   ` Eric Blake
  2013-01-09 16:40   ` Benoît Canet
@ 2013-01-09 20:57   ` Benoît Canet
  2 siblings, 0 replies; 9+ messages in thread
From: Benoît Canet @ 2013-01-09 20:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

> > Two GTrees are used to give access to the hashes: one indexed by hash and
> > the other indexed by physical offset.
> 
> What is the GTree indexed by physical offset used for?

I think I can get rid of the second GTree for RAM-based deduplication.
It needs to:

-Start qcow2 with the deduplication operation turned off while loading hashes.
 So there is no need to check for new conflicting hashes.
 Dedup will be resumed after loading completes.

-For cluster removal and overflow the hash can be read for free from the
 deduplication table using the physical cluster index. Then the GTree can be
 manipulated with the hash.

I'll try to implement this tomorrow.

Best regards

Benoît


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 16:32   ` Eric Blake
@ 2013-01-10  6:59     ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2013-01-10  6:59 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel,
	"efcount."

On Wed, Jan 9, 2013 at 5:32 PM, Eric Blake <eblake@redhat.com> wrote:
> On 01/09/2013 09:16 AM, Stefan Hajnoczi wrote:
>
>>> I.6) max refcount reached
>>> The L2 hash block of the cluster is written in order to remember at next startup
>>> that it must not be used anymore for deduplication. The hash is dropped from the
>>> gtrees.
>>
>> Interesting case.  This means you can no longer take snapshots
>> containing this cluster because we cannot track references :(.
>>
>> Worst case: guest fills the disk with the same 4 KB data (e.g.
>> zeroes).  There is only a single data cluster but the refcount is
>> maxed out.  Now it is not possible to take a snapshot.
>
> Except that a sector of all zeroes should be represented as a hole,

QEMU doesn't scan data buffers for all zero by default.

Stefan


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-09 16:40   ` Benoît Canet
@ 2013-01-10  8:16     ` Stefan Hajnoczi
  2013-01-10 15:18       ` Benoît Canet
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Hajnoczi @ 2013-01-10  8:16 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On Wed, Jan 9, 2013 at 5:40 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
>> > I.5) cluster removal
>> > When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
>> > refcount.
>> > When the refcount reaches zero, the L2 hash block of the stale cluster
>> > is written to clear the hash.
>> > This happens often and requires the second GTree to find the hash by its
>> > physical sector number.
>>
>> This happens often?  I'm surprised.  Thought this only happens when
>> you delete snapshots or resize the image file?  Maybe I misunderstood
>> this case.
>
> Yes, the preliminary metrics code shows that cluster removal happens often.
> Maybe some recurrent filesystem structure is written to disk first and
> overwritten (inode skeleton, or journal zeroing).

Now I understand.  This case covers overwriting existing data with new
contents.  That is common :).

But are you seeing a cluster with refcount > 1 being overwritten
often?  If so, it's worth looking into why that happens.  It may be a
common pattern for certain file systems or applications to write
initial data 'A' first and then change it later.  This actually
suggests against online dedup, or at least for something like qcow2
delayed write where we don't "commit" yet because the guest will
probably still modify or append to the data.

If the initial write with data 'A' is usually followed up by a rewrite
shortly afterwards it may be possible to use a timer to dedup after a
delay.
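
Roughly, the data structure side of that could look like this (all names are
hypothetical and this is only to make the idea concrete; the QEMU timer that
would periodically drive the flush is left out):

/* Park freshly written hashes in a FIFO with a deadline; a periodic
 * callback only commits entries whose cluster was not rewritten meanwhile. */
#include <glib.h>
#include <stdint.h>

typedef struct {
    uint64_t phys_offset;
    gint64   deadline_us;     /* when this hash becomes eligible for dedup */
    gboolean rewritten;       /* set if the guest touched the cluster again */
} PendingHash;

#define DEDUP_DELAY_US (2 * G_USEC_PER_SEC)

static void pending_add(GQueue *pending, uint64_t phys_offset)
{
    PendingHash *p = g_new0(PendingHash, 1);
    p->phys_offset = phys_offset;
    p->deadline_us = g_get_monotonic_time() + DEDUP_DELAY_US;
    g_queue_push_tail(pending, p);
}

/* Called from a periodic timer: commit everything whose deadline passed. */
static void pending_flush_expired(GQueue *pending)
{
    gint64 now = g_get_monotonic_time();
    PendingHash *p;

    while ((p = g_queue_peek_head(pending)) && p->deadline_us <= now) {
        g_queue_pop_head(pending);
        if (!p->rewritten) {
            /* ...insert the hash into the dedup index here... */
        }
        g_free(p);
    }
}

int main(void)
{
    GQueue *pending = g_queue_new();
    pending_add(pending, 0x10000);
    pending_flush_expired(pending);      /* nothing has expired yet */
    g_queue_free_full(pending, g_free);
    return 0;
}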

Stefan


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-10  8:16     ` Stefan Hajnoczi
@ 2013-01-10 15:18       ` Benoît Canet
  2013-01-10 15:28         ` Stefan Hajnoczi
  0 siblings, 1 reply; 9+ messages in thread
From: Benoît Canet @ 2013-01-10 15:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Benoît Canet, Kevin Wolf, qemu-devel, Stefan Hajnoczi,
	Paolo Bonzini

> Now I understand.  This case covers overwriting existing data with new
> contents.  That is common :).
> 
> But are you seeing a cluster with refcount > 1 being overwritten
> often?  If so, it's worth looking into why that happens.  It may be a
> common pattern for certain file systems or applications to write
> initial data 'A' first and then change it later.  This actually
> suggests against online dedup, or at least for something like qcow2
> delayed write where we don't "commit" yet because the guest will
> probably still modify or append to the data.

I apologize for the bogus information I gave earlier.

The deduplication metrics accounting code was confusing the delete cluster
operation with the more common hash removal from tree operation.
After fixing the metrics code, common file manipulations on the guest only
generate a few delete cluster operations.

The cases where a lot of clusters are deleted are when the image is overwritten
with zeroes and when reformatting a partition with ext3.

Regards

Benoît


* Re: [Qemu-devel] QCOW2 deduplication design
  2013-01-10 15:18       ` Benoît Canet
@ 2013-01-10 15:28         ` Stefan Hajnoczi
  0 siblings, 0 replies; 9+ messages in thread
From: Stefan Hajnoczi @ 2013-01-10 15:28 UTC (permalink / raw)
  To: Benoît Canet; +Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

On Thu, Jan 10, 2013 at 4:18 PM, Benoît Canet <benoit.canet@irqsave.net> wrote:
>> Now I understand.  This case covers overwriting existing data with new
>> contents.  That is common :).
>>
>> But are you seeing a cluster with refcount > 1 being overwritten
>> often?  If so, it's worth looking into why that happens.  It may be a
>> common pattern for certain file systems or applications to write
>> initial data 'A' first and then change it later.  This actually
>> suggests against online dedup, or at least for something like qcow2
>> delayed write where we don't "commit" yet because the guest will
>> probably still modify or append to the data.
>
> I apologize for the bogus information I gave earlier.
>
> The deduplication metrics accounting code was confusing the delete cluster
> operation with the more common hash removal from tree operation.
> After fixing the metrics code, common file manipulations on the guest only
> generate a few delete cluster operations.
>
> The cases where a lot of clusters are deleted are when the image is overwritten
> with zeroes and when reformatting a partition with ext3.

Eric raised a good point with zero detection.  Maybe we should turn it
on when dedup is enabled since we'll be scanning the data buffer
anyway.  It places special zero cluster markers in the L2 entry and
never stores zeroes at all.

The advantage of doing this is that we don't dedup zero clusters and
never hit the refcount limits on this relatively common operation.
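
For illustration, a naive version of the check (in practice QEMU's own
optimized zero-detection helper would be used; this is just to show how cheap
the test is once the buffer is already in hand for hashing):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static bool cluster_is_zero(const uint8_t *buf, size_t len)
{
    /* buf is all zeroes iff buf[0] == 0 and every byte equals its successor. */
    return len == 0 || (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}

int main(void)
{
    uint8_t cluster[4096] = {0};

    if (cluster_is_zero(cluster, sizeof(cluster))) {
        /* Write a zero-cluster L2 marker instead of hashing/deduplicating. */
        printf("zero cluster: skip dedup, no data written\n");
    }
    return 0;
}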

Stefan

