Re: [Qemu-devel] QCOW2 deduplication design

From: "Benoît Canet" <benoit.canet@irqsave.net>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] QCOW2 deduplication design
Date: Wed, 9 Jan 2013 17:40:14 +0100
Message-ID: <20130109163928.GC3494@irqsave.net>
In-Reply-To: <CAJSP0QVHY2XLyUw6-pGt8pf_S9XP0xe_e1LKKP-Yf3UvakxASQ@mail.gmail.com>

> 
> What is the GTree indexed by physical offset used for?

It's used for two things: deletion and loading of the hashes.

- Deletion is a hook in the refcount code that triggers when the refcount
reaches zero. The only information the code gets is the physical offset of the
cluster about to be discarded. The hash must be cleared on disk, so a lookup by
offset is done. Another way would be to read the deleted cluster, compute its
hash and use the result to delete the hash from disk. That seems a heavy
procedure.

- When the hashes are loaded at startup, another cluster written at the same
physical place can create another hash superseding the first one.
The by-offset tree is used in this case to keep only the most recent hash for a
given cluster in memory.
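
To make the two uses concrete, here is a minimal sketch in Python rather than
the actual C code. Plain dicts stand in for the two GTrees, and the class name,
method names and the choice of SHA-256 are assumptions made for illustration
only, not what the patches do.

```python
import hashlib

CLUSTER_SIZE = 4096  # 4 KB dedup clusters, as discussed in this thread


def cluster_hash(data):
    # Assumption: SHA-256 is used here only as an example hash function.
    return hashlib.sha256(data).digest()


class HashIndex:
    """Two in-memory views of the same hash records (dicts stand in for GTrees)."""

    def __init__(self):
        self.by_hash = {}    # digest -> physical offset
        self.by_offset = {}  # physical offset -> digest

    def insert(self, digest, offset):
        # Loading case: a newer hash for the same physical cluster
        # supersedes the older one, so drop the stale record first.
        stale = self.by_offset.get(offset)
        if stale is not None and self.by_hash.get(stale) == offset:
            del self.by_hash[stale]
        self.by_hash[digest] = offset
        self.by_offset[offset] = digest

    def on_refcount_zero(self, offset):
        # Deletion case: the refcount hook only knows the physical offset,
        # so the hash has to be found through the by-offset view.
        digest = self.by_offset.pop(offset, None)
        if digest is not None and self.by_hash.get(digest) == offset:
            del self.by_hash[digest]
        return digest  # the caller would then clear this hash on disk
```

Without the by-offset view, on_refcount_zero() would have to read the cluster
back and rehash it, which is the heavier alternative mentioned above.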

> > When a write is unaligned or smaller than a 4KB cluster, the deduplication code
> > issues one or two reads to get the missing data required to build a 4KB*n linear
> > buffer.
> > The deduplication metrics code shows that this situation doesn't happen with virtio
> > and ext3 as a guest partition.
> 
> If the application uses O_DIRECT inside the guest you may see <4 KB
> requests even on ext3 guest file systems.  But in the buffered I/O
> case the file system will use 4 KB blocks or similar.

This means we can expect bad performance with some kinds of loads.
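
For reference, the extra reads for such a request could be computed roughly
like this. It is only a sketch in Python under the 4 KB cluster assumption;
padding_reads() is a hypothetical helper, not a function from the patches.

```python
CLUSTER_SIZE = 4096  # 4 KB dedup clusters


def padding_reads(offset, length, cluster_size=CLUSTER_SIZE):
    """Return the (offset, length) reads needed to pad a write to whole clusters."""
    if length <= 0:
        return []
    end_of_write = offset + length
    # Round the start down and the end up to cluster boundaries.
    start = offset - offset % cluster_size
    end = -(-end_of_write // cluster_size) * cluster_size
    reads = []
    if offset > start:
        # Head: data between the cluster boundary and the start of the write.
        reads.append((start, offset - start))
    if end_of_write < end:
        # Tail: data between the end of the write and the next boundary.
        reads.append((end_of_write, end - end_of_write))
    return reads
```

An aligned 4 KB write needs no read, while a 1 KB O_DIRECT write at offset 512
needs both a head and a tail read before the cluster can be hashed.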

> > The cluster is counted as duplicated and not rewritten on disk
> 
> This case is when identical data is rewritten in place?  No writes are
> required - this is the scenario where online dedup is faster than
> non-dedup because we avoid I/O entirely.

Yes, but experiments show that dedup is always faster. It runs at exactly
the native speed of the storage.
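
The decision on the write path can be sketched as follows. This is a
simplified Python illustration, not the real code: dedup_write() and its
parameters are hypothetical, SHA-256 is an assumed hash, and releasing the
reference on the previously mapped cluster is left out.

```python
import hashlib


def dedup_write(data, current_offset, offset_by_hash, refcounts, allocate, write):
    """Return (physical_offset, wrote_to_disk) for one cluster-sized write.

    current_offset is the cluster the L2 entry already points to, or None.
    allocate() and write(offset, data) are callbacks supplied by the caller.
    """
    digest = hashlib.sha256(data).digest()
    found = offset_by_hash.get(digest)
    if found is not None:
        # Duplicate data: no I/O. Only take a new reference if this L2 entry
        # does not already point to the cluster (rewrite in place).
        if found != current_offset:
            refcounts[found] += 1
        return found, False
    # New data: allocate a cluster, write it, and remember its hash.
    offset = allocate()
    write(offset, data)
    offset_by_hash[digest] = offset
    refcounts[offset] = 1
    return offset, True
```

In both duplicate branches the data write is skipped entirely, which is where
the saving comes from.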

> > I.5) cluster removal
> > When an L2 entry to a cluster becomes stale, the qcow2 code decrements the
> > refcount.
> > When the refcount reaches zero, the L2 hash block of the stale cluster
> > is written to clear the hash.
> > This happens often and requires the second GTree to find the hash by its physical
> > sector number
> 
> This happens often?  I'm surprised.  Thought this only happens when
> you delete snapshots or resize the image file?  Maybe I misunderstood
> this case.

Yes, the preliminary metrics code shows that cluster removal happens often.
Maybe some recurrent filesystem structure is written to disk first and
then overwritten (inode skeleton, or journal zeroing).

> > I.6) max refcount reached
> > The L2 hash block of the cluster is written in order to remember at next startup
> > that it must not be used anymore for deduplication. The hash is dropped from the
> > gtrees.
> 
> Interesting case.  This means you can no longer take snapshots
> containing this cluster because we cannot track references :(.
> 
> Worst case: guest fills the disk with the same 4 KB data (e.g.
> zeroes).  There is only a single data cluster but the refcount is
> maxed out.  Now it is not possible to take a snapshot.

Maybe I could just lower the dedup max refcount, leaving room for snapshots.
It would need a way to differentiate the snapshot case in the hook code path.
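
Something along these lines, sketched in Python. The 16-bit refcount width,
the size of the reserve and all the names are assumptions for the sake of the
example; the real values would have to come from the qcow2 code.

```python
REFCOUNT_MAX = 0xFFFF        # assumption: 16-bit refcount entries
SNAPSHOT_HEADROOM = 0x00FF   # hypothetical reserve kept for snapshots
DEDUP_MAX_REFCOUNT = REFCOUNT_MAX - SNAPSHOT_HEADROOM


def can_take_reference(refcount, for_snapshot):
    """Tell whether one more reference may be taken on a cluster.

    Dedup stops sharing a cluster at the lower limit, so that snapshot
    creation can still increment the refcount up to the hard maximum.
    """
    limit = REFCOUNT_MAX if for_snapshot else DEDUP_MAX_REFCOUNT
    return refcount < limit
```

The for_snapshot flag is the part that needs the hook to know which code path
it was called from.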

Regards

Benoît
