From: Benoît Canet
Date: Wed, 9 Jan 2013 17:40:14 +0100
Subject: Re: [Qemu-devel] QCOW2 deduplication design
Message-ID: <20130109163928.GC3494@irqsave.net>
References: <20130109152443.GB3494@irqsave.net>
To: Stefan Hajnoczi
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel, Stefan Hajnoczi

>
> What is the GTree indexed by physical offset used for?

It's used for two things: deletion and loading of the hashes.

- Deletion is a hook in the refcount code that triggers when zero is
  reached. The only information the code has at that point is the
  physical offset of the yet-to-be-discarded cluster. The hash must be
  cleared on disk, so a lookup by offset is done.
  Another way would be to read the deleted cluster, compute its hash and
  use the result to delete the hash from disk. That seems a heavy
  procedure.

- When the hashes are loaded at startup, another cluster written at the
  same physical place can create another hash superseding the first one.
  The by-offset tree is used in this case to keep the most recent hash
  for a given cluster in memory.

> > when a write is unaligned or smaller than a 4KB cluster the
> > deduplication code issues one or two reads to get the missing data
> > required to build a 4KB*n linear buffer.
> > The deduplication metrics code shows that this situation doesn't
> > happen with virtio and ext3 as a guest partition.
>
> If the application uses O_DIRECT inside the guest you may see <4 KB
> requests even on ext3 guest file systems. But in the buffered I/O
> case the file system will use 4 KB blocks or similar.

This means we can expect bad performance with some kinds of workloads.

> > The cluster is counted as duplicated and not rewritten on disk
>
> This case is when identical data is rewritten in place? No writes are
> required - this is the scenario where online dedup is faster than
> non-dedup because we avoid I/O entirely.

Yes, but experiments show that dedup is always faster: it goes at
exactly the storage's native speed.

> > I.5) cluster removal
> > When an L2 entry to a cluster becomes stale the qcow2 code
> > decrements the refcount.
> > When the refcount reaches zero the L2 hash block of the stale
> > cluster is written to clear the hash.
> > This happens often and requires the second GTree to find the hash
> > by its physical sector number
>
> This happens often? I'm surprised. I thought this only happens when
> you delete snapshots or resize the image file? Maybe I misunderstood
> this case.

Yes, the preliminary metrics code shows that cluster removal happens
often. Maybe some recurrent filesystem structure is written to disk
first and then overwritten (inode skeleton, or journal zeroing).
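To make the two-tree bookkeeping more concrete, here is a rough sketch
of what I have in mind (illustrative only, not the actual patch code:
the structure, the function names and the 32-byte hash size are made
up), using GLib's GTree:

#include <glib.h>
#include <stdint.h>
#include <string.h>

#define DEDUP_HASH_LEN 32            /* illustrative hash size */

/* In-memory record for one deduplicated cluster (names are made up). */
typedef struct QCowHashNode {
    uint8_t hash[DEDUP_HASH_LEN];    /* hash of the 4KB cluster data        */
    uint64_t physical_sector;        /* where the cluster lives in the file */
} QCowHashNode;

static gint hash_cmp(gconstpointer a, gconstpointer b)
{
    return memcmp(a, b, DEDUP_HASH_LEN);
}

static gint sector_cmp(gconstpointer a, gconstpointer b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return x < y ? -1 : (x > y ? 1 : 0);
}

/* One tree keyed by hash (dedup lookup on writes), one keyed by
 * physical sector (deletion hook and startup loading). */
static GTree *tree_by_hash;
static GTree *tree_by_sector;

static void dedup_init(void)
{
    tree_by_hash = g_tree_new(hash_cmp);
    tree_by_sector = g_tree_new(sector_cmp);
}

/* Startup loading: a later hash for the same physical sector supersedes
 * the earlier one, so drop the old node from both trees first. */
static void dedup_load_hash(QCowHashNode *node)
{
    QCowHashNode *old = g_tree_lookup(tree_by_sector, &node->physical_sector);
    if (old) {
        g_tree_remove(tree_by_hash, old->hash);
        g_tree_remove(tree_by_sector, &old->physical_sector);
        g_free(old);
    }
    g_tree_insert(tree_by_hash, node->hash, node);
    g_tree_insert(tree_by_sector, &node->physical_sector, node);
}

/* Refcount hook: when a cluster's refcount reaches zero the only thing
 * we know is its physical offset, so the by-sector tree gives us back
 * the node whose hash must be cleared on disk and forgotten. */
static void dedup_refcount_reached_zero(uint64_t physical_sector)
{
    QCowHashNode *node = g_tree_lookup(tree_by_sector, &physical_sector);
    if (!node) {
        return;
    }
    /* write the L2 hash block entry here to clear the on-disk hash */
    g_tree_remove(tree_by_hash, node->hash);
    g_tree_remove(tree_by_sector, &node->physical_sector);
    g_free(node);
}

The point is that dedup_load_hash() resolves the superseding-hash case
at startup, and dedup_refcount_reached_zero() only needs the physical
offset handed to it by the refcount hook.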
> > I.6) max refcount reached
> > The L2 hash block of the cluster is written in order to remember at
> > next startup that it must not be used anymore for deduplication.
> > The hash is dropped from the gtrees.
>
> Interesting case. This means you can no longer take snapshots
> containing this cluster because we cannot track references :(.
>
> Worst case: guest fills the disk with the same 4 KB data (e.g.
> zeroes). There is only a single data cluster but the refcount is
> maxed out. Now it is not possible to take a snapshot.

Maybe I could just lower the dedup max refcount, leaving room for
snapshots. It would need a way to differentiate the snapshot case in
the hook code path.

Regards

Benoît
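P.S.: lowering the dedup max refcount could be as simple as checking a
lower ceiling only in the dedup path (sketch with made-up names and an
arbitrary headroom value; the snapshot code would keep using the full
16-bit limit):

#include <stdbool.h>
#include <stdint.h>

/* qcow2 refcount entries are 16 bits wide, so 0xffff is the hard format
 * limit.  DEDUP_MAX_REFCOUNT is a hypothetical lower ceiling that
 * leaves headroom for snapshot-driven refcount increments. */
#define QCOW2_REFCOUNT_MAX   0xffffu
#define SNAPSHOT_HEADROOM    1024u     /* made-up value */
#define DEDUP_MAX_REFCOUNT   (QCOW2_REFCOUNT_MAX - SNAPSHOT_HEADROOM)

/* Hypothetical helper called from the dedup path before pointing a new
 * L2 entry at an already-stored cluster; snapshots are unaffected. */
static bool dedup_can_reuse_cluster(uint16_t current_refcount)
{
    return current_refcount < DEDUP_MAX_REFCOUNT;
}

The open question stays the same: the hook still needs to know whether
an increment comes from deduplication or from a snapshot.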