Re: Offline Deduplication for Btrfs

From: Peter A <loony@loonybin.org>
To: linux-btrfs@vger.kernel.org
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 6 Jan 2011 11:11:35 -0500
Message-ID: <201101061111.35134.loony@loonybin.org>
In-Reply-To: <4D25DA97.9010705@bobich.net>

On Thursday, January 06, 2011 10:07:03 am you wrote:
> I'd be interested to see the evidence of the "variable length" argument.
> I have a sneaky suspicion that it actually falls back to 512 byte
> blocks, which are much more likely to align, when more sensibly sized
> blocks fail. The downside is that you don't really want to store a 32
> byte hash key with every 512 bytes of data, so you could peel off 512
> byte blocks off the front in a hope that a bigger block that follows
> will match.
> 
> Thinking about it, this might actually not be too expensive to do. If
> the 4KB block doesn't match, check 512 byte sub-blocks, and try peeling
> them, to make the next one line up. If it doesn't, store the mismatch as
> a full 4KB block and resume. If you do find a match, save the peeled 512
> byte blocks separately and dedupe the 4KB block.
> 
> In fact, it's rather like the loop peeling optimization on a compiler,
> that allows you to align the data to the boundary suitable for vectorizing.
I'm not sure about this, but to be honest I cannot see any other way.
Otherwise, how would you ever find a match? You cannot store checksums of
random sub-sections hoping that eventually stuff will match up... 512 bytes is
probably the best choice, as it's been a "known block size" since the dawn of
Unix.
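
To make the peeling idea quoted above concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions, not what any
existing dedup implementation does: the names (digest, find_peel, dedupe), the
use of SHA-256 as the 32-byte hash, and the in-memory set used as the index
are all made up for the example. When a 4KB block does not match, it tries
skipping one to seven 512-byte sub-blocks to see whether the next 4KB block
lines up; if nothing lines up, it stores the full 4KB block and resumes.

```python
import hashlib

BLOCK = 4096  # main dedup block size (the 4KB block from the discussion)
SUB = 512     # peel granularity (the 512-byte sub-block)


def digest(chunk: bytes) -> bytes:
    # Assumption: SHA-256, which happens to give a 32-byte key.
    return hashlib.sha256(chunk).digest()


def find_peel(data: bytes, pos: int, index: set[bytes]):
    """Return how many bytes to peel (0, 512, ..., 3584) so that the
    4KB block that follows is already in the index, or None."""
    for peel in range(0, BLOCK, SUB):
        start = pos + peel
        if start + BLOCK > len(data):
            return None  # not enough data left for a full block
        if digest(data[start:start + BLOCK]) in index:
            return peel
    return None


def dedupe(data: bytes, index: set[bytes]) -> list[tuple[str, bytes]]:
    """Split data into ("ref", hash) and ("raw", bytes) segments.
    Hypothetical output format, chosen only for this sketch."""
    out = []
    pos = 0
    while pos + BLOCK <= len(data):
        peel = find_peel(data, pos, index)
        if peel is None:
            # No alignment found: store the mismatch as a full 4KB block.
            block = data[pos:pos + BLOCK]
            index.add(digest(block))
            out.append(("raw", block))
            pos += BLOCK
        else:
            # Save the peeled sub-blocks separately, dedupe the 4KB block.
            if peel:
                out.append(("raw", data[pos:pos + peel]))
            out.append(("ref", digest(data[pos + peel:pos + peel + BLOCK])))
            pos += peel + BLOCK
    if pos < len(data):
        out.append(("raw", data[pos:]))  # short tail, stored as-is
    return out
```

With this sketch, if three 4KB blocks are indexed first and the same data is
then seen again with 512 bytes inserted at the front, the second pass stores
the 512 inserted bytes raw and references all three blocks. A fixed 4KB grid
would match none of them. The cost is up to eight hash computations per
unmatched block instead of one, which is where the CPU question comes in.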


> Actually, see above - I believe I was wrong about how expensive
> "variable length" block size is likely to be. It's more expensive, sure,
> but not orders of magnitude more expensive, and as discussed earlier,
> given the CPU isn't really the key bottleneck here, I think it'd be
> quite workable.
Hmm - from my work with DD systems, it seems to be the CPU that ends up being 
the limiting factor on dedupe performance... But again, it seems that you have 
much deeper insight into this topic and have already drawn up an algorithm in 
your mind :)

Peter.

-- 
Censorship: noun, circa 1591. a: Relief of the burden of independent thinking.
