From: Spelic <spelic@shiftmail.org>
To: Gordan Bobic <gordan@bobich.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 06 Jan 2011 02:19:04 +0100
Message-ID: <4D251888.7060508@shiftmail.org>
In-Reply-To: <4D24D8BC.90808@bobich.net>
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
> On 01/05/2011 07:46 PM, Josef Bacik wrote:
>
> Offline dedup is more expensive - so why are you of the opinion that
> it is less silly? And comparison by silliness quotient still sounds
> like an argument over which is better.
>
If I may give my opinion, I wouldn't want dedup to be enabled online for
the whole filesystem.
Three reasons:
1- Virtual machine disk images should not be deduplicated, IMHO, if you
care about performance, because fragmentation matters more in that case.
So offline dedup is preferable, IMHO. Or at least online dedup should
happen only on configured paths.
2- I don't want performance to drop all the time. I would run dedup
periodically during less active hours, hence offline. A rate limiter
should also be implemented so as not to thrash the drives too much. A
stop-and-continue mechanism should be implemented too, so that a dedup
run which couldn't finish within a certain time frame (e.g. one night)
can continue the following night without restarting from the beginning.
3- Only some directories should be deduped, for performance reasons. You
can foresee where duplicate blocks are likely to exist and where they
are not: backup directories, typically, or mail server directories. The
rest is probably a waste of time.
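Points 2 and 3 above could look roughly like the following sketch: a scan
restricted to a configured path, throttled by a byte-per-second budget, and
checkpointed so an unfinished run can resume the next night. All names here
are hypothetical illustration, not the proposed Btrfs tooling:

```python
import hashlib
import json
import os
import time

def scan_for_dup_hashes(root, checkpoint, block_size=4096,
                        max_bytes_per_sec=20 * 1024**2):
    """Walk `root`, hash each file block, record where each digest was
    first seen. Skips files already listed in the `checkpoint` JSON
    file (stop-and-continue) and sleeps to honor `max_bytes_per_sec`
    (rate limiting)."""
    done = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = set(json.load(f))
    hashes = {}  # digest -> first (path, offset) seen
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if path in done:
                continue  # finished in an earlier run
            start = time.monotonic()
            scanned = 0
            with open(path, "rb") as f:
                offset = 0
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    digest = hashlib.sha256(block).digest()
                    hashes.setdefault(digest, (path, offset))
                    offset += len(block)
                    scanned += len(block)
                    # crude rate limiter: sleep if ahead of the budget
                    budgeted = scanned / max_bytes_per_sec
                    elapsed = time.monotonic() - start
                    if budgeted > elapsed:
                        time.sleep(budgeted - elapsed)
            done.add(path)
            with open(checkpoint, "w") as f:
                json.dump(sorted(done), f)  # persist progress per file
    return hashes
```

A real implementation would checkpoint mid-file as well and feed the
candidate pairs to an extent-same ioctl; this only shows the scan shape.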
> Dedup isn't for an average desktop user. Dedup is for backup storage
> and virtual images
Not virtual images, IMHO, for the reason above.
Also, the OS is small even if it is identical across multiple virtual
images; how much space will it occupy anyway? Less than 5GB per disk
image, usually. And that is the only part that would be deduped, because
the data is likely to differ on each instance. How many VMs do you have
running? 20? That's at most 100GB saved, one time, at the cost of a lot
of fragmentation.
Now if you back up those images periodically, that's a place where I
would run dedup.
> I'd just make it always use the fs block size. No point in making it
> variable.
Agreed. What is the reason for a variable block size?
>> And then lets bring up the fact that you _have_ to manually compare
>> any data you
>> are going to dedup. I don't care if you think you have the greatest
>> hashing
>> algorithm known to man, you are still going to have collisions
>> somewhere at some
>> point, so in order to make sure you don't lose data, you have to
>> manually memcmp
>> the data.
Totally agreed.
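A minimal sketch of the verify step we both agree on (hypothetical names,
not the Btrfs ioctl code): the hash only nominates candidate pairs cheaply,
and the byte-for-byte compare, the userspace analogue of memcmp, is what
actually guarantees the blocks are identical before any extent is shared:

```python
import hashlib

def dedupable(block_a: bytes, block_b: bytes) -> bool:
    """Return True only if the two blocks may safely share an extent."""
    if hashlib.sha256(block_a).digest() != hashlib.sha256(block_b).digest():
        return False           # different hashes: definitely different data
    # Hashes match, but a collision is still possible in principle,
    # so compare the actual bytes before deduplicating.
    return block_a == block_b
```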
>> So if you are doing this online, that means reading back the copy you
>> want to dedup in the write path so you can do the memcmp before you
>> write. That
>> is going to make your write performance _suck_.
>
> IIRC, this is configurable in ZFS so that you can switch off the
> physical block comparison. If you use SHA256, the probability of a
> collision (unless SHA is broken, in which case we have much bigger
> problems) is 1 in 2^128. Times 4KB blocks, that is one collision in
> 10^24 Exabytes. That's one trillion trillion (a double trillion) Exabytes.
I like mathematics, but I don't care this time. I would never enable
dedup without a full block compare, and I think most users and most
companies would do the same.
With a full block compare, a simpler/faster algorithm could be chosen,
like MD5. Or even a 64-bit digest, which doesn't exist as such, but you
can take MD4 and XOR the first 8 bytes with the second 8 bytes to reduce
it to 8 bytes. This matters because it saves 60% of the RAM used during
dedup, which is expected to be large, and collisions are still
insignificant at 64 bits. Clearly you then still need to do a full block
compare after the hashes match.
BTW, if you want to allow (as an option) dedup without a full block
compare, SHA-1 is not so good: SHA-0 already had problems, and now SHA-1
has problems; I would hardly suggest it for cryptographically sensitive
uses going forward. Use RIPEMD-160, or better yet RIPEMD-256, which is
even faster according to http://www.cryptopp.com/benchmarks.html . The
RIPEMDs are much better algorithms than the SHAs: they have no known
weaknesses.
Note that deduplication IS a cryptographically sensitive matter: if
SHA-1 is cracked, people could nuke (or maybe even alter, and thereby
escalate privileges through) other users' files by providing blocks with
the same hash and waiting for dedup to pass.
Same for AES, BTW: it is showing weaknesses; use Blowfish or Twofish.
SHA-1 and AES are two wrong standards...
Dedup without a full block compare does seem suited to online dedup
(which I wouldn't enable, now for one more reason), because with full
block compares the performance would really suck. But please keep the
full block compare for offline dedup.
Also, I could suggest a third type of deduplication, but this one is
harder: a file-level deduplication which works like xdelta, i.e. one
capable of recognizing identical pieces of data in two files even when
they are not at the same offset and not even aligned to a block
boundary. For this, a rolling hash like rsync's, or the xdelta 3.0
algorithm, could be used. For this to work I suppose Btrfs would need to
handle padding of filesystem blocks... which I'm not sure was foreseen.
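The rolling-hash idea can be sketched as follows: an rsync-style weak
checksum over a sliding window, updated in O(1) per byte, so identical
runs can be found at arbitrary, unaligned offsets in O(n) overall. This is
an illustrative sketch of the rsync-style recurrence, not Btrfs or rsync
code:

```python
def weak_checksums(data: bytes, window: int):
    """Yield (offset, checksum) for every window position.

    a is the running byte sum, b the position-weighted sum; both are
    updated incrementally when the window slides by one byte, which is
    what makes scanning every offset affordable."""
    M = 1 << 16
    if len(data) < window:
        return
    a = sum(data[:window]) % M
    b = sum((window - i) * data[i] for i in range(window)) % M
    yield 0, (b << 16) | a
    for i in range(len(data) - window):
        out, inc = data[i], data[i + window]
        a = (a - out + inc) % M           # drop the old byte, add the new
        b = (b - window * out + a) % M    # reweight using the updated a
        yield i + 1, (b << 16) | a
```

Matches found by the weak checksum would then be confirmed with a strong
hash and a byte compare, as in rsync.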
Above in this thread you said:
> The _only_ reason to defer deduping is that hashing costs CPU time.
> But the chances are that a modern CPU core can churn out MD5 and/or
> SHA256 hashes faster than a modern mechanical disk can keep up. A
> 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can
> handle considerably more than 250 block hashings per second. You could
> argue that this changes in cases of sequential I/O on big files, but a
> 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs
> will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problem.
An SSD can do more like 200MB/sec write and 280MB/sec read, sequential
or random, and is actually limited only by SATA at 3.0Gbit/sec; soon
enough they will have 6.0Gbit/sec SATA/SAS.
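The quoted ~111MB/s single-core SHA256 figure is easy to sanity-check
against any given machine with a micro-benchmark like this (the function
name and parameters are just illustration; absolute numbers will of course
vary with CPU and hash implementation):

```python
import hashlib
import time

def hash_throughput_mb_s(total_mb=64, block_size=4096):
    """Hash `total_mb` MiB in filesystem-sized blocks and report MiB/s:
    a rough ceiling for single-core dedup hashing."""
    block = bytes(block_size)
    n = total_mb * 1024 * 1024 // block_size
    t0 = time.perf_counter()
    for _ in range(n):
        hashlib.sha256(block).digest()
    return total_mb / (time.perf_counter() - t0)
```

If the measured rate lands below the disk's sequential bandwidth, hashing,
not I/O, is the bottleneck for sequential workloads, which is the point
being argued about here.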
More cores can be used for hashing, but a multicore implementation of
work that is not naturally threaded (unlike, say, parallel and
completely separate queries to a DB) is usually very difficult to do
well. E.g. it was recently attempted for parity computation in MD RAID
by knowledgeable people, but it performed so much worse than single-core
that it was disabled.
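That said, batch hashing of independent blocks is closer to the "naturally
threaded" case than in-line RAID parity is, since there are no ordering
constraints between blocks. A sketch of what that could look like in a
userspace scanner (hypothetical helper, not MD RAID or Btrfs code; note
that CPython's hashlib releases the GIL while hashing buffers larger than
about 2KiB, so even threads can spread this across cores):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_block_hashes(blocks, workers=4):
    """Hash independent blocks on several threads, preserving order.

    Each block's hash depends on nothing else, so the work farms out
    with no coordination beyond collecting the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).digest(), blocks))
```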