From: Spelic <spelic@shiftmail.org>
To: Gordan Bobic <gordan@bobich.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 06 Jan 2011 02:19:04 +0100
Message-ID: <4D251888.7060508@shiftmail.org>
In-Reply-To: <4D24D8BC.90808@bobich.net>
On 01/05/2011 09:46 PM, Gordan Bobic wrote:
> On 01/05/2011 07:46 PM, Josef Bacik wrote:
>
> Offline dedup is more expensive - so why are you of the opinion that
> it is less silly? And comparison by silliness quotient still sounds
> like an argument over which is better.
>
If I may give my opinion, I wouldn't want dedup to be enabled online for
the whole filesystem.
Three reasons:
1- Virtual machine disk images should not be deduplicated, IMHO, if you
care about performance, because fragmentation matters more in that case.
So offline dedup is preferable, IMHO. Or at least online dedup should
happen only on configured paths.
2- I don't want performance to drop all the time. I would run dedup
periodically during less active hours, hence offline. A rate limiter
should also be implemented so as not to thrash the drives too much. A
stop-and-continue mechanism should be implemented too, so that a dedup
run which couldn't finish within a certain time frame (e.g. one night)
can continue the following night without restarting from the beginning.
3- Only some directories should be deduped, for performance reasons. You
can foresee where duplicate blocks are likely to exist and where they
are not: backup directories, typically, or mail server directories. The
rest is probably a waste of time.
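Points 2 and 3 above could look roughly like the following sketch: a scan
restricted to a configured path, throttled by a byte-per-second budget, and
checkpointed so an unfinished run can resume the next night. All names here
are hypothetical illustration, not the proposed Btrfs tooling:

```python
import hashlib
import json
import os
import time

def scan_for_dup_hashes(root, checkpoint, block_size=4096,
                        max_bytes_per_sec=20 * 1024**2):
    """Walk `root`, hash each file block, record where each digest was
    first seen. Skips files already listed in the `checkpoint` JSON
    file (stop-and-continue) and sleeps to honor `max_bytes_per_sec`
    (rate limiting)."""
    done = set()
    if os.path.exists(checkpoint):
        with open(checkpoint) as f:
            done = set(json.load(f))
    hashes = {}  # digest -> first (path, offset) seen
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if path in done:
                continue  # finished in an earlier run
            start = time.monotonic()
            scanned = 0
            with open(path, "rb") as f:
                offset = 0
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    digest = hashlib.sha256(block).digest()
                    hashes.setdefault(digest, (path, offset))
                    offset += len(block)
                    scanned += len(block)
                    # crude rate limiter: sleep if ahead of the budget
                    budgeted = scanned / max_bytes_per_sec
                    elapsed = time.monotonic() - start
                    if budgeted > elapsed:
                        time.sleep(budgeted - elapsed)
            done.add(path)
            with open(checkpoint, "w") as f:
                json.dump(sorted(done), f)  # persist progress per file
    return hashes
```

A real implementation would checkpoint mid-file as well and feed the
candidate pairs to an extent-same ioctl; this only shows the scan shape.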
> Dedup isn't for an average desktop user. Dedup is for backup storage
> and virtual images
Not virtual images, IMHO, for the reason above.
Also, the OS is small even if it is identical across multiple virtual
images; how much space will it occupy anyway? Less than 5GB per disk
image, usually. And that is the only part that would be deduped, because
the data is likely to differ on each instance. How many VMs do you have
running? 20? That's at most 100GB saved, one time, at the cost of a lot
of fragmentation.
Now if you back up those images periodically, that's a place where I
would run dedup.
> I'd just make it always use the fs block size. No point in making it
> variable.
Agreed. What is the reason for a variable block size?
>> And then lets bring up the fact that you _have_ to manually compare
>> any data you
>> are going to dedup. I don't care if you think you have the greatest
>> hashing
>> algorithm known to man, you are still going to have collisions
>> somewhere at some
>> point, so in order to make sure you don't lose data, you have to
>> manually memcmp
>> the data.
Totally agreed.
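A minimal sketch of the verify step we both agree on (hypothetical names,
not the Btrfs ioctl code): the hash only nominates candidate pairs cheaply,
and the byte-for-byte compare, the userspace analogue of memcmp, is what
actually guarantees the blocks are identical before any extent is shared:

```python
import hashlib

def dedupable(block_a: bytes, block_b: bytes) -> bool:
    """Return True only if the two blocks may safely share an extent."""
    if hashlib.sha256(block_a).digest() != hashlib.sha256(block_b).digest():
        return False           # different hashes: definitely different data
    # Hashes match, but a collision is still possible in principle,
    # so compare the actual bytes before deduplicating.
    return block_a == block_b
```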
>> So if you are doing this online, that means reading back the copy you
>> want to dedup in the write path so you can do the memcmp before you
>> write. That
>> is going to make your write performance _suck_.
>
> IIRC, this is configurable in ZFS so that you can switch off the
> physical block comparison. If you use SHA256, the probability of a
> collision (unless SHA is broken, in which case we have much bigger
> problems) is 1 in 2^128. Times 4KB blocks, that is one collision in
> 10^24 Exabytes. That's one trillion trillion (a double trillion) Exabytes.
I like mathematics, but I don't care this time. I would never enable
dedup without a full block compare, and I think most users and most
companies would do the same.
With a full block compare, a simpler/faster algorithm could be chosen,
like MD5. Or even a 64-bit digest, which doesn't exist as such, but you
can take MD4 and XOR the first 8 bytes with the second 8 bytes to reduce
it to 8 bytes. This matters because it saves 60% of the RAM used during
dedup, which is expected to be large, and collisions are still
insignificant at 64 bits. Clearly you then still need to do a full block
compare after the hashes match.
BTW, if you want to allow (as an option) dedup without a full block
compare, SHA-1 is not so good: SHA-0 already had problems, and now SHA-1
has problems; I would hardly suggest it for cryptographically sensitive
uses going forward. Use RIPEMD-160, or better yet RIPEMD-256, which is
even faster according to http://www.cryptopp.com/benchmarks.html . The
RIPEMDs are much better algorithms than the SHAs: they have no known
weaknesses.
Note that deduplication IS a cryptographically sensitive matter: if
SHA-1 is cracked, people could nuke (or maybe even alter, and thereby
escalate privileges through) other users' files by providing blocks with
the same hash and waiting for dedup to pass.
Same for AES, BTW: it is showing weaknesses; use Blowfish or Twofish.
SHA-1 and AES are two wrong standards...
Dedup without a full block compare does seem suited to online dedup
(which I wouldn't enable, now for one more reason), because with full
block compares the performance would really suck. But please keep the
full block compare for offline dedup.
Also, I could suggest a third type of deduplication, but this one is
harder: a file-level deduplication which works like xdelta, i.e. one
capable of recognizing identical pieces of data in two files even when
they are not at the same offset and not even aligned to a block
boundary. For this, a rolling hash like rsync's, or the xdelta 3.0
algorithm, could be used. For this to work I suppose Btrfs would need to
handle padding of filesystem blocks... which I'm not sure was foreseen.
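The rolling-hash idea can be sketched as follows: an rsync-style weak
checksum over a sliding window, updated in O(1) per byte, so identical
runs can be found at arbitrary, unaligned offsets in O(n) overall. This is
an illustrative sketch of the rsync-style recurrence, not Btrfs or rsync
code:

```python
def weak_checksums(data: bytes, window: int):
    """Yield (offset, checksum) for every window position.

    a is the running byte sum, b the position-weighted sum; both are
    updated incrementally when the window slides by one byte, which is
    what makes scanning every offset affordable."""
    M = 1 << 16
    if len(data) < window:
        return
    a = sum(data[:window]) % M
    b = sum((window - i) * data[i] for i in range(window)) % M
    yield 0, (b << 16) | a
    for i in range(len(data) - window):
        out, inc = data[i], data[i + window]
        a = (a - out + inc) % M           # drop the old byte, add the new
        b = (b - window * out + a) % M    # reweight using the updated a
        yield i + 1, (b << 16) | a
```

Matches found by the weak checksum would then be confirmed with a strong
hash and a byte compare, as in rsync.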
Above in this thread you said:
> The _only_ reason to defer deduping is that hashing costs CPU time.
> But the chances are that a modern CPU core can churn out MD5 and/or
> SHA256 hashes faster than a modern mechanical disk can keep up. A
> 15,000rpm disk can theoretically handle 250 IOPS. A modern CPU can
> handle considerably more than 250 block hashings per second. You could
> argue that this changes in cases of sequential I/O on big files, but a
> 1.86GHz Core2 can churn through 111MB/s of SHA256, which even SSDs
> will struggle to keep up with.
A normal 1TB disk with platters can do 130MB/sec sequential, no problem.
An SSD can do more like 200MB/sec write and 280MB/sec read, sequential
or random, and is actually limited only by SATA at 3.0Gbit/sec; soon
enough they will have 6.0Gbit/sec SATA/SAS.
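The quoted ~111MB/s single-core SHA256 figure is easy to sanity-check
against any given machine with a micro-benchmark like this (the function
name and parameters are just illustration; absolute numbers will of course
vary with CPU and hash implementation):

```python
import hashlib
import time

def hash_throughput_mb_s(total_mb=64, block_size=4096):
    """Hash `total_mb` MiB in filesystem-sized blocks and report MiB/s:
    a rough ceiling for single-core dedup hashing."""
    block = bytes(block_size)
    n = total_mb * 1024 * 1024 // block_size
    t0 = time.perf_counter()
    for _ in range(n):
        hashlib.sha256(block).digest()
    return total_mb / (time.perf_counter() - t0)
```

If the measured rate lands below the disk's sequential bandwidth, hashing,
not I/O, is the bottleneck for sequential workloads, which is the point
being argued about here.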
More cores can be used for hashing, but a multicore implementation of
work that is not naturally threaded (unlike, say, parallel and
completely separate queries to a DB) is usually very difficult to do
well. E.g. it was recently attempted for parity computation in MD RAID
by knowledgeable people, but it performed so much worse than single-core
that it was disabled.
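That said, batch hashing of independent blocks is closer to the "naturally
threaded" case than in-line RAID parity is, since there are no ordering
constraints between blocks. A sketch of what that could look like in a
userspace scanner (hypothetical helper, not MD RAID or Btrfs code; note
that CPython's hashlib releases the GIL while hashing buffers larger than
about 2KiB, so even threads can spread this across cores):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def parallel_block_hashes(blocks, workers=4):
    """Hash independent blocks on several threads, preserving order.

    Each block's hash depends on nothing else, so the work farms out
    with no coordination beyond collecting the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).digest(), blocks))
```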