From: Tomasz Chmielewski <mangoo@wpkg.org>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 06 Jan 2011 10:37:46 +0100
Message-ID: <4D258D6A.9010903@wpkg.org>
> I have been thinking a lot about de-duplication for a backup application
> I am writing. I wrote a little script to figure out how much it would
> save me. For my laptop home directory, about 100 GiB of data, it was a
> couple of percent, depending a bit on the size of the chunks. With 4 KiB
> chunks, I would save about two gigabytes. (That's assuming no MD5 hash
> collisions.) I don't have VM images, but I do have a fair bit of saved
> e-mail. So, for backups, I concluded it was worth it to provide an
> option to do this. I have no opinion on whether it is worthwhile to do
> in btrfs.
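The script mentioned in the quoted text is not shown in this thread. A
minimal sketch of that kind of estimate, assuming fixed-size chunks
hashed with MD5 and (as the quote says) no hash collisions, might look
like this. The function name and its parameters are illustrative only.

```python
import hashlib


def estimate_dedup_savings(paths, chunk_size=4096):
    """Return (total_bytes, duplicate_bytes) over all files in `paths`.

    Each file is split into fixed-size chunks. A chunk counts as a
    duplicate if a chunk with the same MD5 digest was seen earlier.
    Assumes no MD5 collisions, as in the quoted estimate.
    """
    seen = set()
    total = 0
    duplicate = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
                digest = hashlib.md5(chunk).digest()
                if digest in seen:
                    duplicate += len(chunk)
                else:
                    seen.add(digest)
    return total, duplicate
```

Calling it with a list of file paths and chunk_size=4096 corresponds to
the 4 KiB case in the quote; the second value returned is the number of
bytes that would not need to be stored again.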
Online deduplication is very useful for backups of big, multi-gigabyte
files which change constantly.
Some mail servers store their data this way, and so do some MUAs;
databases also commonly pack everything into big files which tend to
change here and there almost all the time.
Multi-gigabyte files in which only a few megabytes have changed can't
be hardlinked; simple maths shows that even compressing multiple copies
which differ only slightly will use more space than deduplication,
which costs just a few megabytes extra per copy (because everything
else is shared).
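The "simple maths" can be spelled out with made-up numbers. The sketch
below assumes a 10 GiB file, 7 retained copies, 5 MiB changed between
consecutive copies, and a 2:1 compression ratio. All four figures are
hypothetical, and the deduplicated case ignores metadata overhead and
rounding up to block size.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def space_used(file_size, changed_per_copy, copies, compression_ratio):
    """Return bytes used by (full copies, compressed copies, dedup)."""
    full = file_size * copies
    compressed = full / compression_ratio
    # With dedup, the first copy is stored once; every later copy
    # only adds the data that actually changed.
    deduplicated = file_size + changed_per_copy * (copies - 1)
    return full, compressed, deduplicated


full, compressed, deduped = space_used(10 * GIB, 5 * MIB, 7, 2.0)
print(f"full copies : {full / GIB:.2f} GiB")        # 70.00 GiB
print(f"compressed  : {compressed / GIB:.2f} GiB")  # 35.00 GiB
print(f"deduplicated: {deduped / GIB:.2f} GiB")     # 10.03 GiB
```

Under these assumptions compression would need a ratio of roughly 7:1
just to match deduplication.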
And I don't even want to think about the IO needed to offline dedup
multi-terabyte storage (1 TB disks and bigger are becoming standard
nowadays), e.g. daily, especially when the storage is already heavily
loaded in IO terms.
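For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes a naive offline pass that reads every byte once, and a
sustained read rate of 100 MB/s with no competing load. Both are
assumptions made for illustration, not measurements.

```python
def full_scan_hours(storage_bytes, read_bytes_per_second):
    """Hours needed to read the whole storage once at a steady rate."""
    return storage_bytes / read_bytes_per_second / 3600


TB = 10 ** 12
MB = 10 ** 6

for size_tb in (1, 4, 10):
    hours = full_scan_hours(size_tb * TB, 100 * MB)
    print(f"{size_tb:>2} TB at 100 MB/s: {hours:.1f} hours")
```

On storage that is already busy the effective rate would be lower, so
the real figures would be worse than what this prints.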
Now, one popular tool which can deal with small changes in files is
rsync. It can be used to copy files over the network - so that if you
want to copy/update a multi-gigabyte file which only has a few changes,
rsync would need to transfer just a few megabytes.
On disk, however, rsync creates a "temporary copy" of the original
file, where it packs unchanged contents together with any changes made.
For example, while it copies/updates a file, we will have:
original_file.bin
.temporary_random_name
Later, original_file.bin would be removed, and .temporary_random_name
would be renamed to original_file.bin. Any deduplication we had so far
is lost, and we have to start the IO all over again.
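The update pattern described above can be imitated in a few lines. This
is only a sketch of the copy-then-rename behaviour, not rsync's actual
code; update_via_temp_copy is a hypothetical helper. It shows that a
small change still rewrites the whole file and leaves the name pointing
at a different inode.

```python
import os
import tempfile


def update_via_temp_copy(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at `offset` via a temporary copy.

    Returns (old_inode, new_inode) so the caller can see that the
    file was replaced rather than modified in place.
    """
    old_inode = os.stat(path).st_ino
    directory = os.path.dirname(os.path.abspath(path))
    # Hidden temporary file next to the original, as in the example.
    fd, tmp_path = tempfile.mkstemp(prefix=".", dir=directory)
    with os.fdopen(fd, "wb") as tmp, open(path, "rb") as src:
        data = src.read()  # whole file in memory: fine for a sketch only
        end = offset + len(new_bytes)
        # Unchanged contents are written out again with the change.
        tmp.write(data[:offset] + new_bytes + data[end:])
    # Rename the temporary file over the original name.
    os.replace(tmp_path, path)
    return old_inode, os.stat(path).st_ino
```

Even if new_bytes is only a few bytes long, every byte of the file is
written again to the temporary file, which is why previously shared
blocks are no longer shared afterwards.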
--
Tomasz Chmielewski
http://wpkg.org