From: Gordan Bobic <gordan@bobich.net>
To: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Offline Deduplication for Btrfs
Date: Thu, 06 Jan 2011 10:52:36 +0000
Message-ID: <4D259EF4.504@bobich.net>
In-Reply-To: <4D258D6A.9010903@wpkg.org>

Tomasz Chmielewski wrote:
>> I have been thinking a lot about de-duplication for a backup application
>> I am writing. I wrote a little script to figure out how much it would
>> save me. For my laptop home directory, about 100 GiB of data, it was a
>> couple of percent, depending a bit on the size of the chunks. With 4 KiB
>> chunks, I would save about two gigabytes. (That's assuming no MD5 hash
>> collisions.) I don't have VM images, but I do have a fair bit of saved
>> e-mail. So, for backups, I concluded it was worth it to provide an
>> option to do this. I have no opinion on whether it is worthwhile to do
>> in btrfs.
> 
> Online deduplication is very useful for backups of big, multi-gigabyte 
> files which change constantly.
> Some mail servers store mail this way, as do some MUAs; databases also 
> commonly pack everything into big files which tend to change here and 
> there almost all the time.
> 
> Multi-gigabyte files which have only a few megabytes changed can't be 
> hardlinked; simple maths shows that even compressing multiple files 
> which have few differences will use more space than deduplication, 
> which costs only a few megabytes extra per file (because everything 
> else is deduplicated).
> 
> And I don't even want to think about the IO needed to run offline 
> dedup on multi-terabyte storage (1 TB disks and bigger are becoming 
> standard nowadays), e.g. daily, especially when the storage is already 
> heavily used in IO terms.
> 
> 
> Now, one popular tool which can deal with small changes in files is 
> rsync. It can be used to copy files over the network - so that if you 
> want to copy/update a multi-gigabyte file which only has a few changes, 
> rsync would need to transfer just a few megabytes.
> 
> On disk however, rsync creates a "temporary copy" of the original file, 
> where it packs unchanged contents together with any changes made. For 
> example, while it copies/updates a file, we will have:
> 
> original_file.bin
> .temporary_random_name
> 
> Later, original_file.bin would be removed, and .temporary_random_name 
> would be renamed to original_file.bin. Any deduplication we had so far 
> is lost, and the dedup IO has to start over again.

You can tell rsync to either modify the file in place (--inplace) or to 
put the temp file somewhere else (--temp-dir=DIR).
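
To illustrate why this matters for dedup, here is a rough Python sketch 
of the two update strategies. It is not rsync's actual algorithm; the 
4 KiB block size, the function names and the ".tmp" suffix are all 
assumptions made up for the example. The in-place variant touches only 
the blocks that differ, while the temp-and-rename variant writes every 
block of the file again.

```python
import os

BLOCK = 4096  # assumed block size, chosen only for illustration


def update_in_place(dst_path, new_data, block=BLOCK):
    """Overwrite only the blocks of dst_path that differ from new_data.

    Returns the number of blocks actually written.
    """
    written = 0
    with open(dst_path, "r+b") as f:
        for off in range(0, len(new_data), block):
            chunk = new_data[off:off + block]
            f.seek(off)
            # Compare the existing block with the desired one.
            if f.read(len(chunk)) != chunk:
                f.seek(off)
                f.write(chunk)
                written += 1
        # Drop any trailing data if the new content is shorter.
        f.truncate(len(new_data))
    return written


def update_via_temp(dst_path, new_data, block=BLOCK):
    """Write a complete new copy, then rename it over the original.

    Returns the number of blocks written, which is all of them.
    """
    tmp_path = dst_path + ".tmp"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(new_data)
    os.replace(tmp_path, dst_path)
    return -(-len(new_data) // block)  # ceiling division
```

With a file where only one block changed, update_in_place() writes a 
single block and the remaining blocks stay as they were on disk, 
whereas update_via_temp() produces a whole new file that would have to 
be deduplicated from scratch.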

Gordan

