Re: Content based storage

linux-btrfs.vger.kernel.org archive mirror

From: Boyd Waters <waters.boyd@gmail.com>
To: Ric Wheeler <rwheeler@redhat.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Content based storage
Date: Sat, 20 Mar 2010 17:24:39 -0400
Message-ID: <-1949627963487010042@unknownmsgid>
In-Reply-To: <4BA4C811.4060702@redhat.com>

On Mar 20, 2010, at 9:05 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
>>
>> My dataset reported a dedup factor of 1.28 for about 4TB, meaning
>> that almost a third of the dataset was duplicated.

> It is always interesting to compare this to the rate you would get
> with old fashioned compression to see how effective this is. Seems
> to be not that aggressive if I understand your results correctly.
>
> Any idea of how compressible your data set was?

Well, of course if I used zip on the whole 4 TB that would deal with
my duplication issues, and give me a useless, static blob with no
checksumming. I haven't tried.

One thing that I did do, seven (!) years ago, was to detect duplicate
files (not blocks) and replace them with hard links. I was able to
squeeze all of the air out of a series of backups, and could still see
all of them. I used a Perl script for all this. It was nuts, but now I
understand why Apple implemented hard links to directories in HFS+ in
order to get their Time Machine product. I didn't have copy-on-write,
so btrfs snapshots completely spank a manual system like this, but I
did get 7-to-1 compression. These days you can use rsync with
"--link-dest" to make hard-linked duplicates of large directory trees.
Tar, cpio, and friends tend to break when transferring hundreds of
gigabytes with thousands of hard links, or they ignore the hard links
altogether.
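
For anyone curious what file-level dedup with hard links looks like,
here is a rough sketch. It is not the Perl script mentioned above; it
is a minimal Python illustration, and the names file_digest,
hardlink_duplicates and the ".dedup-tmp" suffix are made up for this
example. It assumes the duplicates may safely share one inode (so
ownership, permissions and mtime end up identical) and that nothing
else is writing to the tree while it runs.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced by links.
    """
    # Group by (device, size) first, so only plausible duplicates get
    # hashed and we never try to link across filesystems.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            by_size[(st.st_dev, st.st_size)].append(path)

    linked = 0
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        first_seen = {}  # digest -> path of the copy we keep
        for path in sorted(paths):
            digest = file_digest(path)
            keeper = first_seen.setdefault(digest, path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first copy, or already linked to it
            tmp = path + ".dedup-tmp"  # hypothetical temp name
            os.link(keeper, tmp)       # new link next to the duplicate
            os.replace(tmp, path)      # swap it in over the duplicate
            linked += 1
    return linked
```

Calling hardlink_duplicates("/some/backup/root") on a directory of
backups would leave every path in place but store each distinct file
body only once. Editing one copy in place then changes all of them,
which is exactly the problem copy-on-write avoids.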

Good times. I'm not sure how this is germane to btrfs, except to point
out pathological file-system usage that I've actually attempted in
real life. I actually use a lot of the ZFS feature set, and I look
forward to btrfs stability. I think btrfs can get there.
