linux-btrfs.vger.kernel.org archive mirror
From: Ric Wheeler <rwheeler@redhat.com>
To: Boyd Waters <waters.boyd@gmail.com>
Cc: "linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Content based storage
Date: Sat, 20 Mar 2010 18:16:12 -0400
Message-ID: <4BA5492C.5030709@redhat.com>
In-Reply-To: <-1949627963487010042@unknownmsgid>

On 03/20/2010 05:24 PM, Boyd Waters wrote:
> On Mar 20, 2010, at 9:05 AM, Ric Wheeler <rwheeler@redhat.com> wrote:
>>>
>>> My dataset reported a dedup factor of 1.28 for about 4TB, meaning that
>>> almost a third of the dataset was duplicated.
>
>> It is always interesting to compare this to the rate you would get
>> with old fashioned compression to see how effective this is. Seems
>> to be not that aggressive if I understand your results correctly.
>>
>> Any idea of how compressible your data set was?
>
> Well, of course if I used zip on the whole 4 TB that would deal with
> my duplication issues, and give me a useless, static blob with no
> checksumming. I haven't tried.

gzip/bzip2 of the block device was meant to give a best case estimate of 
what traditional compression can do. Many block devices (including some single 
spindle disks) can do compression internally.
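
To make the comparison concrete, here is a minimal Python sketch of that kind of
estimate. It is an illustration only, not a tool anyone in this thread used: the
names compression_ratio, space_saved and estimate_ratio are hypothetical, and it
assumes that a "dedup factor" means logical size divided by physical size. It
compresses the leading chunks of a file or device image and reports the size
ratio, which can then be put on the same scale as a dedup factor.

```python
import bz2
import gzip


def compression_ratio(data: bytes, codec: str = "gzip") -> float:
    """Return original_size / compressed_size for one in-memory buffer."""
    if not data:
        raise ValueError("empty input")
    if codec == "gzip":
        packed = gzip.compress(data, compresslevel=9)
    elif codec == "bzip2":
        packed = bz2.compress(data, 9)
    else:
        raise ValueError(f"unknown codec: {codec}")
    return len(data) / len(packed)


def space_saved(ratio: float) -> float:
    """Convert a size ratio (before / after) into the fraction of space saved."""
    return 1.0 - 1.0 / ratio


def estimate_ratio(path: str, chunk_size: int = 1 << 20,
                   max_chunks: int = 64, codec: str = "gzip") -> float:
    """Compress up to max_chunks leading chunks of a file (or device node)
    independently and return total bytes in / total bytes out."""
    total_in = 0
    total_out = 0.0
    with open(path, "rb") as f:
        for _ in range(max_chunks):
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_in += len(chunk)
            total_out += len(chunk) / compression_ratio(chunk, codec)
    if total_in == 0:
        raise ValueError("nothing to read")
    return total_in / total_out
```

Under the stated assumption, space_saved(1.28) comes to roughly 0.22, so a dedup
factor of 1.28 would correspond to a bit over a fifth of the space saved. Reading
only the leading chunks is a shortcut; a fairer estimate would sample chunks from
across the whole device.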

>
> One thing that I did do, seven (!) years ago, was to detect duplicate
> files (not blocks) and use hard links. I was able to squeeze out all
> of the air in a series of backups, and was able to see all of them. I
> used a Perl script for all this. It was nuts, but now I understand why
> Apple implemented hard links to directories in HFS in order to get
> their Time Machine product.  I didn't have copy-on-write, so btrfs
> snapshots completely spank a manual system like this, but I did get
> 7-to-1 compression. These days you can use rsync with "--link-dest" to
> make hard-linked duplicates of large directory trees. Tar, cpio, and
> friends tend to break when transferring hundreds of gigabytes with
> thousands of hard links. Or they ignore the hard links.
>
> Good times. I'm not sure how this is germane to btrfs, except to point
> out pathological file-system usage that I've actually attempted in
> real life. I actually use a lot of the ZFS feature set, and I look
> forward to btrfs stability. I think btrfs can get there.

File level dedup is something we did in a group I worked with before and can 
certainly be quite effective. Even better, it is much easier to map into normal 
user expectations :-)
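
For readers who have not seen it done, a minimal Python sketch of file-level
dedup with hard links follows. It is not the Perl script mentioned above and not
what our group built; the function names (file_digest, find_duplicates,
link_duplicates) are hypothetical, and it assumes every file under the root
lives on one filesystem, since hard links cannot cross filesystems.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under root by (size, digest); return groups of 2+."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        for path in paths:
            groups[(size, file_digest(path))].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


def link_duplicates(root):
    """Replace each duplicate with a hard link to the first copy in its group.
    Returns the number of files that were relinked."""
    relinked = 0
    for paths in find_duplicates(root):
        keeper = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already the same inode
            tmp = dup + ".dedup-tmp"  # hypothetical temporary name
            os.link(keeper, tmp)      # new link to the kept copy
            os.replace(tmp, dup)      # atomically swap it in for the duplicate
            relinked += 1
    return relinked
```

Calling link_duplicates on a backup tree leaves every path in place but backed
by one inode per distinct content. The usual caveat applies: linked files share
ownership, mode and timestamps, and a write through one name changes all of
them, which is exactly what copy-on-write snapshots avoid.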

ric



Thread overview: 16+ messages
2010-03-16  9:21 Content based storage David Brown
2010-03-16 22:45 ` Fabio
2010-03-17  8:21   ` David Brown
2010-03-17  0:45 ` Hubert Kario
2010-03-17  8:27   ` David Brown
2010-03-17  8:48     ` Heinz-Josef Claes
2010-03-17 15:25       ` Hubert Kario
2010-03-17 15:33         ` Leszek Ciesielski
2010-03-17 19:43           ` Hubert Kario
2010-03-20  2:46             ` Boyd Waters
2010-03-20 13:05               ` Ric Wheeler
2010-03-20 21:24                 ` Boyd Waters
2010-03-20 22:16                   ` Ric Wheeler [this message]
2010-03-20 22:44                     ` Ric Wheeler
2010-03-21  6:55                       ` Boyd Waters
2010-03-18 23:33   ` create debian package of btrfs kernel from git tree rk
