public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Ric Wheeler <rwheeler@redhat.com>
To: Tomasz Chmielewski <mangoo@wpkg.org>
Cc: Ric Wheeler <rwheeler@redhat.com>,
	Michael Tharp <gxti@partiallystapled.com>,
	Thomas Glanzmann <thomas@glanzmann.de>,
	Chris Mason <chris.mason@oracle.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: Data Deduplication with the help of an online filesystem check
Date: Mon, 04 May 2009 10:45:46 -0400	[thread overview]
Message-ID: <49FEFF9A.8060803@redhat.com> (raw)
In-Reply-To: <49FEFE27.5090804@wpkg.org>

On 05/04/2009 10:39 AM, Tomasz Chmielewski wrote:
> Ric Wheeler schrieb:
>
>> One thing in the above scheme that would be really interesting for 
>> all possible hash functions is maintaining good stats on hash 
>> collisions, effectiveness of the hash, etc. There has been a lot of 
>> press about MD5 hash collisions for example - it would be really neat 
>> to be able to track real world data on those,
>
> See here ("The hashing function"):
>
> http://backuppc.sourceforge.net/faq/BackupPC.html#some_design_issues
>
> It's not "real world data", but it gives some overview which applies 
> here.
>
>

I have lots of first hand "real world" data from 5 years at EMC in the 
Centera group where we used various types of hashes to do single 
instancing at the file level, but those are unfortunately not public :-) 
Note that other parts of EMC do dedup at the block level (Avamar most 
notably).

The key to doing dedup at scale is answering a bunch of questions:

     (1) Block level or file level dedup?
     (2) Inband dedup (during a write) or background dedup?
     (3) How reliably can you protect the pool of blocks? How reliably 
can you protect the database that maps hashes to blocks?
     (4) Can you give users who are somewhat jaded confidence in your 
solution (this is where stats come in very handy!)
     (5) In the end, dedup is basically a data compression trick - how 
effective is it (including the costs of the metadata, etc) compared to 
less complex schemes? What does it costs in CPU, DRAM and impact on the 
foreground workload?

Dedup is a very active area in the storage world, but you need to be 
extremely robust in the face of failure since a single block failure 
could (worst case!) lose all stored data in the pool :-)

Regards,

Ric


  reply	other threads:[~2009-05-04 14:45 UTC|newest]

Thread overview: 67+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-27  3:33 Data Deduplication with the help of an online filesystem check Thomas Glanzmann
2009-04-27 13:37 ` Chris Mason
2009-04-28  5:22   ` Thomas Glanzmann
2009-04-28 10:02     ` Chris Mason
2009-04-28 13:49       ` Andrey Kuzmin
2009-04-28 13:58         ` Chris Mason
2009-04-28 14:04           ` Thomas Glanzmann
2009-04-28 17:21             ` Chris Mason
2009-04-28 20:10               ` Thomas Glanzmann
2009-04-28 20:29                 ` Thomas Glanzmann
2009-04-28 13:58         ` jim owens
2009-04-28 16:10       ` Anthony Roberts
2009-04-28 15:59   ` Thomas Glanzmann
2009-04-28 16:04     ` Tomasz Chmielewski
2009-04-28 17:29       ` Edward Shishkin
2009-04-28 17:34         ` Thomas Glanzmann
2009-04-28 17:38           ` Chris Mason
2009-04-28 17:43             ` Thomas Glanzmann
2009-04-28 17:45             ` Heinz-Josef Claes
2009-04-28 20:16               ` Thomas Glanzmann
2009-04-28 20:36                 ` Heinz-Josef Claes
2009-04-28 20:52                   ` Thomas Glanzmann
2009-04-28 20:58                     ` Chris Mason
2009-04-28 21:12                       ` Thomas Glanzmann
2009-04-28 21:26                         ` Chris Mason
2009-04-28 22:14                           ` Thomas Glanzmann
2009-04-28 23:18                             ` Chris Mason
2009-04-29 12:03                               ` Thomas Glanzmann
2009-04-29 13:11                                 ` Michael Tharp
2009-04-29 13:14                                 ` Chris Mason
2009-04-29 13:58                                   ` Thomas Glanzmann
2009-04-29 14:31                                     ` Chris Mason
2009-04-29 15:26                                       ` Thomas Glanzmann
2009-04-29 15:45                                         ` Chris Mason
2009-06-04  8:49                                           ` Thomas Glanzmann
2009-06-04 11:43                                             ` Chris Mason
2009-06-04 12:03                                               ` Thomas Glanzmann
2009-06-04 12:43                                                 ` Chris Mason
2009-06-05 12:20                                               ` Tomasz Chmielewski
2009-06-05 12:50                                                 ` Chris Mason
2009-06-05 15:35                                                   ` Tomasz Chmielewski
2009-04-29  0:06                       ` Bron Gondwana
2009-05-06 15:16               ` Sander
2009-04-28 17:32       ` Thomas Glanzmann
2009-04-28 17:41         ` Michael Tharp
2009-04-28 20:14           ` Thomas Glanzmann
2009-05-04 14:29           ` Ric Wheeler
2009-05-04 14:39             ` Tomasz Chmielewski
2009-05-04 14:45               ` Ric Wheeler [this message]
2009-05-04 15:15                 ` Thomas Glanzmann
2009-05-04 16:03                   ` Ric Wheeler
2009-05-04 16:16                     ` Andrey Kuzmin
2009-05-04 16:24                       ` Thomas Glanzmann
2009-05-04 18:06                         ` Jan-Frode Myklebust
2009-05-04 19:16                           ` Andrey Kuzmin
2009-05-05  8:02                           ` Thomas Glanzmann
2009-05-04 16:26                     ` Thomas Glanzmann
2009-05-04 19:11                       ` Heinz-Josef Claes
2009-05-04 21:29                         ` Dmitri Nikulin
2009-05-05  7:18                           ` Heinz-Josef Claes
2009-05-24  7:27                         ` Thomas Glanzmann
2009-04-28 17:23     ` Chris Mason
2009-04-28 17:37       ` Thomas Glanzmann
2009-04-28 17:43         ` Chris Mason
2009-04-28 20:15           ` Thomas Glanzmann
2009-04-28 21:19           ` Dmitri Nikulin
2009-04-28 20:24       ` Thomas Glanzmann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49FEFF9A.8060803@redhat.com \
    --to=rwheeler@redhat.com \
    --cc=chris.mason@oracle.com \
    --cc=gxti@partiallystapled.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=mangoo@wpkg.org \
    --cc=thomas@glanzmann.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox