From: Ric Wheeler <rwheeler@redhat.com>
To: Michael Tharp <gxti@partiallystapled.com>
Cc: Thomas Glanzmann <thomas@glanzmann.de>,
Tomasz Chmielewski <mangoo@wpkg.org>,
Chris Mason <chris.mason@oracle.com>,
linux-btrfs@vger.kernel.org
Subject: Re: Data Deduplication with the help of an online filesystem check
Date: Mon, 04 May 2009 10:29:58 -0400
Message-ID: <49FEFBE6.40209@redhat.com>
In-Reply-To: <49F73FC9.3070607@partiallystapled.com>

On 04/28/2009 01:41 PM, Michael Tharp wrote:
> Thomas Glanzmann wrote:
>> no, I just used the md5 checksum. And even if I have a hash escalation
>> which is highly unlikely it still gives a good house number.
>
> I'd start with a crc32 and/or MD5 to find candidate blocks, then do a
> bytewise comparison before actually merging them. Even the risk of an
> accidental collision is too high, and considering there are plenty of
> birthday-style MD5 attacks it would not be extraordinarily difficult
> to construct a block that collides with e.g. a system library.
>
> Keep in mind that although digests do a fairly good job of making
> unique identifiers for larger chunks of data, they can only hold so
> many unique combinations. Considering you're comparing blocks of a few
> kibibytes in size it's best to just do a foolproof comparison. There's
> nothing wrong with using a checksum/digest as a screening mechanism
> though.
>
> -- m. tharp

One thing in the above scheme that would be really interesting for all
possible hash functions is maintaining good stats on hash collisions,
effectiveness of the hash, etc. There has been a lot of press about MD5
hash collisions for example - it would be really neat to be able to
track real-world data on those.

Ric
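
The scheme quoted above (screen candidate blocks with a digest, confirm with a
bytewise comparison, and keep statistics on how often the digest misleads) can
be sketched roughly as follows. This is only an illustration, not code from
btrfs or from anyone in this thread: the 4096-byte block size, the function
names, and the shape of the statistics dictionary are all assumptions.

```python
import hashlib
import zlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed; the thread only says "a few kibibytes"


def split_blocks(data, size=BLOCK_SIZE):
    """Yield fixed-size blocks of data (the last one may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def md5_digest(block):
    return hashlib.md5(block).digest()


def crc32_digest(block):
    return zlib.crc32(block).to_bytes(4, "big")


def dedup_with_stats(blocks, digest=md5_digest):
    """Screen blocks by digest, confirm matches bytewise, and count how
    often a digest matched although the bytes differed (a collision)."""
    seen = defaultdict(list)  # digest -> distinct blocks sharing that digest
    stats = {"blocks": 0, "duplicates": 0, "collisions": 0}
    for block in blocks:
        stats["blocks"] += 1
        candidates = seen[digest(block)]
        if any(block == other for other in candidates):
            # Digest and bytes both match: safe to merge.
            stats["duplicates"] += 1
        else:
            if candidates:
                # Same digest, different bytes: record the collision.
                stats["collisions"] += 1
            candidates.append(block)
    return stats
```

Passing `digest=crc32_digest` instead of the default would allow comparing how
the two screening functions behave on the same data, which is the kind of
real-world collision data suggested above.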