From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michael Tharp
Subject: Re: Data Deduplication with the help of an online filesystem check
Date: Tue, 28 Apr 2009 13:41:29 -0400
Message-ID: <49F73FC9.3070607@partiallystapled.com>
References: <20090427033331.GC17677@cip.informatik.uni-erlangen.de> <1240839448.26451.13.camel@think.oraclecorp.com> <20090428155900.GA1722@cip.informatik.uni-erlangen.de> <49F728F6.6030307@wpkg.org> <20090428173251.GB7217@cip.informatik.uni-erlangen.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Cc: Tomasz Chmielewski, Chris Mason, linux-btrfs@vger.kernel.org
To: Thomas Glanzmann
In-Reply-To: <20090428173251.GB7217@cip.informatik.uni-erlangen.de>

Thomas Glanzmann wrote:
> no, I just used the md5 checksum. And even if I have a hash escalation
> which is highly unlikely it still gives a good house number.

I'd start with a crc32 and/or MD5 to find candidate blocks, then do a
bytewise comparison before actually merging them. Even the small risk of
an accidental collision is too high, and considering that practical MD5
collision attacks are well known, it would not be extraordinarily
difficult to construct a block that collides with e.g. a system library.
Keep in mind that although digests do a fairly good job of producing
unique identifiers for larger chunks of data, they can only hold so many
unique combinations. Considering you're comparing blocks of a few
kibibytes in size, it's best to just do a foolproof comparison. There's
nothing wrong with using a checksum/digest as a screening mechanism,
though.
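To make the screening step concrete, here's a minimal userspace sketch
of the hash-then-verify idea. This is hypothetical illustration, not
anything in btrfs; it assumes OpenSSL's MD5() for the digest, and the
4 KiB block size and safe_to_merge() name are my own:

/* Hypothetical sketch of hash-then-verify dedup screening; not
 * btrfs code. Assumes OpenSSL's MD5(); the 4 KiB block size is
 * an illustrative choice.
 * Build: cc dedup_sketch.c -lcrypto */
#include <openssl/md5.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

struct block {
	unsigned char data[BLOCK_SIZE];
	unsigned char digest[MD5_DIGEST_LENGTH];
};

/* The digest only nominates candidates: a mismatch proves the
 * blocks differ, but a match must be confirmed bytewise before
 * the blocks may be merged. */
static int safe_to_merge(const struct block *a, const struct block *b)
{
	if (memcmp(a->digest, b->digest, MD5_DIGEST_LENGTH) != 0)
		return 0;	/* cheap reject: digests differ */
	/* Digests match -- could still be a collision, so compare
	 * every byte before deduplicating. */
	return memcmp(a->data, b->data, BLOCK_SIZE) == 0;
}

int main(void)
{
	struct block a, b;

	memset(a.data, 0xAA, sizeof(a.data));
	memcpy(b.data, a.data, sizeof(b.data));

	MD5(a.data, BLOCK_SIZE, a.digest);
	MD5(b.data, BLOCK_SIZE, b.digest);

	printf(safe_to_merge(&a, &b)
	       ? "verified identical; safe to merge\n"
	       : "digest match was spurious; keep both copies\n");
	return 0;
}

A real implementation would presumably bucket blocks by digest first so
the expensive memcmp() only runs on candidate pairs within a bucket.

--
m. tharp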