Re: Data De-duplication - Oliver Mattos


From: Oliver Mattos <oliver.mattos08@imperial.ac.uk>
To: Chris Mason <chris.mason@oracle.com>
Cc: <linux-btrfs@vger.kernel.org>
Subject: Re: Data De-duplication
Date: Wed, 10 Dec 2008 17:53:12 +0000
Message-ID: <1228931592.6076.12.camel@mattos-laptop>
In-Reply-To: <1228915802.11900.8.camel@think.oraclecorp.com>

> > 2)  Keep a tree of checksums for data blocks, so that a bit of data can
> > be located by its checksum.  Whenever a data block is about to be
> > written check if the block matches any known block, and if it does then
> > don't bother duplicating the data on disk.  I suspect this option may
> > not be realistic for performance reasons.
> > 
> 
> When compression was added, the writeback path was changed to make
> option #2 viable, at least in the case where the admin is willing to
> risk hash collisions on strong hashes.  When a direct read
> comparison is required before sharing blocks, it is probably best done
> by a stand-alone utility, since we don't want to wait for a read of a
> full extent every time we want to write one.
> 

Can we assume hash collisions won't occur?  With a 256-bit hash, even
with 256TB of data and one hash per block, the chance of a collision is
still too small to calculate on gnome calculator.
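
As a rough check on that claim, here is a minimal sketch of the
birthday-bound estimate.  The 4 KiB block size is my assumption (the
block size isn't fixed above), and the function name is made up for
illustration.  It works in log2 so the result doesn't underflow:

```python
import math

def collision_probability_log2(data_bytes: int, block_bytes: int, hash_bits: int) -> float:
    """Return log2 of the birthday-bound estimate (n*(n-1)/2) / 2**hash_bits."""
    n = data_bytes // block_bytes      # number of blocks, one hash each
    pairs = n * (n - 1) // 2           # distinct pairs of blocks that could collide
    return math.log2(pairs) - hash_bits

# 256 TiB of data, assumed 4 KiB blocks, 256-bit hash
log2_p = collision_probability_log2(256 * 2**40, 4096, 256)
print(f"collision probability is at most about 2^{log2_p:.0f}")
```

This assumes the hash behaves like a uniformly random function, which
is exactly the assumption that fails if the algorithm is later broken.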

The only issue is if the hash algorithm is later found to be flawed: a
malicious bit of data could be stored on the disk whose hash would
collide with some more important data, potentially allowing the contents
of one file to be replaced with another.

Even if we don't assume hash collisions won't occur (e.g. for CRCs), the
write performance when writing duplicate files is equal to the read
performance of the disk, since for every block written by a program, one
block will need to be read back for comparison, and no blocks written.
This is still better than the ordinary write case (as most devices read
faster than they write), and has the added advantage of saving lots of
space.
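
To make the read-instead-of-write accounting concrete, here is a toy
sketch of a verify-before-share write path.  It is only an illustration
under my own assumptions, not how btrfs does or would do it: the
DedupStore class, the in-memory "disk" list and the use of zlib.crc32
as the weak checksum are all hypothetical.

```python
import zlib

class DedupStore:
    """Toy block store that shares a block only after a byte-for-byte comparison."""

    def __init__(self):
        self.disk = []          # simulated disk: list of stored blocks
        self.by_checksum = {}   # checksum -> indices of blocks with that checksum
        self.reads = 0          # blocks read back for verification
        self.writes = 0         # blocks actually written

    def write_block(self, data: bytes) -> int:
        """Store data and return the index of the block that holds it."""
        key = zlib.crc32(data)
        for idx in self.by_checksum.get(key, []):
            self.reads += 1                 # read the candidate back to verify
            if self.disk[idx] == data:
                return idx                  # true duplicate: share it, no write
        # No verified match (new checksum, or a checksum collision): write it.
        self.disk.append(data)
        self.writes += 1
        self.by_checksum.setdefault(key, []).append(len(self.disk) - 1)
        return len(self.disk) - 1

store = DedupStore()
blocks = [bytes([i]) * 4096 for i in range(4)]
first = [store.write_block(b) for b in blocks]    # original file: all written
second = [store.write_block(b) for b in blocks]   # duplicate file: verified, shared
print(store.reads, store.writes, first == second)
```

Writing the duplicate file costs one read per block and no additional
writes, and a checksum collision only costs an extra read rather than
corrupting data.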


