public inbox for linux-btrfs@vger.kernel.org
From: Oliver Mattos <oliver.mattos08@imperial.ac.uk>
To: Ray Van Dolson <rayvd@bludgeon.org>
Cc: <btrfs-devel@arbitraryconstant.com>,
	Chris Mason <chris.mason@oracle.com>,
	<linux-btrfs@vger.kernel.org>
Subject: Re: Data De-duplication
Date: Wed, 10 Dec 2008 21:42:16 +0000
Message-ID: <1228945336.7571.26.camel@mattos-laptop>
In-Reply-To: <20081210211903.GA29002@bludgeon.org>

I see quite a few uses for this.  The kernel-mode automatic
de-dup-on-write code looks like it might cost performance, require disk
format changes, and be controversial, but it sounds like the user-mode
utility could be implemented today.

It looks like a simple script could do the job: iterate through every
file in the filesystem, run md5sum on every block of every file, and
whenever a duplicate is found, call an ioctl to remove the duplicate
data.  Because it checksums each block rather than each whole file, it
can also effectively compress disk images.
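
The scanning half of that idea can be sketched in Python.  This is only
an illustration under stated assumptions: the 4 KiB block size and the
function names are hypothetical and not from this thread, and the
de-duplicating ioctl itself is left out because no such interface is
specified here.  The sketch only reports which blocks share an MD5
digest; a real tool would presumably compare the bytes before sharing
anything, since matching digests do not strictly prove matching data.

```python
import hashlib
import os
from collections import defaultdict

BLOCK_SIZE = 4096  # assumption: 4 KiB blocks; the email does not give a size


def iter_block_digests(path, block_size=BLOCK_SIZE):
    """Yield (offset, md5 hex digest) for each block of the file at path."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield offset, hashlib.md5(block).hexdigest()
            offset += len(block)


def find_duplicate_blocks(root, block_size=BLOCK_SIZE):
    """Map each digest seen more than once to its list of (path, offset)."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, devices, sockets and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                for offset, digest in iter_block_digests(path, block_size):
                    index[digest].append((path, offset))
            except OSError:
                continue  # unreadable file: skip it
    return {d: locs for d, locs in index.items() if len(locs) > 1}
```

Calling find_duplicate_blocks("/some/dir") returns a dict whose values
are the candidate locations a de-dup step would need to act on.  The
index holds one entry per block in memory, so this is suitable for a
trial run on a modest tree rather than a whole disk.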

While not very efficient, it should work.  Having something like this
in the toolkit would mean that, as soon as btrfs gets stable enough for
everyday use, it would immediately outdo every other Linux filesystem
in space efficiency for some workloads.

In the long term, kernel-mode de-duplication would probably be good.
I'm willing to bet even the average user has, say, 1-2% of data
duplicated somewhere on the HD due to accidental copies instead of
moves, the same application installed to two different paths, two users
who happen to have the same file saved in each of their home folders,
etc., so even the average user would benefit slightly.

I'm considering writing that script to test on my ext3 disk just to see
how much duplicate wasted data I really have.
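
A read-only measurement of that kind works on any filesystem, since it
never modifies data.  The following Python sketch is one possible shape
for it, not the script referred to above: the function name and the
4 KiB block size are assumptions made for illustration.  It counts how
many bytes sit in blocks whose MD5 digest has already been seen
elsewhere in the tree.

```python
import hashlib
import os


def duplicate_block_stats(root, block_size=4096):
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    A block counts as duplicate when its MD5 digest was already seen;
    the first occurrence of each digest is not counted as wasted.
    block_size=4096 is an assumed value, not one given in the email.
    """
    seen = set()
    total = 0
    duplicate = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, devices, sockets and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # iter() with a sentinel stops at end of file (b"").
                    for block in iter(lambda: f.read(block_size), b""):
                        total += len(block)
                        digest = hashlib.md5(block).digest()
                        if digest in seen:
                            duplicate += len(block)
                        else:
                            seen.add(digest)
            except OSError:
                continue  # unreadable file: skip it
    return total, duplicate


def format_report(total, duplicate):
    """Render the two byte counts as a one-line summary with a percentage."""
    pct = 100.0 * duplicate / total if total else 0.0
    return f"{duplicate} of {total} bytes ({pct:.1f}%) are in duplicate blocks"
```

For example, print(format_report(*duplicate_block_stats("/home"))) gives
a single summary line.  Hard-linked files and runs of zero-filled blocks
are counted as duplicates too, so the figure is an upper estimate of
what a de-dup pass could reclaim.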

Thanks
Oliver



