From: Ray Van Dolson <rayvd@bludgeon.org>
To: Oliver Mattos <oliver.mattos08@imperial.ac.uk>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	Tracy Reed <treed@tracyreed.org>,
	btrfs-devel@arbitraryconstant.com,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: Data De-duplication
Date: Wed, 10 Dec 2008 19:50:52 -0800
Message-ID: <20081211035052.GA3917@bludgeon.org>
In-Reply-To: <1228966979.7571.48.camel@mattos-laptop>

On Thu, Dec 11, 2008 at 03:42:58AM +0000, Oliver Mattos wrote:
> Here is a script to locate duplicate data WITHIN files:
> 
> On some test file sets of binary data with no duplicated files, about 3%
> of the data blocks were duplicated, and about 0.1% of the data blocks
> were nulls.  The data was mainly elf and win32 binaries plus some random
> game data, office documents and a few images.
> 
> This code is hideously slow, so don't give it more than a couple of MB
> of files to chew through at once.  In retrospect I should've just
> written it in plain fast C instead of fighting with bash pipes!
> 
> Note to get "verbose" output, just remove everything after the word
> "sort" in the code.

Neat.  Thanks much.  It'd be cool to write each block hash out to a
database so you can get a feel for how many duplicate blocks there are
across files as well.
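
A rough sketch of that idea, in case it helps.  It assumes fixed 4 KiB
blocks, SHA-1 hashes and an SQLite database; none of those are specified
in this thread, and the table, column and function names are made up for
illustration.  It is not a port of Oliver's script.

```python
import hashlib
import sqlite3
import sys

BLOCK_SIZE = 4096  # assumed block size; not stated in the thread


def open_db(db_path=":memory:"):
    """Open (or create) the hypothetical block-hash database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks "
        "(hash TEXT, path TEXT, block_offset INTEGER)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS blocks_hash ON blocks (hash)")
    return conn


def index_file(conn, path, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of `path` and record one row per block."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            digest = hashlib.sha1(block).hexdigest()
            conn.execute(
                "INSERT INTO blocks (hash, path, block_offset) VALUES (?, ?, ?)",
                (digest, path, offset),
            )
            offset += len(block)


def duplicate_stats(conn):
    """Return (total_blocks, redundant_blocks, hashes_shared_across_files)."""
    total = conn.execute("SELECT COUNT(*) FROM blocks").fetchone()[0]
    unique = conn.execute("SELECT COUNT(DISTINCT hash) FROM blocks").fetchone()[0]
    shared = conn.execute(
        "SELECT COUNT(*) FROM ("
        "SELECT hash FROM blocks GROUP BY hash HAVING COUNT(DISTINCT path) > 1)"
    ).fetchone()[0]
    return total, total - unique, shared


if __name__ == "__main__":
    conn = open_db()
    for p in sys.argv[1:]:
        index_file(conn, p)
    conn.commit()
    total, redundant, shared = duplicate_stats(conn)
    print(f"{total} blocks, {redundant} redundant, "
          f"{shared} distinct hashes seen in more than one file")
```

Run it with a list of files as arguments.  "Redundant" here counts every
block beyond the first copy of each hash, whether the copies sit in the
same file or in different ones; the last figure counts only hashes that
appear in at least two different files.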

I'd like to run this in a similar setup on all my VMware VMDK files and
get an idea of how much space savings there would be across 20+ Windows
2003 VMDK files... probably *lots* of common blocks.

Ray

