Re: Data De-duplication - Oliver Mattos


From: Oliver Mattos <oliver.mattos08@imperial.ac.uk>
To: Ray Van Dolson <rayvd@bludgeon.org>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>,
	Tracy Reed <treed@tracyreed.org>,
	<btrfs-devel@arbitraryconstant.com>,
	Chris Mason <chris.mason@oracle.com>
Subject: Re: Data De-duplication
Date: Thu, 11 Dec 2008 09:58:46 +0000
Message-ID: <1228989526.17969.17.camel@mattos-laptop>
In-Reply-To: <20081211035052.GA3917@bludgeon.org>

> Neat.  Thanks much.  It'd be cool to output the results of each of your
> hashes to a database so you can get a feel for how many duplicate
> blocks there are cross-files as well.
> 
> I'd like to run this in a similar setup on all my VMware VMDK files and
> get an idea of how much space savings there would be across 20+ Windows
> 2003 VMDK files... probably *lots* of common blocks.
> 
> Ray
Hi,

Currently it DOES do cross-file block matching - that's why it takes so
long to run :-)

If you remove everything after the word "sort", it will produce verbose
output, which you could then load into an SQL database if you wanted.
You could relatively easily turn that output into a format for direct
import into an SQL database by modifying the line with the "dd" in it
inside the first while loop.  You can also remove the "sort" and the
pipe before it to get unsorted output - the advantage is that it takes
less time.
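
For illustration only, here is a minimal Python sketch of the same idea:
hash every fixed-size block of each file and load one row per block into
an SQL database, so duplicates across files can be counted with a query.
It is not the shell script discussed in this thread.  The 4096-byte block
size, the SHA-256 hash, the SQLite backend, and the table and function
names are all assumptions made for the example.

```python
import hashlib
import sqlite3
import sys

BLOCK_SIZE = 4096  # assumed block size, not taken from the original script


def iter_block_hashes(path, block_size=BLOCK_SIZE):
    """Yield (offset, hex digest) for each fixed-size block of one file."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # SHA-256 is an assumption; any strong hash would do here.
            yield offset, hashlib.sha256(block).hexdigest()
            offset += len(block)


def load_hashes(db, paths, block_size=BLOCK_SIZE):
    """Insert one (hash, file, offset) row per block for every given file."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS blocks (hash TEXT, file TEXT, offset INTEGER)"
    )
    for path in paths:
        rows = (
            (digest, str(path), offset)
            for offset, digest in iter_block_hashes(path, block_size)
        )
        db.executemany("INSERT INTO blocks VALUES (?, ?, ?)", rows)
    db.commit()


def duplicate_summary(db):
    """Return (total blocks, distinct blocks) across all loaded files."""
    return db.execute(
        "SELECT COUNT(*), COUNT(DISTINCT hash) FROM blocks"
    ).fetchone()


if __name__ == "__main__":
    # Usage: python blockhash.py image1 image2 ...
    conn = sqlite3.connect(":memory:")
    load_hashes(conn, sys.argv[1:])
    total, unique = duplicate_summary(conn)
    print(f"{total} blocks, {unique} unique")
```

Because all files go into one table, the match is cross-file by
construction.  Nothing is sorted until a query asks for it, which is the
same trade-off as dropping the "sort" stage.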

I'm guessing that if you had the time to run this on multi-gigabyte disk
images you'd find that as much as 80% of the blocks are duplicated
between any two virtual machines of the same operating system.

That means if you have 20 Win 2k3 VMs and the first VM image of Windows
+ software is 2GB (after nulls are removed), the total size for 20 VMs
could be ~6GB (remembering there will be extra redundancy the more VMs
you add) - not a bad saving.
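
As a rough check of that arithmetic, here is a small Python sketch.  It
assumes a deliberately simple model: the first image is stored in full,
and every further VM shares a fixed fraction of its blocks with what is
already stored.  The function name and the model are assumptions for
illustration, not measurements.

```python
def estimated_total_gb(vm_count, first_image_gb, shared_fraction):
    """Estimate deduplicated size: first image in full, then only the
    unshared part of each additional image (simplified, assumed model)."""
    unique_per_extra_vm = first_image_gb * (1.0 - shared_fraction)
    return first_image_gb + (vm_count - 1) * unique_per_extra_vm


if __name__ == "__main__":
    # 20 VMs, 2GB first image, 80% of blocks shared with existing data.
    print(estimated_total_gb(20, 2.0, 0.80))
    # Same setup with about 89% shared, i.e. extra redundancy as VMs are added.
    print(estimated_total_gb(20, 2.0, 0.89))
```

With a flat 80% the model gives roughly 9.6GB, against 40GB without
de-duplication.  The ~6GB figure corresponds to each additional VM
contributing only about 0.2GB of new blocks (about 89% shared), which is
where the extra redundancy from adding more VMs comes in.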

Thanks
Oliver

