From: "Miguel Figueiredo Mascarenhas Sousa Filipe" <miguel.filipe@gmail.com>
To: "Oliver Mattos" <oliver.mattos08@imperial.ac.uk>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Data De-duplication
Date: Wed, 10 Dec 2008 11:52:58 +0000
Message-ID: <f058a9c30812100352l268205a7l259b5bf2bf45aae1@mail.gmail.com>
In-Reply-To: <1228862899.8130.1.camel@mattos-laptop>
Hi all,
On Tue, Dec 9, 2008 at 10:48 PM, Oliver Mattos
<oliver.mattos08@imperial.ac.uk> wrote:
> Hi,
>
> Say I download a large file from the net to /mnt/a.iso. I then download
> the same file again to /mnt/b.iso. These files now have the same
> content, but are stored twice since the copies weren't made with the bcp
> utility.
>
> The same occurs if a directory tree with duplicate files (created with
> bcp) is put through a non-aware program - for example tarred and then
> untarred again.
>
> This could be improved in two ways:
>
> 1) Make a utility which checks the checksums for all the data extents,
> and if the checksums of data match for two files then check the file
> data, and if the file data matches then keep only one copy. It could be
> run as a cron job to free up disk space on systems where duplicate data
> is common (eg. virtual machine images)
>
I have a Perl script that does this, written by a friend of mine.
First it sweeps a directory and stat()s every file, putting all files
of the same size X in a linked list under the hash table entry for
size X.
Then it md5sums all files with the same byte size to confirm whether
they really are copies of each other.
Because it only stats at first, and only md5sums files that are
"potential" duplicates, it's faster than the regular scripts I've seen.
Do you want this?
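
For reference, the size-then-hash approach described above could be
sketched roughly as follows. This is not the Perl script itself, only a
minimal Python illustration of the same idea; the names find_duplicates
and md5_of are made up for this example, and a dict of lists stands in
for the hash table of linked lists.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose size and MD5 both match."""
    # Pass 1: stat only, grouping regular files by byte size.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

A matching MD5 is strong evidence but not proof, so a byte-by-byte
comparison (for instance filecmp.cmp(a, b, shallow=False)) would be a
sensible last step before removing or merging anything.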
> 2) Keep a tree of checksums for data blocks, so that a bit of data can
> be located by its checksum. Whenever a data block is about to be
> written check if the block matches any known block, and if it does then
> don't bother duplicating the data on disk. I suspect this option may
> not be realistic for performance reasons.
>
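To make option 2 a bit more concrete, here is a toy in-memory model in
Python. It says nothing about how btrfs lays out data or would implement
this; the class name DedupStore, the 4096-byte block size, and the use
of SHA-256 are all assumptions chosen for illustration. A plain dict
plays the role of the tree of checksums.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that keeps one copy of each distinct block."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}   # checksum -> block bytes (the checksum index)
        self.files = {}    # file name -> list of block checksums

    def write(self, name, data):
        """Store data under name, reusing any block already in the store."""
        refs = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            key = hashlib.sha256(block).digest()
            existing = self.blocks.get(key)
            if existing is None:
                # First time this content is seen: store it once.
                self.blocks[key] = block
            elif existing != block:
                # Same checksum but different bytes: refuse rather than corrupt.
                raise ValueError("checksum collision")
            refs.append(key)
        self.files[name] = refs

    def read(self, name):
        """Reassemble a file from its list of block checksums."""
        return b"".join(self.blocks[key] for key in self.files[name])
```

Every write pays for a checksum and an index lookup per block, which is
the performance cost the quoted paragraph worries about. Because the
comparison is per block, two similar files share their identical blocks
even when the files as a whole differ.
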
> If either is possible then thought needs to be put into if it's worth
> doing on a file level, or a partial-file level (ie. if I have two
> similar files, can the space used by the identical parts of the files be
> saved)
>
> Has any thought been put into either 1) or 2) - are either possible or
> desired?
>
> Thanks
> Oliver
>
--
Miguel Sousa Filipe