public inbox for linux-btrfs@vger.kernel.org
* Data De-duplication
@ 2008-12-09 22:48 Oliver Mattos
From: Oliver Mattos @ 2008-12-09 22:48 UTC
  To: linux-btrfs

Hi,

Say I download a large file from the net to /mnt/a.iso.  I then download
the same file again to /mnt/b.iso.  These files now have the same
content, but are stored twice since the copies weren't made with the bcp
utility.

The same occurs if a directory tree with duplicate files (created with
bcp) is put through a non-aware program - for example tarred and then
untarred again.

This could be improved in two ways:

1)  Make a utility which compares the checksums of all the data extents.
If the checksums match for two files, it compares the file data itself,
and if the data also matches it keeps only one copy.  It could be run as
a cron job to free up disk space on systems where duplicate data is
common (e.g. virtual machine images).

2)  Keep a tree of checksums for data blocks, so that a bit of data can
be located by its checksum.  Whenever a data block is about to be
written, check whether it matches any known block, and if it does then
don't bother duplicating the data on disk.  I suspect this option may
not be realistic for performance reasons.
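
As a rough illustration of option 1, here is a minimal userspace sketch in
Python.  It is not a btrfs tool: it hashes whole files with SHA-256 as a
stand-in for the filesystem's own extent checksums, and it only reports
duplicate groups rather than merging them.  The names file_digest and
find_duplicates are made up for this example.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups of regular files under root with identical contents."""
    # Step 1: bucket files by (size, checksum).  Assumption: a userspace
    # hash stands in for the checksums the filesystem already keeps.
    by_key = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                key = (os.path.getsize(path), file_digest(path))
                by_key[key].append(path)

    # Step 2: a checksum match is only a hint, so confirm each candidate
    # with a full byte-for-byte comparison before calling it a duplicate.
    groups = []
    for paths in by_key.values():
        remaining = sorted(paths)
        while remaining:
            first, rest = remaining[0], remaining[1:]
            same = [p for p in rest if filecmp.cmp(first, p, shallow=False)]
            if same:
                groups.append([first] + same)
            remaining = [p for p in rest if p not in same]
    return groups
```

Run from cron, something like find_duplicates("/mnt") would produce the
list of candidates; actually sharing the storage between them would still
need support from the filesystem.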

If either is possible then thought needs to be put into whether it's
worth doing at the file level or at a partial-file level (i.e. if I have
two similar files, can the space used by the identical parts of the
files be saved?)
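
To make the partial-file case concrete, here is a toy in-memory sketch in
Python of option 2 applied per block.  Everything in it is an assumption
made for illustration: the 4096-byte block size, the use of SHA-256, and
the BlockStore class itself, which has nothing to do with how btrfs lays
out data.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class BlockStore:
    """Toy store that keeps a single copy of each distinct block."""

    def __init__(self):
        self.blocks = {}  # checksum -> block data (the checksum index)
        self.files = {}   # file name -> ordered list of block checksums

    def write_file(self, name, data):
        refs = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            key = hashlib.sha256(block).digest()
            existing = self.blocks.get(key)
            if existing is None:
                # First time this content is seen: store it.
                self.blocks[key] = block
            elif existing != block:
                # Same checksum, different data: refuse rather than
                # silently corrupt the file.
                raise ValueError("checksum collision")
            # Otherwise the block is already stored; just reference it.
            refs.append(key)
        self.files[name] = refs

    def read_file(self, name):
        return b"".join(self.blocks[k] for k in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())
```

Writing two files that share their first three blocks stores five blocks
rather than eight.  Note that fixed-size blocks only catch identical data
at the same block-aligned offsets, so inserting a few bytes near the start
of one file would defeat this simple scheme.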

Has any thought been put into either 1) or 2)?  Is either possible or
desired?

Thanks
Oliver


