From: Andi Kleen <andi@firstfloor.org>
To: Morey Roof <moreyroof@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: New feature Idea
Date: Wed, 13 Aug 2008 22:00:13 +0200
Message-ID: <87y730znw2.fsf@basil.nowhere.org>
In-Reply-To: <48A320A0.80609@gmail.com> (Morey Roof's message of "Wed, 13 Aug 2008 11:57:53 -0600")
Morey Roof <moreyroof@gmail.com> writes:
> I have been thinking about a new feature to start work on that I am
> interested in and I was hoping people could give me some feedback and
> ideas of how to tackle it. Anyways, I want to create a data
> deduplication system that can work in two different modes. One mode
> is that when the system is idle or not beyond a set load point a
> background process would scan the volume for duplicate blocks. The
> other mode would be used for systems that are nearline or backup
> systems that don't really care about the performance and it would do
> the deduplication during block allocation.
Seems like a special case of compression? Perhaps compression would help
more?
> One of the ways I was thinking of to find the duplicate blocks would
> be to use the checksums as a quick compare. If the checksums match
> then do a complete compare before adjusting the nodes on the files.
> However, I believe that I will need to create a tree based on the
> checksum values.
If you really want to do deduplication, it might be advantageous to do
it on units larger than single blocks.
If you assume that data is usually shared between similar files (which
is a reasonable assumption) and do the deduplication on whole files
you can also use the size as an index and avoid checksumming all files
with a unique size. I wrote a user-level duplicate file checker some
time ago that used this trick successfully.
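
A minimal sketch of the size-index trick, for illustration only. This is
not the checker mentioned above; the function name find_duplicate_files,
the choice of SHA-1, and reading each file whole into memory are all
assumptions made to keep the example short. Files are first grouped by
size, only files that share a size are checksummed, and checksum matches
are confirmed with a complete byte comparison before being reported.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    # Step 1: index by size. A file with a unique size cannot have a
    # duplicate, so it is never opened or checksummed.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue

        # Step 2: checksum only the files that share a size.
        # (Assumption: files are small enough to read in one go.)
        by_digest = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            by_digest[digest].append(path)

        # Step 3: a checksum match is only a hint, so confirm it with a
        # complete compare before treating the files as duplicates.
        for candidates in by_digest.values():
            groups = []
            for path in candidates:
                for group in groups:
                    if filecmp.cmp(group[0], path, shallow=False):
                        group.append(path)
                        break
                else:
                    groups.append([path])
            duplicates.extend(g for g in groups if len(g) > 1)
    return duplicates
```

Calling find_duplicate_files() on a list of paths returns one list per
set of identical files. On a tree where most sizes are unique, most
files are only stat()ed and never read.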
-Andi