From: Heinz-Josef Claes <hjclaes@web.de>
To: Thomas Glanzmann <thomas@glanzmann.de>
Cc: Ric Wheeler <rwheeler@redhat.com>,
Tomasz Chmielewski <mangoo@wpkg.org>,
Michael Tharp <gxti@partiallystapled.com>,
Chris Mason <chris.mason@oracle.com>,
linux-btrfs@vger.kernel.org
Subject: Re: Data Deduplication with the help of an online filesystem check
Date: Mon, 04 May 2009 21:11:09 +0200
Message-ID: <49FF3DCD.40306@web.de>
In-Reply-To: <20090504162650.GD13777@cip.informatik.uni-erlangen.de>
Thomas Glanzmann wrote:
> Ric,
>
>
>> I would not categorize it as offline, but just not as inband (i.e., you can
>> run a low priority background process to handle dedup).
>>
>
>
>> Offline windows are extremely rare in production sites these days and
>> it could take a very long time to do dedup at the block level over a
>> large file system :-)
>>
>
> let me rephrase, by offline I meant asynchronous during off hours.
>
>
Hi, over the last half year I have thought a bit about doing dedup
for my backup program: not only with fixed blocks (which is
implemented), but with moving blocks (at every offset in a file: 1
byte, 2 bytes, ...). That means I need *lots* of comparisons
(size of file - blocksize). Even though it's not the same use case, it
must be very fast, and that's the same problem as the one discussed here.
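A minimal sketch of how one checksum per offset could be computed cheaply,
assuming a polynomial rolling checksum (this technique, the names
rolling_checksums, BASE and MASK, and the choice of base 257 are
illustrative assumptions, not something described in this mail). Each step
removes the byte leaving the window and adds the byte entering it, so the
cost per offset does not depend on the block size. Counting the window at
offset 0, a file yields (size of file - blocksize + 1) checksums.

```python
MASK = (1 << 24) - 1   # keep checksums at 24 bits, as in the mail
BASE = 257             # hypothetical multiplier, chosen for illustration


def rolling_checksums(data: bytes, block_size: int):
    """Yield (offset, checksum) for every block_size-byte window in data."""
    if block_size <= 0 or len(data) < block_size:
        return
    # Weight of the oldest byte in the window: BASE**(block_size - 1).
    top = pow(BASE, block_size - 1, 1 << 24)
    h = 0
    for b in data[:block_size]:
        h = (h * BASE + b) & MASK
    yield 0, h
    for i in range(block_size, len(data)):
        outgoing = data[i - block_size]
        # Drop the outgoing byte, shift the window, add the incoming byte.
        h = ((h - outgoing * top) * BASE + data[i]) & MASK
        yield i - block_size + 1, h
```

The checksum of each window equals what a from-scratch computation over
that window would give; only the cost differs.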
My solution (not yet implemented) is as follows (hopefully I remember it
correctly):
I calculate a checksum of 24 bits (another size is possible).
This means I can have 2^24 different checksums.
Therefore, I hold a bit vector of 0.5 GB in memory (I hope I remember
correctly, I'm in a hotel and have no calculator): one bit for each
possible checksum value. This vector is initialized with zeros.
For each calculated checksum of a block, I set the corresponding bit in
the bit vector.
It's very fast to check whether a block with a particular checksum exists
in the filesystem (the backup, in my case) by checking the corresponding
bit in the bit vector.
If the bit is not set, it's a new block.
If it is set, a separate 'real' check is needed to see whether it's really
the same block (which is slow, but that happens <<1% of the time).
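The scheme above can be sketched as follows. This is only an illustration
under stated assumptions: the class ChecksumBitVector, the function
dedup_blocks, the use of a truncated CRC-32 as the checksum, and the
in-memory list standing in for the backup store are all hypothetical and
not taken from any existing program. On the sizes: one bit per value of a
24-bit checksum needs 2^24 bits = 2 MiB, while a vector of 0.5 GB
corresponds to a 32-bit checksum (2^32 bits = 512 MiB).

```python
import zlib


class ChecksumBitVector:
    """One bit per possible checksum value; all bits start at zero."""

    def __init__(self, bits: int = 24):
        # Assumes bits >= 3 so the vector is a whole number of bytes.
        self.bits = bits
        self.vector = bytearray((1 << bits) // 8)

    def checksum(self, block: bytes) -> int:
        # Assumption: CRC-32 truncated to `bits` bits stands in for
        # whatever checksum is actually used.
        return zlib.crc32(block) & ((1 << self.bits) - 1)

    def add(self, block: bytes) -> None:
        c = self.checksum(block)
        self.vector[c >> 3] |= 1 << (c & 7)

    def might_contain(self, block: bytes) -> bool:
        # False means "definitely new"; True means "needs the real check".
        c = self.checksum(block)
        return bool(self.vector[c >> 3] & (1 << (c & 7)))


def dedup_blocks(blocks):
    """Return the unique blocks, in order of first appearance."""
    seen = ChecksumBitVector()
    stored = []  # stand-in for the backup store
    for block in blocks:
        # Fast path: bit not set, so the block is new.
        # Slow path: bit set, so compare against the stored data.
        if seen.might_contain(block) and block in stored:
            continue
        seen.add(block)
        stored.append(block)
    return stored
```

For example, dedup_blocks([b"a", b"b", b"a"]) keeps b"a" and b"b" once
each. A checksum collision only costs one extra slow comparison and never
causes a distinct block to be dropped.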
I hope my thoughts are understandable. I'm in a hotel and may not be
able to follow the emails on this list over the next hours or days.
Regards, HJC
>> 1/3 is not sufficient for dedup in my opinion - you can get that with
>> normal compression at the block level.
>>
>
> 1/3 is what real-time data from a production environment in a
> mixed VM setup gives me, without compression.
>
> Thomas