From: Heinz-Josef Claes
Subject: Re: Data Deduplication with the help of an online filesystem check
Date: Tue, 5 May 2009 09:18:36 +0200
Message-ID: <20090505091836.9c60feb8.hjclaes@web.de>
References: <1240839448.26451.13.camel@think.oraclecorp.com> <20090428173251.GB7217@cip.informatik.uni-erlangen.de> <49F73FC9.3070607@partiallystapled.com> <49FEFBE6.40209@redhat.com> <49FEFE27.5090804@wpkg.org> <49FEFF9A.8060803@redhat.com> <20090504151518.GA13777@cip.informatik.uni-erlangen.de> <49FF11EE.2060404@redhat.com> <20090504162650.GD13777@cip.informatik.uni-erlangen.de> <49FF3DCD.40306@web.de> <3a7f57190905041429u14c16412rc25b10018a19abd6@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Cc: linux-btrfs@vger.kernel.org
To: Dmitri Nikulin
Return-path:
In-Reply-To: <3a7f57190905041429u14c16412rc25b10018a19abd6@mail.gmail.com>
List-ID:

On Tue, 5 May 2009 07:29:45 +1000
Dmitri Nikulin wrote:

> On Tue, May 5, 2009 at 5:11 AM, Heinz-Josef Claes wrote:
> > Hi, during the last half year I thought a little bit about doing dedup
> > for my backup program: not only with fixed blocks (which is
> > implemented), but with moving blocks (at every offset in a file: 1 byte,
> > 2 bytes, ...). That means I have to do *lots* of comparisons (size of
> > file - blocksize). Even though it's not quite the same, it has to be
> > very fast, and that's the same problem as the one discussed here.
> >
> > My solution (not yet implemented) is as follows (hopefully I remember
> > well):
> >
> > I calculate a checksum of 24 bits. (There could be another size.)
> >
> > This means I can have 2^24 different checksums.
> >
> > Therefore, I hold a bit vector in memory: one bit for each possible
> > checksum, which is 2 MB for 24 bits (0.5 GB would correspond to a
> > 32-bit checksum; I'm just in a hotel and have no calculator). This
> > vector is initialized with zeros.
> >
> > For each calculated checksum of a block, I set the corresponding bit in
> > the bit vector.
> >
> > It's very fast to check whether a block with a given checksum exists in
> > the filesystem (the backup, in my case) by checking the appropriate bit
> > in the bit vector.
> >
> > If the bit is not set, it's a new block.
> >
> > If it is set, a separate 'real' check is needed to verify that it's
> > really the same block (which is slow, but that happens <<1% of the
> > time).
>
> Which means you have to refer to each block in some unique way from
> the bit vector, making it a block pointer vector instead. That's only
> 64 times more expensive for a 64-bit offset...

It was not the idea to have a pointer vector, only a bit vector. A pointer
vector would be too big to hold in RAM. Therefore, I need to go to disk for
the more exact comparison using md5sums (which is what I wanted to use). The
bit vector is only there to give a very quick decision in most cases (a
speedup). But I have no idea if it fits this use case; I'm not a filesystem
developer ;-)

> Since the overwhelming majority of combinations will never appear in
> practice, you are much better served with a self-sizing data structure
> like a hash map, or even a binary tree, or a hash map with each bucket
> being a binary tree, etc. You can use any sized hash and it won't
> affect the number of nodes you have to store. You can trade off CPU
> against RAM easily, just by selecting an appropriate data structure. A
> bit vector, and especially a pointer vector, has extremely bad "any"
> case RAM requirements, because even if you're deduping a mere 10 blocks
> you're still allocating and initialising 2^24 entries. The least you
> could do is adaptively switch to a more efficient data structure if you
> see that the number of blocks is low enough.
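The bit-vector scheme described above can be sketched like this. This is an illustrative sketch only, not the actual backup-program code: the class and method names are made up, and the low 24 bits of an MD5 digest stand in for whatever fast checksum would really be used.

```python
import hashlib

BITS = 24            # checksum width from the proposal above
SIZE = 1 << BITS     # 2^24 possible checksums -> 2 MB of bits

class DedupFilter:
    """One bit per possible checksum value, all zero initially."""

    def __init__(self):
        self.bits = bytearray(SIZE // 8)

    def _checksum(self, block: bytes) -> int:
        # Any fast checksum works; here the low 24 bits of MD5 stand in.
        return int.from_bytes(hashlib.md5(block).digest()[:3], "big")

    def seen_maybe(self, block: bytes) -> bool:
        # True means only that some block with the same 24-bit checksum
        # was added before: a separate 'real' check (stronger hash or
        # byte-for-byte compare) must confirm it, since different blocks
        # can collide on 24 bits.
        c = self._checksum(block)
        return bool(self.bits[c >> 3] & (1 << (c & 7)))

    def add(self, block: bytes) -> None:
        # Set the bit for this block's checksum.
        c = self._checksum(block)
        self.bits[c >> 3] |= 1 << (c & 7)
```

For the self-sizing alternative Dmitri suggests, a plain hash set of checksums would grow with the number of distinct blocks actually seen, instead of always paying the full 2 MB up front.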
>
> --
> Dmitri Nikulin
>
> Centre for Synchrotron Science
> Monash University
> Victoria 3800, Australia

--
Heinz-Josef Claes