* Online Deduplication for Btrfs (Master's thesis)
@ 2012-12-17 12:05 Martin Křížek
2012-12-17 13:12 ` Hubert Kario
2012-12-17 13:33 ` Alexander Block
0 siblings, 2 replies; 8+ messages in thread
From: Martin Křížek @ 2012-12-17 12:05 UTC (permalink / raw)
To: linux-btrfs; +Cc: lczerner
Hello everyone,
my name is Martin Krizek. I am a student at the Faculty of Information Technology,
Brno University of Technology, Czech Republic. For my master's thesis I chose to
work on Online Deduplication for Btrfs.
My goal for this semester is to study the Btrfs design and the offline
deduplication patch [1], and to come up with a design for the online dedup.
I will be implementing the feature next semester (spring, that is).
I would like to briefly introduce my ideas on what I think the feature
should look like.
* Use cases
The two main use cases for the dedup are:
1. virtual machine images
2. backup servers
* When to dedup
The choice of 'when to dedup' is not up to me, as the master's thesis
specification states "online". :)
* What to dedup
I'd say the most reasonable is block-level deduplication as it seems the most
general approach to me.
* Controlling dedup
- turn on/off deduplication - specify subvolumes on which deduplication is
turned on (mount, ioctl - inherited),
- turn on/off byte-by-byte comparison of blocks that have the same hashes
(mount, ioctl),
- deduplication statistics (ioctl)
* Limitations
Not really limitations, but this is a list of situations when dedup will not
be triggered:
- encryption,
- compression - basically, dedup is a kind of compression; might be worth looking
into it in the future though,
- inline/prealloc extents,
- data across subvolumes
* How to store hashes
The obvious choice would be to use the checksum tree that holds block checksums
of each extent. The problem with the checksum tree is that the
checksums are looked up by the logical address of the start of the extent data.
This is undesirable since dedup needs to go the other way around: logical
addresses need to be looked up by a hash.
To solve this, another key type would be created inside the checksum tree (or,
maybe better, a new tree would be introduced) whose items
would have the hash as the right-hand key value. This way, items could be
looked up by a hash:
(root, HASH_ITEM, hash)
The root value says which root (subvolume) the hashed block is stored on. The
hash value is the hash itself.
The item data would be of the following structure:
struct btrfs_hash_item {
        __le64 disk_bytenr;
        __le64 disk_num_bytes;
        __le64 offset;
};
Fields disk_bytenr and disk_num_bytes would point to an
extent item of the extent allocation tree. The offset is the offset of the hashed
block in the extent.
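For illustration, here is a rough userspace sketch of the proposed layout
(HASH_ITEM and btrfs_hash_item do not exist in Btrfs yet; the key type value
below is made up):

#include <stdint.h>
#include <stdio.h>

#define HASH_ITEM_KEY 0xf0              /* hypothetical key type value */

struct hash_item {                      /* the proposed btrfs_hash_item */
        uint64_t disk_bytenr;           /* start of the extent holding the block */
        uint64_t disk_num_bytes;        /* length of that extent */
        uint64_t offset;                /* offset of the hashed block in the extent */
};

int main(void)
{
        /*
         * A lookup would use the key (root, HASH_ITEM, hash); on a match,
         * the candidate block lives at disk_bytenr + offset.
         */
        struct hash_item found = {
                .disk_bytenr = 1048576, .disk_num_bytes = 131072, .offset = 8192,
        };
        printf("candidate block at byte %llu\n",
               (unsigned long long)(found.disk_bytenr + found.offset));
        return 0;
}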
* When to trigger dedup exactly
So it might be appropriate to delay the process until just before data are
written out to disk, to give the data a chance to be changed. I looked a bit
into the compression code and it seems like dedup could use a similar approach
- to have threads that would do dedup on sync or general writeback.
* Hash algorithm
sha256 seems like a safe choice.
* Future improvements
There are some features that might be considered if time permits or as something
that could be implemented in the future:
- dedup across subvolumes
- specify a hash algorithm
- specify number of deduplication blocks
I would appreciate any comments you might have, thanks!
Best regards,
Martin Krizek
[1] http://lwn.net/Articles/422331/
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-17 12:05 Online Deduplication for Btrfs (Master's thesis) Martin Křížek
@ 2012-12-17 13:12 ` Hubert Kario
2012-12-19 16:58 ` Martin Křížek
2012-12-17 13:33 ` Alexander Block
1 sibling, 1 reply; 8+ messages in thread
From: Hubert Kario @ 2012-12-17 13:12 UTC (permalink / raw)
To: Martin Křížek; +Cc: linux-btrfs, lczerner
On Monday 17 of December 2012 13:05:01 Martin Křížek wrote:
> * Limitations
> Not really limitations, but this is a list of situations when dedup will
> not be triggered:
> - compression - basically, dedup is a kind of compression; might be worth
> looking into it in the future though
I don't see why it would be incompatible; compressed blocks are data like
any other. COW and subvolume snapshots work with compressed nodes just as
well as with regular ones...
> - data across subvolumes
Wasn't the "cp --reflink" across subvolumes merged to mainline quite a bit
ago? Under 3.6.9 it works fine for me... Also, the blocks are shared between
subvolumes if the other subvolume is a snapshot of the first one.
Besides, I think that doing a rsync from remote server and snapshotting the
subvolume dedicated for server will be the standard approach.
Regards,
--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-17 12:05 Online Deduplication for Btrfs (Master's thesis) Martin Křížek
2012-12-17 13:12 ` Hubert Kario
@ 2012-12-17 13:33 ` Alexander Block
2012-12-18 1:31 ` Chris Mason
2012-12-19 17:40 ` Martin Křížek
1 sibling, 2 replies; 8+ messages in thread
From: Alexander Block @ 2012-12-17 13:33 UTC (permalink / raw)
To: Martin Křížek; +Cc: linux-btrfs@vger.kernel.org, lczerner
I did some research on deduplication in the past and there are some
problems that you will face. I'll try to list some of them (for sure
not all).
On Mon, Dec 17, 2012 at 1:05 PM, Martin Křížek <martin.krizek@gmail.com> wrote:
> Hello everyone,
>
> my name is Martin Krizek. I am a student at Faculty of Information, Brno
> University of Technology, Czech Republic. As my master's thesis I chose to work
> on Online Deduplication for Btrfs.
>
> My goal is to study Btrfs design, the offline deduplication patch [1] and to
> come up with a design for the online dedup, this semester. I will be
> implementing
> the feature next semester (spring, that is).
The offline dedup patch is quite old and won't apply/compile anymore.
You should probably look into the clone ioctl which basically does the
same as the extent_same ioctl from the patch. Based on the clone ioctl
you can at least learn how to "inject" existing extents into other
inodes.
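As a hedged illustration, cloning (reflinking) a byte range from userspace
could look roughly like this; the struct layout and ioctl number are written
down from memory, so check them against the btrfs ioctl headers before
relying on them:

#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct btrfs_ioctl_clone_range_args {
        int64_t  src_fd;
        uint64_t src_offset;
        uint64_t src_length;
        uint64_t dest_offset;
};

#define BTRFS_IOCTL_MAGIC 0x94
#define BTRFS_IOC_CLONE_RANGE \
        _IOW(BTRFS_IOCTL_MAGIC, 13, struct btrfs_ioctl_clone_range_args)

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
                return 1;
        }
        int src = open(argv[1], O_RDONLY);
        int dst = open(argv[2], O_WRONLY);
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }
        /* Reflink the first 1 MiB of src into dst at offset 0; offsets and
         * lengths have to be block aligned. */
        struct btrfs_ioctl_clone_range_args args = {
                .src_fd = src, .src_offset = 0,
                .src_length = 1 << 20, .dest_offset = 0,
        };
        if (ioctl(dst, BTRFS_IOC_CLONE_RANGE, &args) < 0)
                perror("BTRFS_IOC_CLONE_RANGE");
        close(src);
        close(dst);
        return 0;
}

Online dedup would do the in-kernel equivalent of this: point the
destination inode's extent items at the already existing extent instead of
writing new data.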
>
> I would like to shortly introduce my ideas on what I think the feature
> should look
> like.
>
> * Use cases
> The two main use cases for the dedup are:
> 1. virtual machine images
> 2. backup servers
>
> * When to dedup
> The choice of 'when to dedup' is not up to me as the master's thesis
> specification
> states "online". :)
>
> * What to dedup
> I'd say the most reasonable is block-level deduplication as it seems the most
> general approach to me.
Here you have two options:
1. Based on the per-volume tree, where you'll find btrfs_file_extent
items which point to the global extent tree.
2. Based on the global extent tree. By using the backref resolving
code, you can find which volumes refer to these extents.
>
> * Controlling dedup
> - turn on/off deduplication - specify subvolumes on which
> deduplication is turned on
> (mount, ioctl - inherited),
> - turn on/off byte-by-byte comparison of blocks that have same hashes
> (mount, ioctl),
> - deduplication statistics (ioctl)
You'll get into trouble when online dedup is turned on and off again. While
it is offline, extents still get written, but you won't have your hash
tree up-to-date. You'll need to find a way to update it when dedup is
online again, without too much performance loss while updating.
>
> * Limitations
> Not really limitations, but this is a list of situations when dedup will not
> be triggered:
> - encryption,
I've already heard somewhere else that encryption+dedup is not
possible but I don't understand why. Can someone explain this
limitation?
> - compression - basically, dedup is a kind of compression; might be worth looking
> into it in the future though
> - inline/prealloc extents,
Should be possible to dedup inline extents, but must be configurable
(e.g. minimum block size). People should also be able to completely
disable it when performance is important.
> - data across subvolumes
Should be possible. See my comment on the key in the hash tree.
>
> * How to store hashes
> The obvious choice would be to use the checksum tree that holds block checksums
> of each extent. The problem with the checksum tree is that the
> checksums are looked up by logical address for the start of the extent data.
> This is undesirable since it needs to be done the other way around. Logical
> addresses need to be looked up by a hash.
> To solve this, another key type would be created inside the checksum tree (or
> maybe better, a new tree would be introduced) that
> would have a hash as item's right-hand key value. This way, items could be
> looked up on a hash:
> (root, HASH_ITEM, hash)
> The root value says which root (subvolume) the hashed block is stored on. The
> hash value is the hash itself.
With the root inside the key you make it impossible to allow
cross-subvolume deduplication. Also, the offset field in the key that
you plan to use for the hash is only 64 bits, so you can at best store
a part of the hash in the key. You should probably split the hash into
3 parts: 64 bits to put into the objectid field, 64 bits to put into
the offset field, and the remainder into the item data. A lookup would
then do the necessary splitting and, in case a match is found, also
compare the remainder found in the item's data.
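A small userspace sketch of that split (the byte order and the exact item
layout are just one possible choice):

#include <stdint.h>
#include <string.h>

struct dedup_key {                  /* maps onto a btrfs_key */
        uint64_t objectid;          /* hash bytes  0..7  */
        uint8_t  type;              /* the proposed HASH_ITEM type */
        uint64_t offset;            /* hash bytes  8..15 */
};

struct dedup_item {                 /* item data */
        uint8_t  hash_tail[16];     /* hash bytes 16..31 */
        uint64_t disk_bytenr;       /* where the candidate block lives */
};

static uint64_t be64(const uint8_t *p)
{
        uint64_t v = 0;
        for (int i = 0; i < 8; i++)
                v = (v << 8) | p[i];
        return v;
}

void split_sha256(const uint8_t digest[32],
                  struct dedup_key *key, struct dedup_item *item)
{
        key->objectid = be64(digest);
        key->offset   = be64(digest + 8);
        memcpy(item->hash_tail, digest + 16, 16);
}

/*
 * Lookup: search the tree for (objectid, type, offset) built from the first
 * 16 bytes of the digest; on a hit, compare hash_tail with digest[16..31]
 * before trusting the match.
 */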
> The item data would be of the following structure:
> struct btrfs_hash_item {
> __le64 disk_bytenr;
> __le64 disk_num_bytes;
> __le64 offset;
> }
You could omit the offset here and store the sum of disk_bytenr
and offset in one field. You could also omit the disk_num_bytes field
if you have a fixed dedup block length. The reason: no matter what you
store here, you never know if the referred extent still exists
and still contains the same data. So when you find a possible
duplicate, a lookup into the extent tree is required to make sure it
is still valid. Probably the extent's generation can help here.
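A reduced item along those lines might then look like this (sketch only; the
generation field is an assumption, meant for the staleness check):

#include <stdint.h>

struct dedup_hash_item {
        uint8_t  hash_tail[16];     /* part of the hash not held in the key */
        uint64_t bytenr;            /* disk_bytenr + offset of the hashed block */
        uint64_t generation;        /* assumed: lets a stale entry be spotted
                                       before the extent tree lookup */
};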
> Fields disk_bytenr and disk_num_bytes would point to an
> extent item of the extent allocation tree. The offset is the offset of the hashed
> block in the extent.
>
> * When to trigger dedup exactly
> So it might be appropriate to delay the process until just before data are
> written out to disk, to give the data a chance to be changed. I looked a bit into
> the compression code and it seems like dedup could use a similar approach -
> to have threads that would do dedup on sync or general writeback.
It's important that you avoid unnecessary writes of duplicated blocks
in case you do delayed deduplication. See my final comment for
details.
>
> * Hash algorithm
> sha256 seems like a safe choice.
Doesn't fit into the dedup tree's key. See my above comments.
>
> * Future improvements
> There are some features that might be considered if time permits or as something
> that could be implemented in the future:
> - dedup across subvolumes
> - specify a hash algorithm
> - specify number of deduplication blocks
>
>
> I would appreciate any comments you might have, thanks!
>
> Best regards,
> Martin Krizek
>
>
> [1] http://lwn.net/Articles/422331/
After my research I came to the conclusion that a real online
deduplication feature is not worth the trouble. The only advantage of
online deduplication compared to offline deduplication is that you can
avoid unnecessary writes for duplicate data and thus increase
performance in theory. In practice, I don't know if you really gain
much performance, because for small writes of duplicated data you
still have to write out metadata. In all cases, at least the inode's
extent items need to be written, no matter if they point to new data or
duplicate data. The performance gain will only be measurable for large
duplicated blocks; in all other cases you'll lose performance due to
the overhead of deduplication.
Storing the hashes inside a btrfs tree is also a bad idea. The hashes
are randomly distributed and thus have completely randomly ordered
keys. Lookup and insertion will be very slow when there is not enough
RAM to hold the whole hash tree in memory... which is very likely to
happen. You should think about an alternative, e.g. storing the hash
table inside a dedicated inode (in a format that is more appropriate
for hash tables). But even with this solution, you'll run out of RAM
very fast. You can use Bloom filters or some comparable probabilistic
data structure to avoid most of the lookups, but then you still have a
lot of insertions because no matter if a duplicate is written or not,
you need to update the hash table for later lookups.
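For reference, a minimal userspace Bloom filter sketch (the size and the
mixing function are arbitrary, untuned choices):

#include <stdint.h>
#include <stddef.h>

#define BLOOM_BITS   (1u << 23)     /* 8 Mbit = 1 MiB of bits */
#define BLOOM_HASHES 4

struct bloom {
        uint8_t bits[BLOOM_BITS / 8];
};

/* Mix the key bytes into the i-th bit position (FNV-1a, salted per index);
 * any decent hash would do here. */
static uint32_t bloom_pos(const void *key, size_t len, int i)
{
        const uint8_t *p = key;
        uint32_t h = 2166136261u ^ (uint32_t)i;

        for (size_t j = 0; j < len; j++) {
                h ^= p[j];
                h *= 16777619u;
        }
        return h % BLOOM_BITS;
}

void bloom_add(struct bloom *b, const void *key, size_t len)
{
        for (int i = 0; i < BLOOM_HASHES; i++) {
                uint32_t pos = bloom_pos(key, len, i);
                b->bits[pos / 8] |= 1u << (pos % 8);
        }
}

int bloom_maybe_contains(const struct bloom *b, const void *key, size_t len)
{
        for (int i = 0; i < BLOOM_HASHES; i++) {
                uint32_t pos = bloom_pos(key, len, i);
                if (!(b->bits[pos / 8] & (1u << (pos % 8))))
                        return 0;   /* definitely never added */
        }
        return 1;                   /* probably added before */
}

It answers "probably seen before" with a tunable false positive rate and
never gives a false negative, which is exactly what is needed to skip most
of the expensive lookups.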
Even if you find a good and fast solution for storing the hashes,
you'll get in trouble when data gets overwritten or deleted. Then you
have old hash values in the hash tree/table. This means, for each
lookup you'll also need to check if the found extent is still there
and unchanged. As an alternative, you could update the hash table
on-the-fly when data gets overwritten or deleted...but this again
means some additional lookups and writes and finally performance loss.
Also, when dedup is enabled for the first time (or after it was
disabled for some time), you have no hash table at all (or only a
partial out-of-date hash table)...what to do in this case?
Next problem is...the backref resolving code gets very slow when there
are too many references to the same block. This will slow down
everything that depends on backref resolving, e.g. scrubbing (in case
of errors), send and deduplication itself. So you'll need to limit the
number of allowed references to each block, which may give poor
deduplication results. Probably the backref resolving code can be
optimized...I tried it but with limited results.
Then you have fragmentation...which maybe can be ignored in some cases
(e.g. for SSDs).
In my opinion, the only acceptable deduplication solution would be
either an offline or semi-online solution which basically does
periodic offline deduplication. I'm a big fan of doing this in-kernel,
because then it can use a lot of information already present. For
example, the checksums that are already present in the checksum tree
can be used as a first-phase hint. Some time ago I had an in-kernel
prototype running that used this approach (roughly sketched in code
after the steps below):
1. Run through the checksum tree and do the following for each checksum:
   a. Check if we (probably) have the checksum in bloom_filter1.
   b. If yes, add the checksum to bloom_filter2.
   c. In any case, add the checksum to bloom_filter1.
2. Run through the checksum tree a second time and do the following
   for each checksum:
   a. Check if we (probably) have the checksum in bloom_filter2.
   b. If yes, put the extent info (sha256, disk_bytenr, ...) into a
      list with a defined maximum size. If the list gets full, sort it
      by hash and write out the previously collected extents to a
      dedicated inode. Then start with a fresh empty list.
3. We now have a dedicated file with multiple sorted blocks. Each
   sorted block contains a list of extents. We can now do a very fast
   merge-sort-like read and thus get one large stream of hash-sorted
   extents. While reading in the sorted extents, we check if the
   previous extent has the same hash as the current one. If yes, we
   have a duplicate.
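A rough userspace sketch of the two passes (an in-memory array stands in
for the checksum tree walk and for the sorted chunks written to the
dedicated inode; bloom_add()/bloom_maybe_contains() are the helpers from
the Bloom filter sketch above):

#include <stdint.h>
#include <stdlib.h>

struct bloom;                       /* from the Bloom filter sketch above */
void bloom_add(struct bloom *b, const void *key, size_t len);
int  bloom_maybe_contains(const struct bloom *b, const void *key, size_t len);

struct extent_rec {
        uint32_t csum;              /* checksum-tree csum of the block */
        uint64_t disk_bytenr;       /* where the block lives on disk */
};

static int by_csum(const void *a, const void *b)
{
        uint32_t ca = ((const struct extent_rec *)a)->csum;
        uint32_t cb = ((const struct extent_rec *)b)->csum;

        return ca < cb ? -1 : ca > cb;
}

/*
 * Pass 1 marks csums seen at least twice in bloom2; pass 2 collects the
 * records whose csum is (probably) duplicated. After the sort, equal
 * neighbours are duplicate candidates that still need a stronger hash
 * (e.g. sha256) and/or a byte-by-byte compare before extents are shared.
 */
size_t collect_candidates(struct bloom *b1, struct bloom *b2,
                          const struct extent_rec *all, size_t n,
                          struct extent_rec *out)
{
        size_t m = 0;

        for (size_t i = 0; i < n; i++) {
                if (bloom_maybe_contains(b1, &all[i].csum, sizeof(uint32_t)))
                        bloom_add(b2, &all[i].csum, sizeof(uint32_t));
                bloom_add(b1, &all[i].csum, sizeof(uint32_t));
        }

        for (size_t i = 0; i < n; i++)
                if (bloom_maybe_contains(b2, &all[i].csum, sizeof(uint32_t)))
                        out[m++] = all[i];

        qsort(out, m, sizeof(*out), by_csum);
        return m;
}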
I stopped working on that due to lack of time. There are still a lot
of problems to solve with this solution which I won't go into
now. The main advantages of this solution are low memory consumption
and the time mostly depending on the number of duplicates instead of
the whole filesystem size, as there is no need to hash the whole
filesystem. Also, you don't have a permanent large table on disk, as
you can delete the dedicated inodes after you're finished.
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-17 13:33 ` Alexander Block
@ 2012-12-18 1:31 ` Chris Mason
2013-01-07 17:27 ` Martin Křížek
2012-12-19 17:40 ` Martin Křížek
1 sibling, 1 reply; 8+ messages in thread
From: Chris Mason @ 2012-12-18 1:31 UTC (permalink / raw)
To: Alexander Block
Cc: Martin Křížek, linux-btrfs@vger.kernel.org,
lczerner@redhat.com
On Mon, Dec 17, 2012 at 06:33:24AM -0700, Alexander Block wrote:
> I did some research on deduplication in the past and there are some
> problems that you will face. I'll try to list some of them (for sure
> not all).
Thanks Alexander for writing all of this up. There are a lot of great
points here, but I'll summarize with:
[ many challenges to online dedup ]
[ offline dedup is the best way ]
So, the big problem with offline dedup is you're suddenly read bound. I
don't disagree that offline makes a lot of the dedup problems easier,
and Alexander describes a very interesting system here.
I've tried to avoid features that rely on scanning though, just because
idle disk time may not really exist. But with scrub, we have the scan
as a feature, and it may make a lot of sense to leverage that.
Online dedup has a different set of tradeoffs, but as Alexander says, the
hard part really is the data structure to index the hashes. I think
there are a few different options here, including changing the file
extent pointers to point to a sha instead of a logical disk offset.
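Purely as a thought experiment (nothing like this exists in btrfs and the
names are invented), such a content-addressed extent pointer might carry
the hash instead of a disk bytenr, with a separate hash-to-location index
resolving the indirection:

#include <stdint.h>

struct content_addressed_extent {   /* hypothetical, for discussion only */
        uint8_t  csum[32];          /* e.g. sha256 of the extent data */
        uint64_t num_bytes;         /* logical length */
        uint64_t offset;            /* offset into the referenced data */
};

/* A write would then hash the data, look the hash up in a hash->bytenr
 * index, and either reference the existing copy or allocate, write and
 * insert it. */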
So, part of my answer really depends on where you want to go with your
thesis. I expect the data structure work for efficient hash lookup is
going to be closer to what your course work requires?
-chris
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-17 13:12 ` Hubert Kario
@ 2012-12-19 16:58 ` Martin Křížek
0 siblings, 0 replies; 8+ messages in thread
From: Martin Křížek @ 2012-12-19 16:58 UTC (permalink / raw)
To: Hubert Kario; +Cc: linux-btrfs, lczerner
On Mon, Dec 17, 2012 at 2:12 PM, Hubert Kario <hka@qbs.com.pl> wrote:
> On Monday 17 of December 2012 13:05:01 Martin Křížek wrote:
>> * Limitations
>> Not really limitations, but this is a list of situations when dedup will
>> not be triggered:
>> - compression - basically, dedup is a kind of compression; might be worth looking
>> into it in the future though
>
> I don't see why it would be incompatible, compressed blocks are data like
> any other. COW and subvolume snapshots work with compressed nodes just as
> well as with regular ones...
>
I agree; I am not saying it's incompatible, just that I probably won't
deal with it unless time permits.
Thanks,
Martin
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-17 13:33 ` Alexander Block
2012-12-18 1:31 ` Chris Mason
@ 2012-12-19 17:40 ` Martin Křížek
1 sibling, 0 replies; 8+ messages in thread
From: Martin Křížek @ 2012-12-19 17:40 UTC (permalink / raw)
To: Alexander Block; +Cc: linux-btrfs@vger.kernel.org, lczerner
On Mon, Dec 17, 2012 at 2:33 PM, Alexander Block <ablock84@gmail.com> wrote:
> I did some research on deduplication in the past and there are some
> problems that you will face. I'll try to list some of them (for sure
> not all).
>
> On Mon, Dec 17, 2012 at 1:05 PM, Martin Křížek <martin.krizek@gmail.com> wrote:
>> Hello everyone,
>>
>> my name is Martin Krizek. I am a student at Faculty of Information, Brno
>> University of Technology, Czech Republic. As my master's thesis I chose to work
>> on Online Deduplication for Btrfs.
>>
>> My goal is to study Btrfs design, the offline deduplication patch [1] and to
>> come up with a design for the online dedup, this semester. I will be
>> implementing
>> the feature next semester (spring, that is).
> The offline dedup patch is quite old and won't apply/compile anymore.
> You should probably look into the clone ioctl which basically does the
> same as the extent_same ioctl from the patch. Based on the clone ioctl
> you can at least learn how to "inject" existing extents into other
> inodes.
I am aware of this, thanks for pointing that out though.
>>
>> I would like to shortly introduce my ideas on what I think the feature
>> should look
>> like.
>>
>> * Use cases
>> The two main use cases for the dedup are:
>> 1. virtual machine images
>> 2. backup servers
>>
>> * When to dedup
>> The choice of 'when to dedup' is not up to me as the master's thesis
>> specification
>> states "online". :)
>>
>> * What to dedup
>> I'd say the most reasonable is block-level deduplication as it seems the most
>> general approach to me.
> Here you have two options:
> 1. Based on the per volume tree where you'll find btrfs_file_extent
> items which point to the global extent tree.
> 2. Based on the global extent tree. By using the backref resolving
> code, you can find which volumes refer to these extents.
Agreed.
>>
>> * Controlling dedup
>> - turn on/off deduplication - specify subvolumes on which
>> deduplication is turned on
>> (mount, ioctl - inherited),
>> - turn on/off byte-by-byte comparison of blocks that have same hashes
>> (mount, ioctl),
>> - deduplication statistics (ioctl)
> You'll get trouble when online dedup is turned on and off again. While
> it is offline, extents still get written, but you won't have your hash
> tree up-to-date. You'll need to find a way to update when dedup is
> online again, without too much performance loss while updating.
Well, yes.
>>
>> * Limitations
>> Not really limitations, but this is a list of situations when dedup will not
>> be triggered:
>> - encryption,
> I've already heard somewhere else that encryption+dedup is not
> possible but I don't understand why. Can someone explain this
> limitation?
>> - compression - basically, dedup is a kind of compression; might be worth looking
>> into it in the future though
>> - inline/prealloc extents,
> Should be possible to dedup inline extents, but must be configurable
> (e.g. minimum block size). People should also be able to completely
> disable it when performance is important.
I agree that it's possible, but I don't think it's worth doing. You
won't save much storage that way.
>> - data across subvolumes
> Should be possible. See my comment on the key in the hash tree.
Again, I agree that it's possible but this would be something I
probably won't go into.
>>
>> * How to store hashes
>> The obvious choice would be to use the checksum tree that holds block checksums
>> of each extent. The problem with the checksum tree is that the
>> checksums are looked up by logical address for the start of the extent data.
>> This is undesirable since it needs to be done the other way around. Logical
>> addresses need to be looked up by a hash.
>> To solve this, another key type would be created inside the checksum tree (or
>> maybe better, a new tree would be introduced) that
>> would have a hash as item's right-hand key value. This way, items could be
>> looked up on a hash:
>> (root, HASH_ITEM, hash)
>> The root value says which root (subvolume) the hashed block is stored on. The
>> hash value is the hash itself.
> With the root inside the key you make it impossible to allow
> cross-subvolume deduplication. Also, the offset field in the key that
> you plan to use for the hash is only 64bit, so you can at best store a
> part of the hash in the key. You should probably split the hash into 3
> parts: 64bit to put into the objectid field, 64bit to put into the
> offset field and the remainder into the item data. A lookup would then
> do the necessary splitting and in case a match is found also compare
> the remainder found in the items data.
Right, I thought of this but somehow forgot to mention it. Yes, the
hash would need to be split if it does not fit into the offset. Thanks
for noticing.
>> The item data would be of the following structure:
>> struct btrfs_hash_item {
>> __le64 disk_bytenr;
>> __le64 disk_num_bytes;
>> __le64 offset;
>> }
> You could omit the offset here and store the sum of disk_bytenr
> and offset in one field. You could also omit the disk_num_bytes field
> if you have a fixed dedup block length. The reason: No matter what you
> store here, you never know if the referred extent is still existent
> and still contains the same data. So when you found a possible
> duplicate, a lookup into the extent tree is required to make sure it
> is still valid. Probably the extents generation can help here.
Good points.
>> Fields disk_bytenr and disk_num_bytes would point to an
>> extent item of the extent allocation tree. The offset is the offset of the hashed
>> block in the extent.
>>
>> * When to trigger dedup exactly
>> So it might be appropriate to delay the process until just before data are
>> written out to disk, to give the data a chance to be changed. I looked a bit into
>> the compression code and it seems like dedup could use a similar approach -
>> to have threads that would do dedup on sync or general writeback.
> It's important that you avoid unnecessary writes of duplicated blocks
> in case you do delayed deduplication. See my final comment for
> details.
>>
>> * Hash algorithm
>> sha256 seems like a safe choice.
> Doesn't fit into the dedup tree's key. See my above comments.
>>
>> * Future improvements
>> There are some features that might be considered if time permits or as something
>> that could be implemented in the future:
>> - dedup across subvolumes
>> - specify a hash algorithm
>> - specify number of deduplication blocks
>>
>>
>> I would appreciate any comments you might have, thanks!
>>
>> Best regards,
>> Martin Krizek
>>
>>
>> [1] http://lwn.net/Articles/422331/
>
> After my research I came to the conclusion that a real online
> deduplication feature is not worth the trouble. The only advantage of
> online deduplication compared to offline deduplication is that you can
> avoid unnecessary writes for duplicate data and thus increase
> performance in theory. In practice, I don't know if you really gain
> much performance, because for small writes of duplicated data you
> still have to write out metadata. In all cases, at least the inode's
> extent items need to be written, no matter if they point to new data or
> duplicate data. The performance gain will only be measurable for large
> duplicated blocks; in all other cases you'll lose performance due to
> the overhead of deduplication.
>
> Storing the hashes inside a btrfs tree is also a bad idea. The hashes
> are randomly distributed and thus have completely randomly ordered
> keys. Lookup and insertion will be very slow when there is not enough
> ram to hold the whole hash tree in memory...which is very likely to
> happen. You should think about an alternative, e.g. storing the hash
> table inside a dedicated inode (in a format that is more appropriate
> for hash tables). But even with this solution, you'll get out of ram
> very fast. You can use bloom filters or some comparable probabilistic
> data structure to avoid most of the lookups, but then you still have a
> lot of insertions because no matter if a duplicate is written or not,
> you need to update the hash table for later lookups.
>
> Even if you find a good and fast solution for storing the hashes,
> you'll get in trouble when data gets overwritten or deleted. Then you
> have old hash values in the hash tree/table. This means, for each
> lookup you'll also need to check if the found extent is still there
> and unchanged. As an alternative, you could update the hash table
> on-the-fly when data gets overwritten or deleted...but this again
> means some additional lookups and writes and finally performance loss.
> Also, when dedup is enabled for the first time (or after it was
> disabled for some time), you have no hash table at all (or only a
> partial out-of-date hash table)...what to do in this case?
>
> Next problem is...the backref resolving code gets very slow when there
> are too many references to the same block. This will slow down
> everything that depends on backref resolving, e.g. scrubbing (in case
> of errors), send and deduplication itself. So you'll need to limit the
> number of allowed references to each block, which may give poor
> deduplication results. Probably the backref resolving code can be
> optimized...I tried it but with limited results.
>
> Then you have fragmentation...which maybe can be ignored in some cases
> (e.g. for SSD's).
>
> In my opinion, the only acceptable deduplication solution would be
> either an offline or semi online solution which basically does
> periodic offline deduplication. I'm a big fan of doing this in-kernel,
> because then it can use a lot of information already present. For
> example, the checksums that are already present in the checksum tree
> can be used as a first-phase hint. Some time ago I had an in-kernel
> prototype running that used this approach:
>
> 1. Run through the checksum tree and do the following for each checksum:
> a. Check if we (probably) have the checksum in bloom_filter1.
> b. If yes, add the checksum to bloom_filter2
> c. In any case, add the checksum to bloom_filter1
> 2. Run through the checksum tree a second time and do the following
> for each checksum:
> a. Check if we (probably) have the checksum in bloom_filter2.
> b. If yes, put the extent info (sha256, disk_bytenr, ...) into a
> list with a defined maximum size. If the list gets full, sort by hash
> and write out the previously collected extents to a dedicated inode.
> Then start with a fresh empty list.
> 3. We now have a dedicated file with multiple sorted blocks. Each
> sorted block contains a list of extents. We can now do a very fast
> merge sort like read and thus get a large stream of hash sorted
> extents. While reading in the sorted extents, we can check if the
> previous extent has the same hash as the current. If yes, we have a
> duplicate.
>
I am not really following this as I am not familiar with Bloom
filters; I will need to get back to it later. Thanks for sharing.
> I stopped working on that due to missing time. There are still a lot
> of problems to solve with this solution which I won't go into detail
> now. The main advantages of this solution are low memory consumption
> and time mostly being dependent on the number of duplicates instead of
> the whole filesystem size as there is no need to hash the whole
> filesystem. Also, you don't have a permanent large table on-disk, as
> you can delete the dedicated inodes after you're finished.
Thanks for your comments!
Martin
* Re: Online Deduplication for Btrfs (Master's thesis)
2012-12-18 1:31 ` Chris Mason
@ 2013-01-07 17:27 ` Martin Křížek
2013-03-17 22:57 ` Martin Křížek
0 siblings, 1 reply; 8+ messages in thread
From: Martin Křížek @ 2013-01-07 17:27 UTC (permalink / raw)
To: Chris Mason, Alexander Block, Martin Křížek,
linux-btrfs@vger.kernel.org, lczerner@redhat.com
On Tue, Dec 18, 2012 at 2:31 AM, Chris Mason <chris.mason@fusionio.com> wrote:
> On Mon, Dec 17, 2012 at 06:33:24AM -0700, Alexander Block wrote:
>> I did some research on deduplication in the past and there are some
>> problems that you will face. I'll try to list some of them (for sure
>> not all).
>
> Thanks Alexander for writing all of this up. There are a lot of great
> points here, but I'll summarize with:
>
> [ many challenges to online dedup ]
>
> [ offline dedup is the best way ]
>
> So, the big problem with offline dedup is you're suddenly read bound. I
> don't disagree that offline makes a lot of the dedup problems easier,
> and Alexander describes a very interesting system here.
>
> I've tried to avoid features that rely on scanning though, just because
> idle disk time may not really exist. But with scrub, we have the scan
> as a feature, and it may make a lot of sense to leverage that.
>
> online dedup has a different set of tradeoffs, but as Alexander says the
> hard part really is the data structure to index the hashes. I think
> there are a few different options here, including changing the file
> extent pointers to point to a sha instead of a logical disk offset.
>
I am not sure I am following the approach of replacing the
pointers - could you please explain?
I agree that the data structure is the key part in designing the feature.
> So, part of my answer really depends on where you want to go with your
> thesis. I expect the data structure work for efficient hash lookup is
> going to be closer to what your course work requires?
No, I don't think it is really necessary that the data structure
should be close to my course work. What is your opinion on the most
suitable data structure to index the hashes?
Thanks!
Martin
* Re: Online Deduplication for Btrfs (Master's thesis)
2013-01-07 17:27 ` Martin Křížek
@ 2013-03-17 22:57 ` Martin Křížek
0 siblings, 0 replies; 8+ messages in thread
From: Martin Křížek @ 2013-03-17 22:57 UTC (permalink / raw)
To: Chris Mason, Alexander Block, Martin Křížek,
linux-btrfs@vger.kernel.org, lczerner@redhat.com, jbacik
Hi,
I am looking for the place (source-code-wise) where the actual
online/immediate dedup would happen:
-- splitting an extent about to be written into new extents that are
unique (and would actually be written) and into blocks of the original
extent that would be thrown away (duplicates), with a file_extent_item
pointing to the matching portion of an existing extent created instead.
Could anyone please give me some pointers to that place and/or
functions that could be used for that? Would delalloc time be
appropriate?
Thanks!
Martin