All of lore.kernel.org
 help / color / mirror / Atom feed
* de-duplication algos
@ 2015-05-13 16:24 Learner Study
  2015-05-13 16:35 ` Hugo Mills
  0 siblings, 1 reply; 5+ messages in thread
From: Learner Study @ 2015-05-13 16:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Learner Study

Hello,

I have been reading on de-duplication and how algorithms such as Bloom
and Cuckoo filters are used for this purpose.

Does BTRFS dedup use any of these, or are there plans to incorporate
these in future?

Thanks for your guidance!

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: de-duplication algos
  2015-05-13 16:24 de-duplication algos Learner Study
@ 2015-05-13 16:35 ` Hugo Mills
  2015-05-13 16:48   ` David Sterba
  0 siblings, 1 reply; 5+ messages in thread
From: Hugo Mills @ 2015-05-13 16:35 UTC (permalink / raw)
  To: Learner Study; +Cc: linux-btrfs, Mark Fasheh

[-- Attachment #1: Type: text/plain, Size: 1507 bytes --]

On Wed, May 13, 2015 at 09:24:17AM -0700, Learner Study wrote:
> Hello,
> 
> I have been reading on de-duplication and how algorithms such as Bloom
> and Cuckoo filters are used for this purpose.
> 
> Does BTRFS dedup use any of these, or are there plans to incorporate
> these in future?

   There was a long discussion on IRC about different approaches that
could be taken. I think Mark Fasheh captured most of that somewhere --
I thought he'd put it on the duperemove github site somewhere, but I
can't see it right now.

   One outcome of the discussion is that a probabilistic set
implementation is potentially useful, but there's still a lot of work
that needs to be done around those core algorithms to make a useful
deduplicator. A related outcome was that there's a lot of different
approaches that are possible, which can optimise for RAM, storage
space (both at runtime, and in the resulting deduplicated FS), or
execution time in different ways. I think we ended up with about 7 or
8 different possible algorithms by the end of it, even before looking
at the implementation details of which probabilistic set algorithm to
pick.

   If Mark doesn't have those notes any more, I can try to dig out the
original IRC discussion.

   Hugo.

-- 
Hugo Mills             | My code is never released, it escapes from the git
hugo@... carfax.org.uk | repo and kills a few beta testers on the way out.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                             Diablo-D3

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: de-duplication algos
  2015-05-13 16:35 ` Hugo Mills
@ 2015-05-13 16:48   ` David Sterba
  2015-05-14  1:08     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: David Sterba @ 2015-05-13 16:48 UTC (permalink / raw)
  To: Hugo Mills, Learner Study, linux-btrfs, Mark Fasheh

On Wed, May 13, 2015 at 04:35:53PM +0000, Hugo Mills wrote:
> On Wed, May 13, 2015 at 09:24:17AM -0700, Learner Study wrote:
> > Hello,
> > 
> > I have been reading on de-duplication and how algorithms such as Bloom
> > and Cuckoo filters are used for this purpose.
> > 
> > Does BTRFS dedup use any of these, or are there plans to incorporate
> > these in future?
> 
>    There was a long discussion on IRC about different approaches that
> could be taken. I think Mark Fasheh captured most of that somewhere --
> I thought he'd put it on the duperemove github site somewhere, but I
> can't see it right now.

The bloom filter for duperemove has been implemented (as of commit
b7c03422ea9fd11f915804df2b6598a6ed10dfce) and works fine, the memory
footprint is much lower than before.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: de-duplication algos
  2015-05-13 16:48   ` David Sterba
@ 2015-05-14  1:08     ` Qu Wenruo
  2015-05-14  2:10       ` Timofey Titovets
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2015-05-14  1:08 UTC (permalink / raw)
  To: dsterba, Hugo Mills, Learner Study, linux-btrfs, Mark Fasheh



-------- Original Message  --------
Subject: Re: de-duplication algos
From: David Sterba <dsterba@suse.cz>
To: Hugo Mills <hugo@carfax.org.uk>, Learner Study 
<learner.study@gmail.com>, linux-btrfs <linux-btrfs@vger.kernel.org>, 
Mark Fasheh <mfasheh@suse.de>
Date: 2015年05月14日 00:48

> On Wed, May 13, 2015 at 04:35:53PM +0000, Hugo Mills wrote:
>> On Wed, May 13, 2015 at 09:24:17AM -0700, Learner Study wrote:
>>> Hello,
>>>
>>> I have been reading on de-duplication and how algorithms such as Bloom
>>> and Cuckoo filters are used for this purpose.
>>>
>>> Does BTRFS dedup use any of these, or are there plans to incorporate
>>> these in future?
>>
>>     There was a long discussion on IRC about different approaches that
>> could be taken. I think Mark Fasheh captured most of that somewhere --
>> I thought he'd put it on the duperemove github site somewhere, but I
>> can't see it right now.
>
> The bloom filter for duperemove has been implemented (as of commit
> b7c03422ea9fd11f915804df2b6598a6ed10dfce) and works fine, the memory
> footprint is much lower than before.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

Zhao Lei and I are also trying to implement *in-band* de-duplication.
Our idea is to implement a memory pool to keep a csum<->extent map to do 
*PARTIAL* dedup.
As we consider de-duplication doesn't need to de-dup 100% of 
duplications, it's just a nice addition but not a fundamental function.

The memory pool bahaviors as last-recent-use, and user can adjust how 
big the memory pool is. (Yeah, put the dirty work to user)

Bloom filter seems quite interesting, but it also seems hard to remove 
items from them, so also hard to limit memory usage in kernel.
Since I'm not familiar with algorithms like Bloom filter, any advice on 
such algorithms available is welcomed.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: de-duplication algos
  2015-05-14  1:08     ` Qu Wenruo
@ 2015-05-14  2:10       ` Timofey Titovets
  0 siblings, 0 replies; 5+ messages in thread
From: Timofey Titovets @ 2015-05-14  2:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, Hugo Mills, Learner Study, linux-btrfs, Mark Fasheh

2015-05-14 4:08 GMT+03:00 Qu Wenruo <quwenruo@cn.fujitsu.com>:
>
>
> -------- Original Message  --------
> Subject: Re: de-duplication algos
> From: David Sterba <dsterba@suse.cz>
> To: Hugo Mills <hugo@carfax.org.uk>, Learner Study
> <learner.study@gmail.com>, linux-btrfs <linux-btrfs@vger.kernel.org>, Mark
> Fasheh <mfasheh@suse.de>
> Date: 2015年05月14日 00:48
>
>> On Wed, May 13, 2015 at 04:35:53PM +0000, Hugo Mills wrote:
>>>
>>> On Wed, May 13, 2015 at 09:24:17AM -0700, Learner Study wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have been reading on de-duplication and how algorithms such as Bloom
>>>> and Cuckoo filters are used for this purpose.
>>>>
>>>> Does BTRFS dedup use any of these, or are there plans to incorporate
>>>> these in future?
>>>
>>>
>>>     There was a long discussion on IRC about different approaches that
>>> could be taken. I think Mark Fasheh captured most of that somewhere --
>>> I thought he'd put it on the duperemove github site somewhere, but I
>>> can't see it right now.
>>
>>
>> The bloom filter for duperemove has been implemented (as of commit
>> b7c03422ea9fd11f915804df2b6598a6ed10dfce) and works fine, the memory
>> footprint is much lower than before.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Zhao Lei and I are also trying to implement *in-band* de-duplication.
> Our idea is to implement a memory pool to keep a csum<->extent map to do
> *PARTIAL* dedup.
> As we consider de-duplication doesn't need to de-dup 100% of duplications,
> it's just a nice addition but not a fundamental function.
>
> The memory pool bahaviors as last-recent-use, and user can adjust how big
> the memory pool is. (Yeah, put the dirty work to user)
>
> Bloom filter seems quite interesting, but it also seems hard to remove items
> from them, so also hard to limit memory usage in kernel.
> Since I'm not familiar with algorithms like Bloom filter, any advice on such
> algorithms available is welcomed.

cuckoo?
https://github.com/efficient/cuckoofilter

> Thanks,
> Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Have a nice day,
Timofey.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-05-14  2:11 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-13 16:24 de-duplication algos Learner Study
2015-05-13 16:35 ` Hugo Mills
2015-05-13 16:48   ` David Sterba
2015-05-14  1:08     ` Qu Wenruo
2015-05-14  2:10       ` Timofey Titovets

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.