linux-btrfs.vger.kernel.org archive mirror
* About per-file dedup flag
@ 2016-01-12  3:09 Qu Wenruo
  2016-01-12  4:13 ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2016-01-12  3:09 UTC (permalink / raw)
  To: btrfs

Hi all,

As some of you already know, we are implementing btrfs in-band de-duplication.
We already have a working and stable version internally.
But that's filesystem wide de-duplication.


Now we hope to add support to enable/disable dedup per-file.
Much like current NODATACOW/NOCOMPRESS for inode.


But we are not sure where to add such a flag.
Here are our current ideas:

1) XATTR
Make it a btrfs-internal xattr, just like some existing btrfs properties.

2) inode flag, like FS_NOCOMP_FL
Although only btrfs is going to support in-band dedup, who knows
what will happen in the future?
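
For illustration, the two options above could be modeled roughly as
follows. This is an in-memory sketch only, not btrfs code. The xattr name
"btrfs.dedup" and the bit DEDUP_CTL_FL are hypothetical placeholders; only
FS_NOCOMP_FL and its value are taken from <linux/fs.h>. The polarity of
the setting (enable or disable) is deliberately left open.

```python
# In-memory model only: nothing here talks to a real filesystem.
from dataclasses import dataclass, field

FS_NOCOMP_FL = 0x00000400      # real flag value from <linux/fs.h>
DEDUP_CTL_FL = 0x04000000      # HYPOTHETICAL bit, placeholder value only
DEDUP_XATTR = "btrfs.dedup"    # HYPOTHETICAL internal property name


@dataclass
class Inode:
    flags: int = 0                              # option 2: inode flag word
    xattrs: dict = field(default_factory=dict)  # option 1: name -> bytes


def set_via_xattr(inode: Inode, on: bool) -> None:
    """Option 1: keep the per-file setting in an internal xattr."""
    if on:
        inode.xattrs[DEDUP_XATTR] = b"1"
    else:
        inode.xattrs.pop(DEDUP_XATTR, None)


def set_via_flag(inode: Inode, on: bool) -> None:
    """Option 2: keep the per-file setting as one bit in the inode flags."""
    if on:
        inode.flags |= DEDUP_CTL_FL
    else:
        inode.flags &= ~DEDUP_CTL_FL


def is_set_via_xattr(inode: Inode) -> bool:
    return DEDUP_XATTR in inode.xattrs


def is_set_via_flag(inode: Inode) -> bool:
    return bool(inode.flags & DEDUP_CTL_FL)
```

In this model the flag variant spends one bit of a flag word whose
definitions are shared across filesystems, while the xattr variant keeps
the setting in a btrfs-specific namespace. That is the trade-off the
question is about.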

Any advice is welcome.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: About per-file dedup flag
  2016-01-12  3:09 About per-file dedup flag Qu Wenruo
@ 2016-01-12  4:13 ` Duncan
  2016-01-12  4:51   ` Qu Wenruo
  0 siblings, 1 reply; 4+ messages in thread
From: Duncan @ 2016-01-12  4:13 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Tue, 12 Jan 2016 11:09:23 +0800 as excerpted:

> Now we hope to add support to enable/disable dedup per-file.
> Much like current NODATACOW/NOCOMPRESS for inode.

How is this going to work?

NODATACOW/NOCOMPRESS can apply to a single file.  But a dup flag, by 
definition, needs two files, except for the special case of parts of a 
file duplicating other parts of the same file.  Is there going to be some 
background thread that checks for dups and reflinks duplicated extents if 
both files have the dup flag set?  What if one has it on and one has it 
off?

Presumably, if a file has it on and it is copied (so a new file), the 
copy would be reflinked.  But if the flag is off, does that make the file 
actually data-copy, by default, even if cp decides to do a reflink copy 
by default?  And does the copy automatically have the dup flag set as 
well, or does the original instance set dup, while the new copy, reflinked 
to the old one due to that dup flag, still have the dup flag unset, until 
the user sets it?

OTOH, I can see such an attribute for dirs making more sense, since it 
could be inherited much like the NOCOW attribute, and new files created 
there could automatically be checked against the current files to see if 
parts are dup, and reflink them if so.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: About per-file dedup flag
  2016-01-12  4:13 ` Duncan
@ 2016-01-12  4:51   ` Qu Wenruo
  2016-01-12  5:11     ` Duncan
  0 siblings, 1 reply; 4+ messages in thread
From: Qu Wenruo @ 2016-01-12  4:51 UTC (permalink / raw)
  To: Duncan, linux-btrfs



Duncan wrote on 2016/01/12 04:13 +0000:
> Qu Wenruo posted on Tue, 12 Jan 2016 11:09:23 +0800 as excerpted:
>
>> Now we hope to add support to enable/disable dedup per-file.
>> Much like current NODATACOW/NOCOMPRESS for inode.
>
> How is this going to work?
>
> NODATACOW/NOCOMPRESS can apply to a single file.  But a dup flag, by
> definition, needs two files, except for the special case of parts of a
> file duplicating other parts of the same file.  Is there going to be some
> background thread that checks for dups and reflinks duplicated extents if
> both files have the dup flag set?  What if one has it on and one has it
> off?

You are still thinking in terms of off-band dedup.

For off-band dedup, we need two extents to compare.

But for in-band dedup, we are not using reflink or a similar facility.
Instead, we have a hash pool that records some or all of the hashes of
the extents we already know about.
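
As a rough sketch of what such a hash pool could look like, the model
below maps a data hash to the address of an extent already on disk and
keeps only a bounded number of entries. SHA-256 and the
least-recently-used eviction are assumptions made for this example; the
thread does not say which hash or which eviction policy the real
implementation uses. The class and method names are hypothetical.

```python
import hashlib
from collections import OrderedDict


class HashPool:
    """Bounded map: data hash -> address of an extent already on disk."""

    def __init__(self, limit: int):
        self.limit = limit            # maximum number of hashes kept
        self._pool = OrderedDict()    # oldest entry first

    def lookup(self, data: bytes):
        """Return the address of a known identical extent, or None."""
        key = hashlib.sha256(data).digest()
        addr = self._pool.get(key)
        if addr is not None:
            self._pool.move_to_end(key)     # mark as recently used
        return addr

    def insert(self, data: bytes, addr: int) -> None:
        """Remember that `data` is stored at `addr`."""
        key = hashlib.sha256(data).digest()
        self._pool[key] = addr
        self._pool.move_to_end(key)
        while len(self._pool) > self.limit:
            self._pool.popitem(last=False)  # evict least recently used hash
```

Because the pool may hold only part of the known hashes, a miss does not
prove the data is unique. It only means the write goes to disk as a new
extent.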

So it should be quite easy to understand:

In the normal case (no NODEDUP flag), valid data (page cache) is hashed
to find out whether it duplicates an existing extent.

In the NODEDUP case, the file's page cache is written directly to disk,
or compressed and then written to disk.
No hash is calculated.
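
The two cases can be sketched as a toy write path. This is a simplified
model, not the real code: the Volume class, its fields and the use of
SHA-256 are assumptions for illustration, and compression is left out.

```python
import hashlib


class Volume:
    """Toy model of an in-band dedup write path."""

    def __init__(self):
        self.extents = []          # extents written to "disk"
        self.hash_pool = {}        # hash -> index into self.extents
        self.hashes_computed = 0   # counts how often data was hashed

    def write(self, data: bytes, nodedup: bool = False) -> int:
        """Return the index of the extent that backs this write."""
        if nodedup:
            # NODEDUP file: write straight through, never touch the pool.
            self.extents.append(data)
            return len(self.extents) - 1

        # Normal case: hash the data and look for a known duplicate.
        digest = hashlib.sha256(data).digest()
        self.hashes_computed += 1
        hit = self.hash_pool.get(digest)
        if hit is not None:
            return hit             # duplicate: reuse the existing extent

        self.extents.append(data)
        index = len(self.extents) - 1
        self.hash_pool[digest] = index
        return index
```

Writing the same block twice without the flag yields one extent, while a
third write of the same block with nodedup=True adds a second extent and
computes no hash.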

Thanks,
Qu

>
> Presumably, if a file has it on and it is copied (so a new file), the
> copy would be reflinked.  But if the flag is off, does that make the file
> actually data-copy, by default, even if cp decides to do a reflink copy
> by default?  And does the copy automatically have the dup flag set as
> well, or does the original instance set dup, while the new copy, reflinked
> to the old one due to that dup flag, still have the dup flag unset, until
> the user sets it?
>
> OTOH, I can see such an attribute for dirs making more sense, since it
> could be inherited much like the NOCOW attribute, and new files created
> there could automatically be checked against the current files to see if
> parts are dup, and reflink them if so.
>




* Re: About per-file dedup flag
  2016-01-12  4:51   ` Qu Wenruo
@ 2016-01-12  5:11     ` Duncan
  0 siblings, 0 replies; 4+ messages in thread
From: Duncan @ 2016-01-12  5:11 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Tue, 12 Jan 2016 12:51:33 +0800 as excerpted:

> Duncan wrote on 2016/01/12 04:13 +0000:
>> Qu Wenruo posted on Tue, 12 Jan 2016 11:09:23 +0800 as excerpted:
>>
>>> Now we hope to add support to enable/disable dedup per-file.
>>> Much like current NODATACOW/NOCOMPRESS for inode.
>>
>> How is this going to work?
>>
>> NODATACOW/NOCOMPRESS can apply to a single file.  But a dup flag, by
>> definition, needs two files, except for the special case of parts of a
>> file duplicating other parts of the same file.
> 
> You are still thinking in terms of off-band dedup.

> So it should be quite easy to understand:
> 
> In the normal case (no NODEDUP flag), valid data (page cache) is hashed
> to find out whether it duplicates an existing extent.
> 
> In the NODEDUP case, the file's page cache is written directly to disk,
> or compressed and then written to disk.
> No hash is calculated.

Oh, _NO_DEDUP.  =:^)

That is the opposite of the dedup logic implied by the subject, and 
nothing in the original post hinted that the logic was actually reversed.

NODEDUP indeed makes more sense, since with a mount or filesystem option 
enabling dedup, it would then be the default and nodedup as a per-file 
exception is the next logical extension.

Thanks.  I knew I must be missing something.  A little negation makes a 
big difference!  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


