linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* About fi du and reflink/dedupe
@ 2016-04-22  2:57 Qu Wenruo
  2016-04-22 17:46 ` Mark Fasheh
  0 siblings, 1 reply; 3+ messages in thread
From: Qu Wenruo @ 2016-04-22  2:57 UTC (permalink / raw)
  To: Mark Fasheh, btrfs

Hi Mark,

Thanks for your contribution to btrfs-filesystem-du command.

However there seems to be some strange behavior related to reflinke(and 
further in-band dedupe).
(And the root cause is lying quite deep into kernel backref resolving codes)

["Exclusive" value not really exclsuive]
When a file with 2 file extents, and the 2nd file extent points to the 
1st one, the fi du gives wrong answer

The following command can create such file easily.

# mkfs.btrfs -f /dev/sdb5
# mount /dev/sdb5 /mnt/test
# xfs_io -f -c "pwrite 0 128K" /mnt/test/tmp
# xfs_io -c "reflink /mnt/test/tmp 0 128K 128K" /mnt/test/tmp
# btrfs fi du /mnt/test
      Total   Exclusive  Set shared  Filename
  256.00KiB   256.00KiB           -  /mnt/test//tmp
  256.00KiB   256.00KiB       0.00B  /mnt/test/

Total seems to be OK, while I am confused of the exclusive value.

As the above method will only create one real data extent, which takes 
128K, and if following the qgroup definition, its exclusive should be 
128K other than 256K.


Fi du uses FIEMAP ioctl to get the fiemap, and fi du uses the SHARED 
flag to determine whether it is shared.

However that SHARED flag doesn't handle case like this, in which 
ino/root are all the same, only extent offset is different.


And what's more, if we modify btrfs_check_shared() to return SHARED flag 
for such case, we will get 0 exclusive value for it.
Which is quite strang. (I assume the exclusive should be 128K)

[Slow btrfs_check_shared() performance]
In above case, btrfs fi du returns very fast.
But when the file is in-band deduped and size goes to 1G.
btrfs_check_shared() will take a lot of time to return, as it will do 
backref walk through.

This would be a super huge problem for inband dedupe.


[Possible solution]
Would you please consider to judge shared extent in user space?
And don't rely on the SHARED flag from fiemap.

The work flow would be like:

1) Call fiemap skipping FIEMAP_EXTENT_SHARED flag
    Although we still need to modify kernel to avoid btrfs_check_shared()

2) Get the disk bytenr and record it in user space bytenr pool
3) Compare each file extent disk bytenr with bytenr pool
    And like qgroup, use this to build a rfer/excl data for each inode.

At least, this method would handle above exclusive value and avoid 
year-long fiemap ioctl call in in-band dedupe case.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: About fi du and reflink/dedupe
  2016-04-22  2:57 About fi du and reflink/dedupe Qu Wenruo
@ 2016-04-22 17:46 ` Mark Fasheh
  2016-04-25  0:46   ` Qu Wenruo
  0 siblings, 1 reply; 3+ messages in thread
From: Mark Fasheh @ 2016-04-22 17:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: btrfs

On Fri, Apr 22, 2016 at 10:57:29AM +0800, Qu Wenruo wrote:
> Hi Mark,
> 
> Thanks for your contribution to btrfs-filesystem-du command.
> 
> However there seems to be some strange behavior related to
> reflinke(and further in-band dedupe).
> (And the root cause is lying quite deep into kernel backref resolving codes)
> 
> ["Exclusive" value not really exclsuive]
> When a file with 2 file extents, and the 2nd file extent points to
> the 1st one, the fi du gives wrong answer
> 
> The following command can create such file easily.
> 
> # mkfs.btrfs -f /dev/sdb5
> # mount /dev/sdb5 /mnt/test
> # xfs_io -f -c "pwrite 0 128K" /mnt/test/tmp
> # xfs_io -c "reflink /mnt/test/tmp 0 128K 128K" /mnt/test/tmp
> # btrfs fi du /mnt/test
>      Total   Exclusive  Set shared  Filename
>  256.00KiB   256.00KiB           -  /mnt/test//tmp
>  256.00KiB   256.00KiB       0.00B  /mnt/test/
> 
> Total seems to be OK, while I am confused of the exclusive value.
> 
> As the above method will only create one real data extent, which
> takes 128K, and if following the qgroup definition, its exclusive
> should be 128K other than 256K.

Ok that's a bug in how we're counting these. We already record extent start
offsets so it's easy enough to see when we have the same extent in a file
while we fiemap it. Thanks for reporting this I'll take a look at a fix.


> And what's more, if we modify btrfs_check_shared() to return SHARED
> flag for such case, we will get 0 exclusive value for it.
> Which is quite strang. (I assume the exclusive should be 128K)
> 
> [Slow btrfs_check_shared() performance]
> In above case, btrfs fi du returns very fast.
> But when the file is in-band deduped and size goes to 1G.
> btrfs_check_shared() will take a lot of time to return, as it will
> do backref walk through.
> 
> This would be a super huge problem for inband dedupe.
> 
> 
> [Possible solution]
> Would you please consider to judge shared extent in user space?
> And don't rely on the SHARED flag from fiemap.

_Absoletely Not_

We don't ask userspace to modify their applications if there's a peformance
problem in fiemap, we fix the performance problem in fiemap. Off the top of
my head I can think of at least TWO other applications which rely on fiemap
heavily. You will have very little luck in asking them to modify their
applications.

If btrfs fiemap is broken, we fix that full stop.

More specifically, If in-band dedupe is causing fiemap to go out to lunch
'for a year', we need to address the core problem in in-band dedupe. If it's
a general problem in btrfs fiemap when we need to track it down before users
start yelling at us.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: About fi du and reflink/dedupe
  2016-04-22 17:46 ` Mark Fasheh
@ 2016-04-25  0:46   ` Qu Wenruo
  0 siblings, 0 replies; 3+ messages in thread
From: Qu Wenruo @ 2016-04-25  0:46 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: btrfs



Mark Fasheh wrote on 2016/04/22 10:46 -0700:
> On Fri, Apr 22, 2016 at 10:57:29AM +0800, Qu Wenruo wrote:
>> Hi Mark,
>>
>> Thanks for your contribution to btrfs-filesystem-du command.
>>
>> However there seems to be some strange behavior related to
>> reflinke(and further in-band dedupe).
>> (And the root cause is lying quite deep into kernel backref resolving codes)
>>
>> ["Exclusive" value not really exclsuive]
>> When a file with 2 file extents, and the 2nd file extent points to
>> the 1st one, the fi du gives wrong answer
>>
>> The following command can create such file easily.
>>
>> # mkfs.btrfs -f /dev/sdb5
>> # mount /dev/sdb5 /mnt/test
>> # xfs_io -f -c "pwrite 0 128K" /mnt/test/tmp
>> # xfs_io -c "reflink /mnt/test/tmp 0 128K 128K" /mnt/test/tmp
>> # btrfs fi du /mnt/test
>>      Total   Exclusive  Set shared  Filename
>>  256.00KiB   256.00KiB           -  /mnt/test//tmp
>>  256.00KiB   256.00KiB       0.00B  /mnt/test/
>>
>> Total seems to be OK, while I am confused of the exclusive value.
>>
>> As the above method will only create one real data extent, which
>> takes 128K, and if following the qgroup definition, its exclusive
>> should be 128K other than 256K.
>
> Ok that's a bug in how we're counting these. We already record extent start
> offsets so it's easy enough to see when we have the same extent in a file
> while we fiemap it. Thanks for reporting this I'll take a look at a fix.

The root cause is how we implement btrfs_check_shared().

It's using the most time consuming one, by walk through all backref 
until hit one backref whose inode/root is not the same with given one.

This works fine for small and non-deduped(no matter inband or 
out-of-band) files.

But when things go large, for example 8192 file extents, and no other 
inode/root is referring it, the backref walk takes long long time.

>
>
>> And what's more, if we modify btrfs_check_shared() to return SHARED
>> flag for such case, we will get 0 exclusive value for it.
>> Which is quite strang. (I assume the exclusive should be 128K)
>>
>> [Slow btrfs_check_shared() performance]
>> In above case, btrfs fi du returns very fast.
>> But when the file is in-band deduped and size goes to 1G.
>> btrfs_check_shared() will take a lot of time to return, as it will
>> do backref walk through.
>>
>> This would be a super huge problem for inband dedupe.
>>
>>
>> [Possible solution]
>> Would you please consider to judge shared extent in user space?
>> And don't rely on the SHARED flag from fiemap.
>
> _Absoletely Not_
>
> We don't ask userspace to modify their applications if there's a peformance
> problem in fiemap, we fix the performance problem in fiemap. Off the top of
> my head I can think of at least TWO other applications which rely on fiemap
> heavily. You will have very little luck in asking them to modify their
> applications.
>
> If btrfs fiemap is broken, we fix that full stop.

OK we will try to fix it by using faster lookup method and check backref 
offset.

But still, I am a little concerned on the fi du behavior on the 
exclusive values.

For example, if we fixed the fiemap performance bug and returning 
correct SHARED flag. For the following file layout: (a perfectly deduped 
one)
Offset: 0, len: 128K, disk bytenr: X, disk num bytes: 128K
Offset: 128K, len: 128K, disk bytenr: X, disk num bytes: 128K
Offset: 256K, len: 128K, disk bytenr: X, disk num bytes: 128K

And the disk bytenr X is only refered by this file.

What's the correct exclusive value for the file?
For current implement it would be 0 as all file extents have SHARED 
flag, but I thought it would be 128K, as the X is only referred by this 
file.

Any idea on this?

>
> More specifically, If in-band dedupe is causing fiemap to go out to lunch
> 'for a year', we need to address the core problem in in-band dedupe. If it's
> a general problem in btrfs fiemap when we need to track it down before users
> start yelling at us.
> 	--Mark

That's a fiemap problem, and dedupe (inband or out-of-band both) just 
makes it more easy to trigger.
(From this respect of view, inband dedupe is quite good at spotting 
corner case bugs)

Even without dedupe, user can still trigger it by reflinking.

As the old backref codes doesn't expect we will have so many reference 
on a single small extent.

This also reminds me that, we need to further enhance backref iteration 
codes, to make it better handle such case.

Thanks,
Qu

>
> --
> Mark Fasheh
>
>



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-04-25  0:46 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-22  2:57 About fi du and reflink/dedupe Qu Wenruo
2016-04-22 17:46 ` Mark Fasheh
2016-04-25  0:46   ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).