* Strange data backref offset?
@ 2015-07-17 2:38 Qu Wenruo
2015-07-18 11:35 ` Liu Bo
2015-07-29 16:24 ` Strange data backref offset? Filipe David Manana
0 siblings, 2 replies; 11+ messages in thread
From: Qu Wenruo @ 2015-07-17 2:38 UTC (permalink / raw)
To: btrfs
Hi all,
While developing a new btrfs inband dedup mechanism, I found that btrfsck
and the kernel behave strangely for clone.
[Reproducer]
# mount /dev/sdc -t btrfs /mnt/test
# dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
# sync
# ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
# sync
Then btrfs-debug-tree gives quite strange result on the data backref:
------
<EXTENT TREE>
item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
extent refs 3 gen 6 flags DATA
extent data backref root 5 objectid 257 offset 0 count 1
extent data backref root 5 objectid 258 offset
18446744073709547520 count 1
<FS TREE>
item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0
item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 4096 nr 4096 ram 16384
extent compression 0
------
The offset is the file extent's key.offset minus the file extent's offset,
which here is 0 - 4096, causing the overflowed result.
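Just to make the wraparound explicit (plain C arithmetic only, not the actual
kernel code):
------
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* backref offset = file extent key.offset - file extent data offset */
        uint64_t key_offset = 0;       /* file2's EXTENT_DATA key.offset     */
        uint64_t extent_offset = 4096; /* offset into the shared 16K extent  */
        uint64_t backref = key_offset - extent_offset;

        /* unsigned subtraction wraps around to 2^64 - 4096 */
        printf("%llu\n", (unsigned long long)backref); /* 18446744073709547520 */
        return 0;
}
------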
Both the kernel and fsck use this behavior, so fsck accepts the strange value.
But shouldn't the offset in the data backref match the key.offset of
the file extent?
And I'm quite sure that changing this behavior would hugely break fsck and
the kernel, but I'm wondering: is this a known bug or a feature, and will it
be handled?
Thanks,
Qu
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-17 2:38 Strange data backref offset? Qu Wenruo
@ 2015-07-18 11:35 ` Liu Bo
2015-07-19 7:23 ` Zygo Blaxell
2015-07-29 16:24 ` Strange data backref offset? Filipe David Manana
1 sibling, 1 reply; 11+ messages in thread
From: Liu Bo @ 2015-07-18 11:35 UTC (permalink / raw)
To: Qu Wenruo; +Cc: btrfs
On Fri, Jul 17, 2015 at 10:38:32AM +0800, Qu Wenruo wrote:
> Hi all,
>
> While I'm developing a new btrfs inband dedup mechanism, I found btrfsck and
> kernel doing strange behavior for clone.
>
> [Reproducer]
> # mount /dev/sdc -t btrfs /mnt/test
> # dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
> # sync
> # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
> # sync
>
> Then btrfs-debug-tree gives quite strange result on the data backref:
> ------
> <EXTENT TREE>
> item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
> extent refs 3 gen 6 flags DATA
> extent data backref root 5 objectid 257 offset 0 count 1
> extent data backref root 5 objectid 258 offset
> 18446744073709547520 count 1
>
> <FS TREE>
> item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 0 nr 16384 ram 16384
> extent compression 0
> item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 4096 nr 4096 ram 16384
> extent compression 0
> ------
>
> The offset is the file extent's key.offset minus the file extent's offset,
> which here is 0 - 4096, causing the overflowed result.
>
> Both the kernel and fsck use this behavior, so fsck accepts the strange value.
>
> But shouldn't the offset in the data backref match the key.offset of the
> file extent?
>
> And I'm quite sure that changing this behavior would hugely break fsck and
> the kernel, but I'm wondering: is this a known bug or a feature, and will
> it be handled?
I also found this before. One benefit is saving metadata in the extent
tree: if we overwrite inside an extent, the extent ref count simply
increases to 2, since both file extent items use (key.offset - extent_offset)
as the backref offset.
0k 12k
|-------------------|
|--------|
4k 8k
one EXTENT_DATA item is at 0k and the other one at 8k; the corresponding
refs will be (0k - 0) and (8k - 8k), i.e. both 0, so they share one backref.
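To put the same thing in code form (an illustration only, not the kernel code):
------
#include <stdint.h>

/* Both file extent items above resolve to the same backref offset, so
 * they share a single backref entry whose count becomes 2. */
static uint64_t backref_offset(uint64_t key_offset, uint64_t extent_offset)
{
        return key_offset - extent_offset;
}

/* first item:  backref_offset(0, 0)             == 0 */
/* second item: backref_offset(8 << 10, 8 << 10) == 0 */
------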
Changing this would be an on-disk format change, so it won't be easy. I'd
prefer a workaround on the clone side.
Thanks,
-liubo
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-18 11:35 ` Liu Bo
@ 2015-07-19 7:23 ` Zygo Blaxell
2015-07-20 2:24 ` Qu Wenruo
0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2015-07-19 7:23 UTC (permalink / raw)
To: Liu Bo; +Cc: Qu Wenruo, btrfs
On Sat, Jul 18, 2015 at 07:35:31PM +0800, Liu Bo wrote:
> On Fri, Jul 17, 2015 at 10:38:32AM +0800, Qu Wenruo wrote:
> > Hi all,
> >
> > While I'm developing a new btrfs inband dedup mechanism, I found btrfsck and
> > kernel doing strange behavior for clone.
> >
> > [Reproducer]
> > # mount /dev/sdc -t btrfs /mnt/test
> > # dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
> > # sync
> > # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
> > # sync
> >
> > Then btrfs-debug-tree gives quite strange result on the data backref:
> > ------
> > <EXTENT TREE>
> > item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
> > extent refs 3 gen 6 flags DATA
> > extent data backref root 5 objectid 257 offset 0 count 1
> > extent data backref root 5 objectid 258 offset
> > 18446744073709547520 count 1
> >
> > <FS TREE>
> > item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
> > extent data disk byte 12845056 nr 16384
> > extent data offset 0 nr 16384 ram 16384
> > extent compression 0
> > item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
> > extent data disk byte 12845056 nr 16384
> > extent data offset 4096 nr 4096 ram 16384
> > extent compression 0
> > ------
> >
> > The offset is the file extent's key.offset minus the file extent's offset,
> > which here is 0 - 4096, causing the overflowed result.
> >
> > Both the kernel and fsck use this behavior, so fsck accepts the strange value.
> >
> > But shouldn't the offset in the data backref match the key.offset of the
> > file extent?
> >
> > And I'm quite sure that changing this behavior would hugely break fsck and
> > the kernel, but I'm wondering: is this a known bug or a feature, and will
> > it be handled?
>
> Also found this before, one of the benefits is to save metadata in extent
> tree, that is, if we overwrite inside an extent, the extent ref count
> increases to 2 since both use (key.offset - extent_offset).
>
> 0k 12k
> |-------------------|
> |--------|
> 4k 8k
>
> one EXTENT_DATA item is 0k and the other one is 8k, the corresponding
> refs will be (0k - 0) and (8k - 8k).
>
> It's a format change so it won't be easy to make a change. I'd prefer a
> workaround on clone side.
The workaround for dedup currently involves copying data. It could be
done by manipulating the metadata directly, but btrfs doesn't seem to
provide this.
Suppose our dedup engine has identified duplicate blocks in distinct
extents, and we want to replace them by cloning:
0k 12k
|AAAAABBBBBBBBCCCCCC| <- file #1 extent #1
|BBBBBBBB| <- file #2 extent #2
4k 8k
If we clone the "B" part of extent #1 over extent #2, we save one block
by eliminating the last reference to extent #2; however, if file #1 is
later deleted, we waste 8K. File #2 will refer to only 4K out of extent
#1's 12K. Any further clones of file #2 will add references to this
12K extent. The 8K of wasted data cannot be accessed from userspace,
so it's unavailable for allocation and can't be reused by future clones.
Assuming dedup finds many further copies of the "B" data, and they're
all replaced by clones referring to the "B" blocks of extent #1, those
8K are effectively wasted forever. If extent #1 is larger (up to 1GB),
more space is wasted. The only way to get the space back after this
failure is to locate every reference to extent #1 in the entire filesystem
(i.e. every instance of the "B" data) and delete it.
0k 12k
|AAAAABBBBBBBBCCCCCC| <- file #1 extent #1, "B" shared with file #2
aaaa|BBBBBBBB|ccccc <- file #2 "a" and "c" blocks referenced
4k 8k but not used (i.e. wasted) in file #2
If we clone extent #2 over extent #1, we avoid the above problem, but
we save nothing. The refs to extent #1 from file #1 will keep the "B"
blocks from extent #1 on disk even though they are not accessible:
0k /bbbbbbbb\ 12k <- file #1 unused blocks from extent #1
|AAAA|^wasted^|CCCCC| <- file #1 blocks "A" and "C" from extent #1
\BBBBBBBB/ <- file #1 blocks "B" from extent #2
|BBBBBBBB| <- file #2 extent #2 shared with file #1
4k 8k
If extent #2 is larger than 4K, but only 4K of data is duplicate, then
we get the same result as when we replaced extent #2 with a clone of
part of extent #1: wasted space.
A few popular block values on a filesystem can end up being duplicated
thousands of times, so we can't afford to just ignore them. Such blocks
are usually part of much larger extents where the extents as a whole are
unique, so we can't get rid of the duplicate blocks without modifying
the extent boundaries.
Dedup has to replace extent #1 (and in some cases also extent #2) with
up to three new extents so that all duplicate extents have matching
boundaries before cloning:
0k 12k <- file #1 extent #1 deleted
|AAAA|BBBBBBBB|CCCCC| <- file #1 extent #3, #4, #5
|BBBBBBBB| <- file #2 extent #2
4k 8k <- extent boundaries around "B" aligned
Then we can clone extent #2 from file #2 over extent #4 in file #1:
0k 12k <- file #1 is now extents #3, #2, #5
|AAAA|BBBBBBBB|CCCCC| <- replace extent #4 with clone of #2
|BBBBBBBB| <- file #2 extent #2 shared with file #1
4k 8k <- no wasted space in either file
Since we are just deleting extent #4 in this case, we didn't have to
bother creating it. An optimized implementation could skip that step.
A differently optimized implementation keeps #4 but deletes #2 because
then file #1 can be read with fewer seeks:
0k 12k <- file #1 data physically contiguous
|AAAA|BBBBBBBB|CCCCC| <- file #1 is now extents #3, #4, #5
|BBBBBBBB| <- file #2 extent #4 shared with file #1
4k 8k <- file #2 extent #2 replaced by extent #4
Both extents might have other shared references. When any extent is
split up, all references to the extent have to be split the same way.
How do we split extents into smaller pieces? Copy the contents to
multiple new extents, then use those new extents to replace the original
extent along the same boundaries as the duplicate data block ranges.
Currently a user-space dedup agent has to rely on ioctls like
FILE_EXTENT_SAME and DEFRAG_RANGE to provide the required atomic
copy-and-replace semantics, and FIEMAP to discover where existing
extent boundaries are. FIEMAP doesn't provide a complete picture of the
underlying extent structure--in particular, FIEMAP does not tell us when
a logical extent boundary does not match a btrfs physical extent boundary
and a split is required. DEFRAG_RANGE refuses to split some contiguous
extents into smaller, non-contiguous extents, and even when it does,
there are other heuristics in btrfs that reassemble the carefully sized
smaller extents back into useless larger ones. FILE_EXTENT_SAME doesn't
split physical extents when it needs to--it just creates broken shared
extent references with all the problems described above.
Copying the data to multiple temporary files in user-space (so they
can't be coalesced into larger extents) seems to create the split extents
required, and FILE_EXTENT_SAME can be used to atomically splice the new
extents back into the original files, replacing the original extents.
Userspace is left to find all the refs itself--there are several ways to
do that with various time/space/complexity trade-offs. It's _possible_
to implement a working dedup in userspace now, but it's not easy to get
every corner case right.
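For reference, the splice step itself looks roughly like this in user space
(a minimal sketch; error handling and the per-call length limit are glossed
over):
------
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>        /* BTRFS_IOC_FILE_EXTENT_SAME */

/* Ask btrfs to replace len bytes at dst_off in dst_fd with a shared
 * reference to the identical bytes at src_off in src_fd. */
static int dedup_one_range(int src_fd, __u64 src_off,
                           int dst_fd, __u64 dst_off, __u64 len)
{
        struct btrfs_ioctl_same_args *args;
        struct btrfs_ioctl_same_extent_info *info;
        int ret;

        args = calloc(1, sizeof(*args) + sizeof(*info));
        if (!args)
                return -1;
        args->logical_offset = src_off;         /* source range */
        args->length = len;
        args->dest_count = 1;
        info = &args->info[0];
        info->fd = dst_fd;                      /* destination range */
        info->logical_offset = dst_off;

        ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
        if (ret == 0 && info->status == 0)
                printf("deduped %llu bytes\n",
                       (unsigned long long)info->bytes_deduped);
        free(args);
        return ret;
}
------
The awkward part is everything that has to happen before this call: making
sure the extents on both sides already have the right boundaries.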
It's a little easier in the kernel. FILE_EXTENT_SAME could split
extents as required, and use a non-broken internal implementation
of DEFRAG_RANGE that doesn't refuse to copy data it was asked to.
A flag can be added somewhere to disable the heuristics that coalesce
consecutive extents together when those extents are intentionally split,
but preserve the existing behavior of coalescing consecutive duplicate
extents when they are the maximum length accepted by FILE_EXTENT_SAME.
The kernel can chase all the extent backrefs more easily and completely
than userspace can. An in-kernel dedup might even be able to make wasted
blocks accessible again if their contents should reappear in new files.
On a heavily defragmented filesystem the data copy required could be a
little under 2GB for a single pair of cloned ranges (in the worst case
of 4K cloned in two maximum-sized extents). Currently FILE_EXTENT_SAME
refuses to do work involving more than 32MB of reads for a single
extent pair. There's a pretty big gap between 32MB of reads and 4GB of
reads _and_ writes. FILE_EXTENT_SAME also takes a vector of extents,
which multiplies the work.
It would be better if the kernel could manipulate the metadata directly to
split extents. Dedup does work now with copies, but quite inefficiently.
> Thanks,
>
> -liubo
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-19 7:23 ` Zygo Blaxell
@ 2015-07-20 2:24 ` Qu Wenruo
2015-07-21 4:55 ` Zygo Blaxell
0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-07-20 2:24 UTC (permalink / raw)
To: Zygo Blaxell, Liu Bo; +Cc: btrfs
Zygo Blaxell wrote on 2015/07/19 03:23 -0400:
> On Sat, Jul 18, 2015 at 07:35:31PM +0800, Liu Bo wrote:
>> On Fri, Jul 17, 2015 at 10:38:32AM +0800, Qu Wenruo wrote:
>>> Hi all,
>>>
>>> While I'm developing a new btrfs inband dedup mechanism, I found btrfsck and
>>> kernel doing strange behavior for clone.
>>>
>>> [Reproducer]
>>> # mount /dev/sdc -t btrfs /mnt/test
>>> # dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
>>> # sync
>>> # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
>>> # sync
>>>
>>> Then btrfs-debug-tree gives quite strange result on the data backref:
>>> ------
>>> <EXTENT TREE>
>>> item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
>>> extent refs 3 gen 6 flags DATA
>>> extent data backref root 5 objectid 257 offset 0 count 1
>>> extent data backref root 5 objectid 258 offset
>>> 18446744073709547520 count 1
>>>
>>> <FS TREE>
>>> item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
>>> extent data disk byte 12845056 nr 16384
>>> extent data offset 0 nr 16384 ram 16384
>>> extent compression 0
>>> item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
>>> extent data disk byte 12845056 nr 16384
>>> extent data offset 4096 nr 4096 ram 16384
>>> extent compression 0
>>> ------
>>>
>>> The offset is the file extent's key.offset minus the file extent's offset,
>>> which here is 0 - 4096, causing the overflowed result.
>>>
>>> Both the kernel and fsck use this behavior, so fsck accepts the strange value.
>>>
>>> But shouldn't the offset in the data backref match the key.offset of the
>>> file extent?
>>>
>>> And I'm quite sure that changing this behavior would hugely break fsck and
>>> the kernel, but I'm wondering: is this a known bug or a feature, and will
>>> it be handled?
>>
>> Also found this before, one of the benefits is to save metadata in extent
>> tree, that is, if we overwrite inside an extent, the extent ref count
>> increases to 2 since both use (key.offset - extent_offset).
>>
>> 0k 12k
>> |-------------------|
>> |--------|
>> 4k 8k
>>
>> one EXTENT_DATA item is 0k and the other one is 8k, the corresponding
>> refs will be (0k - 0) and (8k - 8k).
Thanks, Liu.
This makes sense and is quite convincing, so I'm OK with it now.
>>
>> It's a format change so it won't be easy to make a change. I'd prefer a
>> workaround on clone side.
As you explained, and since the current kernel and fsck work with it even
though it looks like a strange negative number, I'm OK with it.
So I'd prefer not to change the behavior even though it's odd.
>
> The workaround for dedup currently involves copying data. It could be
> done by manipulating the metadata directly, but btrfs doesn't seem to
> provide this.
>
> Suppose our dedup engine has identified duplicate blocks in distinct
> extents, and we want to replace them by cloning:
>
> 0k 12k
> |AAAAABBBBBBBBCCCCCC| <- file #1 extent #1
> |BBBBBBBB| <- file #2 extent #2
> 4k 8k
>
> If we clone the "B" part of extent #1 over extent #2, we save one block
> by eliminating the last reference to extent #2; however, if file #1 is
> later deleted, we waste 8K. File #2 will refer to only 4K out of extent
> #1's 12K. Any further clones of file #2 will add references to this
> 12K extent. The 8K of wasted data cannot be accessed from userspace,
> so it's unavailable for allocation and can't be reused by future clones.
>
> Assuming dedup finds many further copies of the "B" data, and they're
> all replaced by clones referring to the "B" blocks of extent #1, those
> 8K are effectively wasted forever. If extent #1 is larger (up to 1GB),
> more space is wasted. The only way to get the space back after this
> failure is to locate every reference to extent #1 in the entire filesystem
> (i.e. every instance of the "B" data) and delete it.
>
> 0k 12k
> |AAAAABBBBBBBBCCCCCC| <- file #1 extent #1, "B" shared with file #2
> aaaa|BBBBBBBB|ccccc <- file #2 "a" and "c" blocks referenced
> 4k 8k but not used (i.e. wasted) in file #2
>
> If we clone extent #2 over extent #1, we avoid the above problem, but
> we save nothing. The refs to extent #1 from file #1 will keep the "B"
> blocks from extent #1 on disk even though they are not accessible:
>
> 0k /bbbbbbbb\ 12k <- file #1 unused blocks from extent #1
> |AAAA|^wasted^|CCCCC| <- file #1 blocks "A" and "C" from extent #1
> \BBBBBBBB/ <- file #1 blocks "B" from extent #2
> |BBBBBBBB| <- file #2 extent #2 shared with file #1
> 4k 8k
>
> If extent #2 is larger than 4K, but only 4K of data is duplicate, then
> we get the same result as when we replaced extent #2 with a clone of
> part of extent #1: wasted space.
>
> A few popular block values on a filesystem can end up being duplicated
> thousands of times, so we can't afford to just ignore them. Such blocks
> are usually part of much larger extents where the extents as a whole are
> unique, so we can't get rid of the duplicate blocks without modifying
> the extent boundaries.
>
> Dedup has to replace extent #1 (and in some cases also extent #2) with
> up to three new extents so that all duplicate extents have matching
> boundaries before cloning:
>
> 0k 12k <- file #1 extent #1 deleted
> |AAAA|BBBBBBBB|CCCCC| <- file #1 extent #3, #4, #5
> |BBBBBBBB| <- file #2 extent #2
> 4k 8k <- extent boundaries around "B" aligned
>
> Then we can clone extent #2 from file #2 over extent #4 in file #1:
>
> 0k 12k <- file #1 is now extents #3, #2, #5
> |AAAA|BBBBBBBB|CCCCC| <- replace extent #4 with clone of #2
> |BBBBBBBB| <- file #2 extent #2 shared with file #1
> 4k 8k <- no wasted space in either file
>
> Since we are just deleting extent #4 in this case, we didn't have to
> bother creating it. An optimized implementation could skip that step.
> A differently optimized implementation keeps #4 but deletes #2 because
> then file #1 can be read with fewer seeks:
>
> 0k 12k <- file #1 data physically contiguous
> |AAAA|BBBBBBBB|CCCCC| <- file #1 is now extents #3, #4, #5
> |BBBBBBBB| <- file #2 extent #4 shared with file #1
> 4k 8k <- file #2 extent #2 replaced by extent #4
>
> Both extents might have other shared references. When any extent is
> split up, all references to the extent have to be split the same way.
>
> How do we split extents into smaller pieces? Copy the contents to
> multiple new extents, then use those new extents to replace the original
> extent along the same boundaries as the duplicate data block ranges.
>
> Currently a user-space dedup agent has to rely on ioctls like
> FILE_EXTENT_SAME and DEFRAG_RANGE to provide the required atomic
> copy-and-replace semantics, and FIEMAP to discover where existing
> extent boundaries are. FIEMAP doesn't provide a complete picture of the
> underlying extent structure--in particular, FIEMAP does not tell us when
> a logical extent boundary does not match a btrfs physical extent boundary
> and a split is required. DEFRAG_RANGE refuses to split some contiguous
> extents into smaller, non-contiguous extents, and even when it does,
> there are other heuristics in btrfs that reassemble the carefully sized
> smaller extents back into useless larger ones. FILE_EXTENT_SAME doesn't
> split physical extents when it needs to--it just creates broken shared
> extent references with all the problems described above.
>
> Copying the data to multiple temporary files in user-space (so they
> can't be coalesced into larger extents) seems to create the split extents
> required, and FILE_EXTENT_SAME can be used to atomically splice the new
> extents back into the original files, replacing the original extents.
> Userspace is left to find all the refs itself--there are several ways to
> do that with various time/space/complexity trade-offs. It's _possible_
> to implement a working dedup in userspace now, but it's not easy to get
> every corner case right.
>
> It's a little easier in the kernel. FILE_EXTENT_SAME could split
> extents as required, and use a non-broken internal implementation
> of DEFRAG_RANGE that doesn't refuse to copy data it was asked to.
> A flag can be added somewhere to disable the heuristics that coalesce
> consecutive extents together when those extents are intentionally split,
> but preserve the existing behavior of coalescing consecutive duplicate
> extents when they are the maximum length accepted by FILE_EXTENT_SAME.
> The kernel can chase all the extent backrefs more easily and completely
> than userspace can. An in-kernel dedup might even be able to make wasted
> blocks accessible again if their contents should reappear in new files.
Yes, your method of splitting extents sounds quite good, and it can reduce
on-disk space usage to the minimum.
Before:
0 4K 8K 12K 16K
File A:
|AAAAAAA|AAAAAAA|BBBBBBB|CCCCCCC|
On disk extent:
|<--------Extent A------------->|
File B:
|AAAAAAA|AAAAAAA|CCCCCCC|CCCCCCC|
|<--------Extent B------------->|
Total data: 2x16K extents, using 32K on disk
Metadata: 2 extent items, with 1 inline backref in each.
2 file extent items.
After:
File A:
|AAAAAAA|AAAAAAA|BBBBBBB|CCCCCCC|
|--ExtC-|--ExtC-|--ExtD-|--ExtE-|
File B:
|AAAAAAA|AAAAAAA|CCCCCCC|CCCCCCC|
|--ExtC-|--ExtC-|--ExtE-|--ExtE-|
Total data: 3x4K extents, using 12K on disk
Metadata: 3 extent items: ExtC with 3 inline backrefs, ExtD with 1 inline
backref, ExtE with 2 inline backrefs.
8 file extent items.
If I'm wrong, I'd be happy for anyone to point it out.
But I'm a little concerned about the fact that extents get quite
small (4K) and that the increasing number of backrefs/file extent items may
affect performance.
[inband vs out-of-band]
I think kernel inband deduplication should be *fast*, *simple*, and *flexible*.
It doesn't need to dedup every extent perfectly down to its smallest possible
space usage.
It should be a nice additional option for btrfs that impacts the normal
routine as little as possible.
On the other hand, user space dedup can do all the hard work that kernel
dedup can't, like what you are doing currently: trying its best to reclaim
every bit of space.
So in this respect, I'm OK with the space waste of in-band deduplication if
the inband dedup is simple and fast.
[Liu Bo's patchset]
As Liu Bo has already submitted a V10 patchset, I think most of its
implementation is quite good and straightforward, and quite suitable for
kernel in-band dedup if it can be improved a little.
The workflow is quite easy to understand:
1. Hash (SHA256) the data at the granularity of dedup_size.
For data smaller than dedup_size, fall back to the normal routine.
2. If the hash matches, add the file extent item and backref.
No need to submit the bio.
3. If the hash does not match, add the hash to the dedup tree.
The bio submission/metadata modification is done by the dedup routine.
4. When an extent is freed, delete its hash from the dedup tree.
As for the space waste problem: since dedup is done at dedup_size (which can
be set to page size) and extents are created at that size, there is no space
waste, but the original large extents are broken into small ones even when
they will never have further duplicates.
It also adds a new dedup tree, plus a dedup ioctl to disable/enable it.
[My in-development dedup patchset]
I learned a lot from Liu Bo's patchset and mixed in some of my own understanding.
The workflow is somewhat like Liu's:
1. Hash (SHA256) the data at the unit of page size.
This only happens for uncompressed, non-inline data for now.
2. If the hash matches, add the file extent item and backref.
Skip the bio submission, just like Liu.
3. If the hash does not match, go through the original routine,
but with a hook that adds the hash into a dedup hash<->extent mapping tree.
4. Drop the least-recently-used hash<->extent mappings if there are too many.
5. Also add hash<->extent mappings at page read time.
6. Also remove all the hash<->extent mappings belonging to an extent when it
is freed.
As you can see, all hash<->extent mappings are stored only in memory,
so there is no new tree and not even a need for a new ioctl; a mount option
is enough to disable/enable dedup.
And large extents still stay large; they don't need to be broken into
smaller ones. Only a duplicated page becomes a small (4K) extent.
With a limited (tunable via mount option) number of in-memory mappings, the
read/write overhead can be bounded by a constant.
Also, the number of lines is small.
The current working version, without the code to parse the mount option (it
is just hard-coded for testing), adds only about 780 lines to do dedup.
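For illustration, each in-memory hash<->extent mapping entry could look
roughly like this (the structure and field names below are made up, not the
actual ones in the patch):
------
#include <linux/types.h>
#include <linux/rbtree.h>
#include <linux/list.h>

/* Hypothetical sketch only -- not the real patch code. */
struct dedup_hash_entry {
        u8 hash[32];                /* SHA-256 of one page of data           */
        u64 bytenr;                 /* start of the on-disk extent           */
        u64 offset;                 /* page offset inside that extent        */
        struct rb_node hash_node;   /* indexed by hash for write-time lookup */
        struct rb_node bytenr_node; /* indexed by bytenr so all mappings of  */
                                    /* an extent can be dropped when freed   */
        struct list_head lru;       /* evicted first when the pool is full   */
};
------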
On the other hand, the cost is obvious too: not every duplicated page
will be deduplicated, and it still can't solve the space waste problem.
But IMHO such an implementation can easily be improved further:
TODO list:
1. merge duplicated extent into larger one if possible.
For example:
Existing File 1:
0 4K 8K 12K
|AAAAAAA|BBBBBBB|CCCCCCC|
|<------Extent A------->|
New write into file2:
Current:
|AAAAAAA|BBBBBBB|
|-ExtB--|-ExtC--|
Ext B will point to extent A with offset 0, nr 4096.
Ext C will point to extent A with offset 4096, nr 4096.
TODO version:
|AAAAAAA|BBBBBBB|
|-----ExtD------|
Ext D will point to extent A with offset 0, nr 8192.
This will both reduce space waste and metadata size.
But like the implementation itself, such behavior won't always do its best
and may fall back to the old one (a rough sketch of such a merge check is
shown after this list).
2. Persist the dedup hash<->extent map.
As we already have the dedup ioctl number, in this implementation we can use
it to pin a dedup hash mapping in memory, outside of the LRU control.
This could hugely improve the flexibility of kernel inband dedup and move
dedup policy out of the kernel, letting applications handle things better
than the kernel can.
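A rough sketch of the merge check from TODO 1 (hypothetical names and
kernel-style types, not actual patch code):
------
/* Two consecutive dedup hits can be merged into one file extent item
 * when they point into the same on-disk extent and their extent
 * offsets are contiguous. */
struct dedup_hit {
        u64 bytenr;   /* on-disk extent start    */
        u64 offset;   /* offset into that extent */
        u64 len;      /* length of this hit      */
};

static bool can_merge(const struct dedup_hit *prev,
                      const struct dedup_hit *next)
{
        return prev->bytenr == next->bytenr &&
               prev->offset + prev->len == next->offset;
}
------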
>
> On a heavily defragmented filesystem the data copy required could be a
> little under 2GB for a single pair of cloned ranges (in the worst case
> of 4K cloned in two maximum-sized extents). Currently FILE_EXTENT_SAME
> refuses to do work involving more than 32MB of reads for a single
> extent pair. There's a pretty big gap between 32MB of reads and 4GB of
> reads _and_ writes. FILE_EXTENT_SAME also takes a vector of extents,
> which multiply the work.
So breaking up large extents is always needed; even in my implementation I
limit the largest uncompressed extent to 512K.
(For a 4K page size, 512K needs 128 hashes, each taking 32 bytes, so the
hashes fit in just one page.)
Thanks,
Qu
>
> It would be better if the kernel could manipulate the metadata directly to
> split extents. Dedup does work now with copies, but quite inefficiently.
>
>> Thanks,
>>
>> -liubo
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-20 2:24 ` Qu Wenruo
@ 2015-07-21 4:55 ` Zygo Blaxell
2015-07-21 6:52 ` Qu Wenruo
0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2015-07-21 4:55 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Liu Bo, btrfs
On Mon, Jul 20, 2015 at 10:24:38AM +0800, Qu Wenruo wrote:
> Zygo Blaxell wrote on 2015/07/19 03:23 -0400:
> But I'm a little considered about the facts that extents get quite small(4K)
> and the increasing number of backref/file extents may affect performance.
At the moment I just ignore any block that already has more than 100
references, since deduplicating those blocks can save at most 1% space,
and I'm afraid of finding out the hard way what happens if there's over
a million references to a single extent in btrfs. ;)
> [inband vs outband]
> I think kernel inband deduplication should be *fast*, *simple*, and *flex*.
> It doesn't need to be perfectly dedup every extent to its smallest space
> usage.
> It should be a quite nice additional option for btrfs, but only impacts the
> normal routine to minimal.
>
> On the other hand, user space dedup can do all the hard work that kernel
> dedup can't do, like what you do currently, try to do the best to reduce
> every space usage.
An in-band dedup can avoid some of these problems, especially if it
intercepts writes before they make it to disk. There is no need for
complicated on-disk-extent-splitting algorithms if we can avoid writing
extents that needed to be split in the first place.
> So from this respect, I'm OK with the space waste in in-band deduplication
> if the inband dedup is simple and fast.
My version of that rule is "don't compromise a working design to do
something that gives less than 1% benefit." ;)
> [Liu Bo's patchset]
> As Liu Bo already submit a V10 patchset, I think most of its implement is
> quite good and straightforward, quite suitable for kernel in band dedup if
> can be improved a little.
>
> The workflow is quite easy to understand:
> 1. Hash(SHA256) the data at the unit of dedup_size.
> For data smaller than dedup_size, fallback to normal routine.
> 2. If hash matches, add file extent item and backref.
> No need to submit the bio
> 3. If hash not match, adds the hash to dedup tree.
> The bio submit/metadata modification will be done by dedup routine.
> 4. When free a extent, deletes the hash from dedup tree.
>
> For the space waste problem, as dedup is done at dedup_size (can be set to
> page size) and extent is at that size, so no space waste problem, but break
> the original large extent into small ones even it will not have further
> duplicated one.
>
> And a new dedup tree with dedup ioctl to disable/enabled it.
>
> [My developing dedup patchset]
> Learned a lot from Liu Bo's patchset, mixed with some of my understanding.
>
> The workflow is somewhat like Liu's
> 1. Hash(SHA256) the data at the unit of page size.
> Only happens for uncompressed non-inline data yet.
> 2. If hash matches, add file extent item and backref.
> Skip bio submit just like Liu.
> 3. If hash not match, go through the orinal routine.
> But with hook to add hash into dedup hash<->extent mapping tree.
> 4. Remove Last Recent Use hash<->extent mapping if there is too many.
>
> 5. Also add hash<->extent mapping at read page time.
> 6. Also remove all the hash<->extent mapping belongs to the extent when
> freeing
>
> As you can see, all hash<->extent mapping is stored only in memory,
> so no new tree, even no need to use a new ioctl, just mount option is enough
> to disable/enable dedup.
> And large extents will still be large, don't need to be broken into smaller
> one. Only duplicated page will be small extent(4K)
>
> With limited (can be tuned with mount option) number of in memory mapping,
> overhead of read/write can be limited to a constant value.
>
> Also, the number of lines is small.
> Current working one without the code to parse mount option(just hard coded
> for testing yet) only adds about 780 lines to do dedup.
>
> On the other hand the cost is obvious too, not every duplicated page will be
> deduplicated, and still can't solve the space waste problem.
My approach is a little different. RAM usage is my limiting factor, so
I've selected data structures to minimize it, and traded against some
extra I/O.
I use a much smaller hash (40-50 bits, depending on the size of the
filesystem and expected duplicate rate) that is intentionally sized
to false-positive on about 0.1% of matching hashes--no more, no less.
The cost is having to read all possibly duplicate blocks to determine if
the data is really duplicate, and 0.1% of the time I get a false positive.
The benefit is that I can get more hash table coverage per unit of RAM,
so I can use a tiny machine to dedup a huge filesystem.
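A rough back-of-envelope of how the hash width falls out (the numbers below
are illustrative only):
------
#include <math.h>
#include <stdio.h>

int main(void)
{
        double blocks  = 25e12 / 4096;  /* ~6.1e9 4K blocks on a 25TB fs   */
        double fp_rate = 0.001;         /* target ~0.1% false positives    */

        /* want 2^bits >= blocks / fp_rate so ~0.1% of lookups collide */
        printf("bits: %.1f\n", log2(blocks / fp_rate)); /* ~42.5 bits */
        return 0;
}
------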
I use random drop instead of LRU on the hash table. Hashes of duplicate
blocks are ordered to expire after hashes of unique blocks, but hashes
within both groups expire in random order. When I find a pair of
duplicate blocks in two extents, I check the adjacent blocks to see if
they are also identical. If they are, I expand the duplicate block range
and repeat, resulting in a maximum-length sequence of duplicate blocks.
Typical duplicate files contain long sequences of duplicate blocks where
each sequence contains blocks that are mostly unique except when they
appear in the sequence (e.g. a JPEG photo will contain thousands of
blocks, each of which are only ever found in other copies of the same
JPEG photo). It's only necessary to find any one of those blocks in the
block hash table--the others will be found because they are adjacent
to the block in the hash table. In other words, for typical cases we
can just throw most of the block hash table away, keeping only a small
sample of the blocks.
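In rough pseudocode the extension step looks like this (illustrative only;
same_block() and BLOCK are placeholders, and EOF handling is omitted):
------
#include <sys/types.h>

/* same_block(fd_a, off_a, fd_b, off_b) is assumed to read one
 * BLOCK-sized chunk from each file and memcmp() them. */
static void extend_match(int fd_a, off_t *a_start,
                         int fd_b, off_t *b_start, off_t *len)
{
        while (*a_start >= BLOCK && *b_start >= BLOCK &&
               same_block(fd_a, *a_start - BLOCK, fd_b, *b_start - BLOCK)) {
                *a_start -= BLOCK;
                *b_start -= BLOCK;
                *len += BLOCK;
        }
        while (same_block(fd_a, *a_start + *len, fd_b, *b_start + *len))
                *len += BLOCK;
}
------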
These two techniques save a _lot_ of RAM compared to approaches based
on large, collision-free hashes and LRU. I can intentionally undersize
the block hash table's RAM allocation by multiple orders of magnitude
and still find 99% of duplicate blocks on a filesystem.
In my case I'm *really* starved for RAM--I push 25TB of data through a
10GB hash table with a 45% duplicate rate. That's only 10% of the RAM
I would require for full hash table coverage of the filesystem.
> But IMHO such implement can be improved further easily:
> TODO list:
> 1. merge duplicated extent into larger one if possible.
> For example:
> Existing File 1:
> 0 4K 8K 12K
> |AAAAAAA|BBBBBBB|CCCCCCC|
>
> |<------Extent A------->|
> New write into file2:
> Current:
> |AAAAAAA|BBBBBBB|
> |-ExtB--|-ExtC--|
>
> Ext B will point to extent A with offset 0, nr 4096.
> Ext C will point to extent A with offset 4096, nr 4096.
>
> TODO version:
> |AAAAAAA|BBBBBBB|
> |-----ExtD------|
> Ext D will point to extent A with offset 0, nr 8192.
My experiments with this so far found some nasty feedback loops with
extent coalescing, where we coalesce AAA and BBB together into AAABBB,
then split them apart again later on when we discover files with AAACCC
or BBBDDD sequences.
I find that once an extent has been split for dedup, the best thing to
do is to leave it alone. In the vast majority of cases such extents are
split into smaller pieces until the extent edges match some logical
boundary in the data, but then they remain stable after that.
E.g. if several files are copied from a filesystem into a VM filesystem
image, the VM image will be broken up into extents that match the
boundaries of the files. Once that is done, there will typically be no
further fragmentation as any further duplicates of these extents will
be more copies of the same files--and therefore have the same length.
Another copy of the VM image just gets broken into the same sized
extents as the first copy.
There are examples where this is not true, but they tend to be really
bad use cases for deduplication. e.g. a database file might have lots
of small duplicate fragments--but typically one wants a database file
to be nocow and defragmented, and the gains from these operations far
outweigh the benefits (and costs!) of allowing dedup to touch the file
at all.
If many small consecutive extents can be *proven* unique, they might be
worth defragmenting. Maybe.
> This will both reduce space waste and metadata size.
> But like the implement itself, such behavior won't always do it best, and
> may fallback to the old one.
>
> 2. Persist dedup hash<->extent map.
> As we already have the dedup ioctl number, in this implement, we can use it
> to pin a dedup hash mapping into memory and out of the LRU control.
> This could hugely improve the flexibility of kernel inband dedup, and put
> the dedup policy out of kernel, making application handle things better than
> kernel.
My persistent hash table is just a copy of the in-memory hash table (or
mmap the table as a file on disk). The in-memory hash table is just a
giant array of (hash, physical) tuples arranged to encode random-drop
information and some of the hash bits in the address of each table entry.
It is fixed-size. The administrator decides how much RAM to devote to
dedup, and once the table is loaded the RAM usage is constant.
On the other hand, my design has trade-offs: because each table entry
is a hash and a physical block, and there's no synchronization with the
kernel, balancing the filesystem effectively destroys the hash table. :-P
> >On a heavily defragmented filesystem the data copy required could be a
> >little under 2GB for a single pair of cloned ranges (in the worst case
> >of 4K cloned in two maximum-sized extents). Currently FILE_EXTENT_SAME
> >refuses to do work involving more than 32MB of reads for a single
> >extent pair. There's a pretty big gap between 32MB of reads and 4GB of
> >reads _and_ writes. FILE_EXTENT_SAME also takes a vector of extents,
> >which multiply the work.
> So breaking large extent is always needed, even in my implement, I limit the
> largest uncompressed extent to be 512K
> (For 4K page size, 512K needs 128 hashes, at each takes 32bytes, hash will
> be just 1 page).
Huge! ;)
I have the same 4K page size, but even with a fully populated hash table
(all blocks on disk indexed) I use only 1712 bytes per 512K--including
the block pointers and indexing overhead. At about 157 bytes per 512K
(the amount of RAM I have to run dedup) the dedup still covers 92%
of duplicate extents with length 64K, with less coverage for smaller
extents and more coverage for larger extents. The size distribution
of typical filesystems means I get 99% of all duplicate blocks.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-21 4:55 ` Zygo Blaxell
@ 2015-07-21 6:52 ` Qu Wenruo
2015-07-21 22:14 ` Zygo Blaxell
0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-07-21 6:52 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Liu Bo, btrfs
Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
> On Mon, Jul 20, 2015 at 10:24:38AM +0800, Qu Wenruo wrote:
>> Zygo Blaxell wrote on 2015/07/19 03:23 -0400:
>> But I'm a little considered about the facts that extents get quite small(4K)
>> and the increasing number of backref/file extents may affect performance.
>
> At the moment I just ignore any block that already has more than 100
> references, since deduplicating those blocks can save at most 1% space,
> and I'm afraid of finding out the hard way what happens if there's over
> a million references to a single extent in btrfs. ;)
>
>> [inband vs outband]
>> I think kernel inband deduplication should be *fast*, *simple*, and *flex*.
>> It doesn't need to be perfectly dedup every extent to its smallest space
>> usage.
>> It should be a quite nice additional option for btrfs, but only impacts the
>> normal routine to minimal.
>>
>> On the other hand, user space dedup can do all the hard work that kernel
>> dedup can't do, like what you do currently, try to do the best to reduce
>> every space usage.
>
> An in-band dedup can avoid some of these problems, especially if it
> intercepts writes before they make it to disk. There is no need for
> complicated on-disk-extent-splitting algorithms if we can avoid writing
> extents that needed to be split in the first place.
Yes, that would be best, but it can get very tricky when it comes to the
details.
For example, if inband dedup finds that a write duplicates a page that is
already written to disk, what should it do?
The data already on disk may have been written by an old kernel without
inband dedup support.
>
>> So from this respect, I'm OK with the space waste in in-band deduplication
>> if the inband dedup is simple and fast.
>
> My version of that rule is "don't compromise a working design to do
> something that gives less than 1% benefit." ;)
>
>> [Liu Bo's patchset]
>> As Liu Bo already submit a V10 patchset, I think most of its implement is
>> quite good and straightforward, quite suitable for kernel in band dedup if
>> can be improved a little.
>>
>> The workflow is quite easy to understand:
>> 1. Hash(SHA256) the data at the unit of dedup_size.
>> For data smaller than dedup_size, fallback to normal routine.
>> 2. If hash matches, add file extent item and backref.
>> No need to submit the bio
>> 3. If hash not match, adds the hash to dedup tree.
>> The bio submit/metadata modification will be done by dedup routine.
>> 4. When free a extent, deletes the hash from dedup tree.
>>
>> For the space waste problem, as dedup is done at dedup_size (can be set to
>> page size) and extent is at that size, so no space waste problem, but break
>> the original large extent into small ones even it will not have further
>> duplicated one.
>>
>> And a new dedup tree with dedup ioctl to disable/enabled it.
>>
>> [My developing dedup patchset]
>> Learned a lot from Liu Bo's patchset, mixed with some of my understanding.
>>
>> The workflow is somewhat like Liu's
>> 1. Hash(SHA256) the data at the unit of page size.
>> Only happens for uncompressed non-inline data yet.
>> 2. If hash matches, add file extent item and backref.
>> Skip bio submit just like Liu.
>> 3. If hash not match, go through the orinal routine.
>> But with hook to add hash into dedup hash<->extent mapping tree.
>> 4. Remove Last Recent Use hash<->extent mapping if there is too many.
>>
>> 5. Also add hash<->extent mapping at read page time.
>> 6. Also remove all the hash<->extent mapping belongs to the extent when
>> freeing
>>
>> As you can see, all hash<->extent mapping is stored only in memory,
>> so no new tree, even no need to use a new ioctl, just mount option is enough
>> to disable/enable dedup.
>> And large extents will still be large, don't need to be broken into smaller
>> one. Only duplicated page will be small extent(4K)
>>
>> With limited (can be tuned with mount option) number of in memory mapping,
>> overhead of read/write can be limited to a constant value.
>>
>> Also, the number of lines is small.
>> Current working one without the code to parse mount option(just hard coded
>> for testing yet) only adds about 780 lines to do dedup.
>>
>> On the other hand the cost is obvious too, not every duplicated page will be
>> deduplicated, and still can't solve the space waste problem.
>
> My approach is a little different. RAM usage is my limiting factor, so
> I've selected data structures to minimize it, and traded against some
> extra I/O.
>
> I use a much smaller hash (40-50 bits, depending on the size of the
> filesystem and expected duplicate rate) that is intentionally sized
> to false-positive on about 0.1% of matching hashes--no more, no less.
> The cost is having to read all possibly duplicate blocks to determine if
> the data is really duplicate, and 0.1% of the time I get a false positive.
> The benefit is that I can get more hash table coverage per unit of RAM,
> so I can use a tiny machine to dedup a huge filesystem.
I tried to use CRC32 as the hash instead of SHA256, since it could be reused
as the checksum.
But I found things get tricky if I need to compare page by page to
determine whether the data is really the same.
That means you may have to start a lot of new reads just to dedup one new
write, which not only impacts performance but also makes a mess of the
original read routine.
And that turned out to be much harder than I had expected when it came to
coding.
So I chose SHA256, as that skips the hard-to-code compare parts.
>
> I use random drop instead of LRU on the hash table. Hashes of duplicate
> blocks are ordered to expire after hashes of unique blocks, but hashes
> within both groups expire in random order. When I find a pair of
> duplicate blocks in two extents, I check the adjacent blocks to see if
> they are also identical. If they are, I expand the duplicate block range
> and repeat, resulting in a maximum-length sequence of duplicate blocks.
> Typical duplicate files contain long sequences of duplicate blocks where
> each sequence contains blocks that are mostly unique except when they
> appear in the sequence (e.g. a JPEG photo will contain thousands of
> blocks, each of which are only ever found in other copies of the same
> JPEG photo). It's only necessary to find any one of those blocks in the
> block hash table--the others will be found because they are adjacent
> to the block in the hash table. In other words, for typical cases we
> can just throw most of the block hash table away, keeping only a small
> sample of the blocks.
The adjacent-blocks theory is completely right and makes sense, but I
think this idea is trading I/O for memory usage.
The problem is still the same: at write time, you still need to read
unrelated (although potentially duplicate) contents to compare or to do the
adjacency check.
If that all happens in user space, I'm completely OK with it.
The user knows the idea/mechanism/behavior of it, so the user chooses to use it.
But when it comes to the kernel, any write will probably cause not only a
write bio but also a read bio first, turning an originally pure write
sequence into a mixed read/write one and hurting performance.
IMHO, any implementation that involves re-reading at write time is just
trading I/O for memory.
If I'm wrong, please point it out.
In a typical case, with your method, the table may store only one sample
block of a sequence.
But when the hash of a new write hits the hash of that sample block, the
adjacent blocks need to be read out to test whether they match.
The result would be: you need to write 16 pages of original data, but dedup
first needs to do 15 page reads (the sample one is already in memory).
That would be OK for an SSD, as reads are faster than writes, but for an HDD
read speed is the same as write speed, so I/O time is not really saved, only
disk space.
So unfortunately, I'm not a big fan of an approach that needs to read before
it really writes.
>
> These two techniques save a _lot_ of RAM compared to approaches based
> on large, collision-free hashes and LRU. I can intentionally undersize
> the block hash table's RAM allocation by multiple orders of magnitude
> and still find 99% of duplicate blocks on a filesystem.
>
> In my case I'm *really* starved for RAM--I push 25TB of data through a
> 10GB hash table with a 45% duplicate rate. That's only 10% of the RAM
> I would require for full hash table coverage coverage of the filesystem.
>
I'm a little curious about the read I/O.
How much read I/O does it generate?
>> But IMHO such implement can be improved further easily:
>> TODO list:
>> 1. merge duplicated extent into larger one if possible.
>> For example:
>> Existing File 1:
>> 0 4K 8K 12K
>> |AAAAAAA|BBBBBBB|CCCCCCC|
>>
>> |<------Extent A------->|
>> New write into file2:
>> Current:
>> |AAAAAAA|BBBBBBB|
>> |-ExtB--|-ExtC--|
>>
>> Ext B will point to extent A with offset 0, nr 4096.
>> Ext C will point to extent A with offset 4096, nr 4096.
>>
>> TODO version:
>> |AAAAAAA|BBBBBBB|
>> |-----ExtD------|
>> Ext D will point to extent A with offset 0, nr 8192.
>
> My experiments with this so far found some nasty feedback loops with
> extent coalescing, where we coalesce AAA and BBB together into AAABBB,
> then split them apart again later on when we discover files with AAACCC
> or BBBDDD sequences.
>
> I find that once an extent has been split for dedup, the best thing to
> do is to leave it alone. In the vast majority of cases such extents are
> split into smaller pieces until the extent edges match some logical
> boundary in the data, but then they remain stable after that.
>
> E.g. if several files are copied from a filesystem into a VM filesystem
> image, the VM image will be broken up into extents that match the
> boundaries of the files. Once that is done, there will typically be no
> further fragmentation as any further duplicates of these extents will
> be more copies of the same files--and therefore have the same length.
> Another copy of the VM image just gets broken into the same sized
> extents as the first copy.
>
> There are examples where this is not true, but they tend to be really
> bad use cases for deduplication. e.g. a database file might have lots
> of small duplicate fragments--but typically one wants a database file
> to be nocow and defragmented, and the gains from these operations far
> outweigh the benefits (and costs!) of allowing dedup to touch the file
> at all.
>
> If many small consecutive extents can be *proven* unique, they might be
> worth defragmenting. Maybe.
Thanks a lot for the advice.
But maybe you got me wrong: in my implementation, no extent is split, so
extent A in the example always stays a whole extent.
Extent B is just a new reference to it, using part of it.
One debug-tree output from my already-running test may explain it better:
------
Extent tree:
item 4 key (12845056 EXTENT_ITEM 16384) itemoff 15989 itemsize 169
extent refs 5 gen 6 flags DATA
extent data backref root 5 objectid 257 offset 24576
count 1
extent data backref root 5 objectid 257 offset 16384
count 1
extent data backref root 5 objectid 257 offset 0 count 1
extent data backref root 5 objectid 257 offset 20480
count 1
extent data backref root 5 objectid 257 offset 12288
count 1
Fs tree:
# the first 16K is written as a large extent.
item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0
# the rest are all deduped.
item 7 key (257 EXTENT_DATA 16384) itemoff 15760 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 4096 nr 4096 ram 16384
extent compression 0
item 8 key (257 EXTENT_DATA 20480) itemoff 15707 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 4096 nr 4096 ram 16384
extent compression 0
item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 4096 nr 4096 ram 16384
extent compression 0
item 10 key (257 EXTENT_DATA 28672) itemoff 15601 itemsize 53
extent data disk byte 12845056 nr 16384
extent data offset 4096 nr 4096 ram 16384
extent compression 0
------
Items 7~10 just point to one part of the bigger extent; nothing is split.
So merging would be quite nice to reduce the number of backrefs and file
extent items.
Even if it can't extend the range, no split will happen anyway.
>
>> This will both reduce space waste and metadata size.
>> But like the implement itself, such behavior won't always do it best, and
>> may fallback to the old one.
>>
>> 2. Persist dedup hash<->extent map.
>> As we already have the dedup ioctl number, in this implement, we can use it
>> to pin a dedup hash mapping into memory and out of the LRU control.
>> This could hugely improve the flexibility of kernel inband dedup, and put
>> the dedup policy out of kernel, making application handle things better than
>> kernel.
>
> My persistent hash table is just a copy of the in-memory hash table (or
> mmap the table as a file on disk). The in-memory hash table is just a
> giant array of (hash, physical) tuples arranged to encode random-drop
> information and some of the hash bits in the address of each table entry.
> It is fixed-size. The administrator decides how much RAM to devote to
> dedup, and once the table is loaded the RAM usage is constant.
The persistent hash map is just for special use cases, e.g. when a sysadmin
knows some files are likely to contain duplicates and pins their hash
mappings in memory to improve the dedup rate.
It also moves the responsibility for OOM or other problems to the sysadmin. :)
Keep only the simplest policy in the kernel; if users are not happy with it,
they can implement a better policy in user space.
But this is just a TODO, with no goal or milestone yet.
>
> On the other hand, my design has trade-offs: because each table entry
> is a hash and a physical block, and there's no synchronization with the
> kernel, balancing the filesystem effectively destroys the hash table. :-P
>
>>> On a heavily defragmented filesystem the data copy required could be a
>>> little under 2GB for a single pair of cloned ranges (in the worst case
>>> of 4K cloned in two maximum-sized extents). Currently FILE_EXTENT_SAME
>>> refuses to do work involving more than 32MB of reads for a single
>>> extent pair. There's a pretty big gap between 32MB of reads and 4GB of
>>> reads _and_ writes. FILE_EXTENT_SAME also takes a vector of extents,
>>> which multiply the work.
>> So breaking large extent is always needed, even in my implement, I limit the
>> largest uncompressed extent to be 512K
>> (For 4K page size, 512K needs 128 hashes, at each takes 32bytes, hash will
>> be just 1 page).
>
> Huge! ;)
>
> I have the same 4K page size, but even with a fully populated hash table
> (all blocks on disk indexed) I use only 1712 bytes per 512K--including
> the block pointers and indexing overhead.
Did you mean 1712 bytes per dedup unit? Or is 1712 for the several dedup
units that cover the 512K? 1712 can't be divided evenly by 128...
In fact, I can reduce the space usage by 'just' (not so easy to call it
just, anyway) changing the dedup size to 512K, i.e. one btrfs_dedup_hash for
512K instead of the original 128 dedup hashes for 512K.
That's also why Liu Bo uses a tunable dedup_size.
Anyway, my dedup_hash struct is really huge...
It's 120 bytes with all the overhead for one btrfs_dedup_hash (and that only
covers one page).
> At about 157 bytes per 512K
> (the amount of RAM I have to run dedup) the dedup still covers 92%
> of duplicate extents with length 64K, with less coverage for smaller
> extents and more coverage for larger extents. The size distribution
> of typical filesystems means I get 99% of all duplicate blocks.
>
Pretty impressive, but I'm still a little worried about the code
complexity, especially when it needs to go into the kernel.
Lastly, thanks for your impressive dedup ideas and test data.
Thanks,
Qu
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-21 6:52 ` Qu Wenruo
@ 2015-07-21 22:14 ` Zygo Blaxell
2015-07-22 1:49 ` Discuss on inband dedup implement (Original "strange data backref offset") Qu Wenruo
0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2015-07-21 22:14 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Liu Bo, btrfs
On Tue, Jul 21, 2015 at 02:52:38PM +0800, Qu Wenruo wrote:
> Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
> >An in-band dedup can avoid some of these problems, especially if it
> >intercepts writes before they make it to disk. There is no need for
> >complicated on-disk-extent-splitting algorithms if we can avoid writing
> >extents that needed to be split in the first place.
> Yes, that would be best, but it can get very tricky when things comes to
> detail.
>
> For example, if inband dedup finds the write is duplicated with one page
> that is already written into disk, what it should do?
Replace all the duplicate blocks with refs to existing data on disk
(making partial references to existing extents) and write the rest to
disk the usual way. I think that's what you're already doing. ;)
In my dedup I divide extent match candidates into "perfect" and
"imperfect" cases. A perfect match has the boundaries of duplicate and
unique block sequences aligned to the boundaries of the extents, so no
extent splitting is required and no waste accumulates. An imperfect match
requires splitting one or more extents in order to create perfect matches.
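In code terms the distinction is roughly this (a simplified sketch with
placeholder names; a real check would also accept a duplicate range spanning
several whole extents):
------
/* Kernel-style u64/bool assumed.  "Perfect": the duplicate range covers
 * the extent exactly on both sides, so nothing needs to be split. */
static bool perfect_match(u64 dup_len,
                          u64 src_offset_in_extent, u64 src_extent_len,
                          u64 dst_offset_in_extent, u64 dst_extent_len)
{
        return src_offset_in_extent == 0 && dup_len == src_extent_len &&
               dst_offset_in_extent == 0 && dup_len == dst_extent_len;
}
------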
It's reasonable to run dedup in a mode where imperfect matches are
disabled, so only perfect matches are used for dedup. This still gets
a lot of duplicated data, it doesn't waste space, and it's _much_ faster
when the data includes a lot of big uncompressed extents. It works well
on a development machine with lots of similar source trees checked out,
or many chroot build environments populated with similar binaries.
Unfortunately it doesn't work at all when there are big defragmented
VM images with copies of filesystem files. Sometimes this tradeoff is
worth it, sometimes it's not. It depends on the data and it's just
one if statement to turn it on or off.
It's also possible to operate in a mode where the distinction between
perfect and imperfect matches is ignored. This is also fast but it
wastes space. How much space is wasted depends on the writing pattern
on the disk. It doesn't seem to waste more space than btrfs already
does when random writes occur over a large contiguous extent--and if
that problem can be solved, it can be solved for dedup the same way.
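In case it helps to see it spelled out, here is a toy sketch of that
classification -- not my daemon's code, just the boundary-alignment idea
over an in-memory extent map (the real check is done in both files):
------
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model: 'starts' is the sorted list of extent start offsets in one
 * file.  A duplicate range [start, end) is a "perfect" match in that file
 * if both ends land on an extent boundary (or EOF), so deduping it needs
 * no extent splitting.  Otherwise it's "imperfect". */
static bool on_boundary(uint64_t off, const uint64_t *starts, int n,
			uint64_t file_size)
{
	if (off == file_size)
		return true;
	for (int i = 0; i < n; i++)
		if (starts[i] == off)
			return true;
	return false;
}

static bool perfect_match(uint64_t start, uint64_t end,
			  const uint64_t *starts, int n, uint64_t file_size)
{
	return on_boundary(start, starts, n, file_size) &&
	       on_boundary(end, starts, n, file_size);
}

int main(void)
{
	uint64_t starts[] = { 0 };   /* one 16K extent at offset 0 */

	/* whole extent duplicated: perfect (prints 1) */
	printf("%d\n", perfect_match(0, 16384, starts, 1, 16384));
	/* only the middle 4K duplicated: imperfect (prints 0) */
	printf("%d\n", perfect_match(4096, 8192, starts, 1, 16384));
	return 0;
}
------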
> The already on disk one maybe written by old kernels without inband dedup
> support.
I don't think it matters if the blocks were written with or without
dedup support before (I've never had to make that distinction in code).
If there is a persistent hash table then that table might be missing data
from an old kernel, but if we're just hashing blocks as they appear in
RAM then no kernel other than the currently running one matters.
> >I use a much smaller hash (40-50 bits, depending on the size of the
> >filesystem and expected duplicate rate) that is intentionally sized
> >to false-positive on about 0.1% of matching hashes--no more, no less.
> >The cost is having to read all possibly duplicate blocks to determine if
> >the data is really duplicate, and 0.1% of the time I get a false positive.
> >The benefit is that I can get more hash table coverage per unit of RAM,
> >so I can use a tiny machine to dedup a huge filesystem.
> I tried to use CRC32 as hash other than SHA256, as it can reuse as checksum.
>
> But I found things get tricky if I needs to compare page by page to
> determine if they are really the same.
> Which means you may start a lot of new read just to dedup one new write,
> not only impact the performance but also make the original read routine to
> be a mess.
> And that's more harder than my previous expect when it comes to coding.
I looked at CRC32 because it would have been really nice to reuse the
existing checksums...but it's too small. The maximum filesystem size for a
32-bit hash and a 0.1% FP rate is about 32MB.
> So I choose SHA256 as that's will skip the hard to code compare parts.
There's already a read-and-compare in the extent-same ioctl. :-/
> >I use random drop instead of LRU on the hash table. Hashes of duplicate
> >blocks are ordered to expire after hashes of unique blocks, but hashes
> >within both groups expire in random order. When I find a pair of
> >duplicate blocks in two extents, I check the adjacent blocks to see if
> >they are also identical. If they are, I expand the duplicate block range
> >and repeat, resulting in a maximum-length sequence of duplicate blocks.
> >Typical duplicate files contain long sequences of duplicate blocks where
> >each sequence contains blocks that are mostly unique except when they
> >appear in the sequence (e.g. a JPEG photo will contain thousands of
> >blocks, each of which are only ever found in other copies of the same
> >JPEG photo). It's only necessary to find any one of those blocks in the
> >block hash table--the others will be found because they are adjacent
> >to the block in the hash table. In other words, for typical cases we
> >can just throw most of the block hash table away, keeping only a small
> >sample of the blocks.
> The adjust blocks theory is completely right and makes sense, but I think
> this idea is trading IO for memory usage.
>
> The problem is still the same, when write, you still needs to read
> unrelated(although potential duplicated) contents to compare or do
> adjustment check.
>
> If that's all happens at user space, I'm completely OK with it.
> User knows the idea/mechanism/behavior of it, so user choose to use it.
>
> But when it comes to kernel, any write will probably cause not only write
> bio, but also read bio first, making original pure write sequence into rw
> mixed one, impacting the performance.
>
> IMHO, any implement involved re-read at write time, is just trading I/O for
> memory.
> If I'm wrong, please point it out.
That's precisely the tradeoff I made: up to 0.1% more I/O for unique
blocks, and duplicate writes would become reads instead (assuming I moved
my dedup daemon into the kernel). There is some extra seeking on writes
to confirm duplicate pages--but when they are confirmed, the writes are
no longer necessary and their I/O can be dropped.
The payoff is that it requires 100 times less RAM, so it can coexist
with other things on a medium-sized machine.
> In typical case, with your method, it can restore one example blocks.
> But when a hash of first write hits the hash of example blocks, adjusted
> blocks needs to be read out to test if they matches.
> The result would be, you need to write 16 pages original, but dedup needs to
> do 15(the example one is in memory already) pages read first.
>
> That would be OK for SSD as it reads are faster than write, but for HDD,
> read speed is the same with write, so IO time is not really saved, only disk
> space usage.
>
> So unfortunately, I'm not a big fan of the idea that needs read before
> really write.
If you insist on doing the ZFS trick where duplicate writes never touch
the disk at all, you just need big hashes and absurd quantities of RAM.
I'm trying to make live dedup practical on much smaller systems.
> >These two techniques save a _lot_ of RAM compared to approaches based
> >on large, collision-free hashes and LRU. I can intentionally undersize
> >the block hash table's RAM allocation by multiple orders of magnitude
> >and still find 99% of duplicate blocks on a filesystem.
> >
> >In my case I'm *really* starved for RAM--I push 25TB of data through a
> >10GB hash table with a 45% duplicate rate. That's only 10% of the RAM
> >I would require for full hash table coverage coverage of the filesystem.
> >
> I'm a little interesting about the read I/O.
> How much read I/O does it start?
For each new block added to the filesystem, calculate its hash and
(if known) its physical address. Usually no read is required because
we're talking about a block already in memory.
If the block hash is in the table, but the table has the same physical
address for the block (in case of multiple entries, any matching entry
will suffice) the block is already deduped and we are done (this
would only be relevant to an in-kernel implementation if it wanted to
opportunistically dedup on reads).
Next we go through all the hash table entries at different physical
addresses from the input block. We read the block listed in the hash
table (on average, we do this twice 0.1% of the time, and 3 times 0.0001%
of the time) and compare it to the new block. If all the blocks
are different, our new block is unique, so we add it to the hash table
(possibly forcing allocation so it has a known physical address) and
we are done. By this point we've read an extra 4.004K on average for
duplicate blocks, and just 0.004K for unique blocks.
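Stripped of all the btrfs plumbing, that lookup-and-verify step is roughly
the following toy, in-memory sketch (not my real code; the hash function
and table layout here are placeholders):
------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model: the "disk" is an array of 4K blocks in memory and the block
 * hash table stores a truncated hash plus a block number.  False positives
 * from the short hash are caught by the memcmp, at the cost of reading one
 * candidate block. */
#define BLK 4096
#define HASH_BITS 44
#define HASH_MASK ((UINT64_C(1) << HASH_BITS) - 1)

static uint8_t disk[4][BLK];                 /* pretend physical blocks */

struct entry { uint64_t hash; uint64_t blk; int used; };

static uint64_t toy_hash(const uint8_t *b)   /* stand-in for a real hash */
{
	uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a, then truncated */
	for (int i = 0; i < BLK; i++) { h ^= b[i]; h *= 0x100000001b3ULL; }
	return h & HASH_MASK;
}

/* Return the block number of a verified duplicate of 'data', or -1. */
static int find_duplicate(const struct entry *tab, int n, const uint8_t *data)
{
	uint64_t h = toy_hash(data);

	for (int i = 0; i < n; i++)
		if (tab[i].used && tab[i].hash == h &&
		    memcmp(disk[tab[i].blk], data, BLK) == 0)
			return (int)tab[i].blk;
	return -1;   /* unique: caller adds it to the table instead */
}

int main(void)
{
	struct entry tab[4] = { { 0 } };
	uint8_t newblk[BLK] = { 0 };         /* same contents as block 0 */

	tab[0] = (struct entry){ toy_hash(disk[0]), 0, 1 };
	printf("duplicate of block %d\n", find_duplicate(tab, 4, newblk));
	return 0;
}
------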
If we have a pair of identical blocks in different physical locations,
we read forward and backward from both until we hit some reason to
stop (including: perfect match found (all the extent boundaries match
exactly), different block data, end or beginning of one of the files,
exceeding the 16MB work limit of extent-same, I/O errors, etc). This
can read up to 16x2 = 32MB, although it is likely that half of those
reads were already in the page cache. If we have a perfect match at
this point we have all the arguments required for extent-same, so we
can call that and we are done. Extent-same will do no I/O since the
data was very recently read into cache.
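The forward/backward extension itself is simple; a toy version over two
in-memory buffers (the real thing obviously reads from the files and
respects the 16MB extent-same limit):
------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLK 4096

/* Given one matching block pair (a[ai] == b[bi]), grow it to the longest
 * run of equal blocks in both directions, bounded by the start/end of
 * each buffer.  Returns the run length in blocks and the start indices. */
static size_t grow_match(const uint8_t *a, size_t a_blocks, size_t ai,
			 const uint8_t *b, size_t b_blocks, size_t bi,
			 size_t *a_start, size_t *b_start)
{
	size_t lo = 0, hi = 0;

	while (ai > lo && bi > lo &&
	       !memcmp(a + (ai - lo - 1) * BLK, b + (bi - lo - 1) * BLK, BLK))
		lo++;
	while (ai + hi + 1 < a_blocks && bi + hi + 1 < b_blocks &&
	       !memcmp(a + (ai + hi + 1) * BLK, b + (bi + hi + 1) * BLK, BLK))
		hi++;

	*a_start = ai - lo;
	*b_start = bi - lo;
	return lo + 1 + hi;
}

int main(void)
{
	static uint8_t a[4 * BLK], b[4 * BLK];
	size_t as, bs;

	memset(a + 3 * BLK, 0xff, BLK);      /* make the last blocks differ */
	size_t len = grow_match(a, 4, 1, b, 4, 1, &as, &bs);
	printf("run: %zu blocks at a[%zu], b[%zu]\n", len, as, bs); /* 3,0,0 */
	return 0;
}
------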
Note that when we read ahead from a matching block, we are also processing
later new blocks entering the filesystem. Typically when these blocks are
duplicate it means that we have deleted them before the hash calculator
gets to process them, so some of the I/O we spent above is reclaimed here.
If we found only imperfect matches, we keep searching (up to some
arbitrary limit, like 10) other files containing the physical block,
repeating the steps above. This means we are now reading up to 16MB
times whatever the limit is. If we find a perfect match at any point
we handle it as above.
If we found only imperfect matches, we do extent-splitting on the best
imperfect match found. Extent-splitting requires up to 2GB of reads
and 2GB of writes in the worst possible case. If this is too much I/O
we can stop just before here and create a new block hash table entry
as if the block was a non-duplicate.
When extents are split by copying, they show up as new writes which
are immediately fed back into dedup. This repeats until the extents
(and all their references) are fully fragmented into runs of unique and
duplicate blocks, and the duplicate blocks are deduped. The number
of passes required depends on the history of previous deduplication,
defrag, snapshots, and whatever optimizations were implemented in the
dedup engine.
> The persist hash map is just for special use case, like system admin knows
> some files are easy to have duplication, then pin the hash map into memory
> to improve dedup rate.
>
> And to move the responsibility of OOM or other problem to sysadmins. :)
> Only use the simplest policy in kernel, if user not happy with it,
> then implement a good policy in user space.
>
> But this is just a TODO, no goal or milestone yet.
My target environment is a file server device where up to 75% of the
system RAM might be devoted to dedup. It is not something that is
intended to run unnoticed on a smartphone.
> >I have the same 4K page size, but even with a fully populated hash table
> >(all blocks on disk indexed) I use only 1712 bytes per 512K--including
> >the block pointers and indexing overhead.
> Did you mean 1712 bytes for a dedup unit? Or 1712 is for several dedup unit
> that covers the 512K? As 1712 can't be dived by 128...
My hash table entries look like this:
64 bit hash
AAAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
64 bit physical address
PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPZZZZZZZZZZZZ
The 21 "A" bits are all the same within a page--I have 2^21 pages in
my test configuration (8GB). They don't need to be stored inside the
hash table entry since they can be inferred from the address of the hash
table entry. Similarly the last 12 "Z" bits are always zero in physical
addresses because of the 4K block size, so I don't store those either.
That means my hash table entries are actually 64 + 64 - 21 - 12 = 95
bits long. On a 16TB filesystem with 4KB block size every single bit
in a hash table entry requires another 512MB of RAM, so I pack them into
pages without aligning them to byte boundaries.
For some time I was experimenting with making the hash table entries
even smaller (variable-length encodings and LZMA compressing each page)
and I got as low as 41 bits per entry with a high CPU overhead. Later I
made the observation about adjacent blocks and random drop in the block
hash table which allowed much larger RAM savings without as much CPU
overhead, so my hash table entries are now much larger than they were
in earlier versions of my code.
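To tie the numbers together, here is the arithmetic as a quick check
(assuming the 95-bit entries above; the gap up to 1712 is the block
pointers and indexing overhead I mentioned):
------
#include <stdio.h>

int main(void)
{
	/* 64-bit hash + 64-bit physical address, minus the 21 inferable
	 * "A" bits and the 12 always-zero "Z" bits. */
	unsigned bits = 64 + 64 - 21 - 12;                      /* 95 */

	/* Fully populated table, 4K blocks: 128 entries per 512K of data. */
	unsigned raw = 128 * bits / 8;                          /* 1520 */
	printf("packed entries per 512K: %u bytes (+ %u bytes of\n"
	       "pointers/indexing overhead = 1712)\n", raw, 1712 - raw);

	/* One extra bit per entry on 16TB of 4K blocks costs 2^32 bits. */
	printf("one bit per entry on 16TB/4K: %llu MB\n",
	       (1ULL << 32) / 8 / (1024 * 1024));
	return 0;
}
------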
> In fact, I can reduce the space usage by just(not so easy to call it just
> anyway) change dedup size to 512K. So just one btrfs_dedup_hash for 512K
> other than original 128 dedup_hash for 512K.
> That's also why Liu Bo uses tunable dedup_size.
Anyone can increase the block size to make their dedup faster. We lose
100% of all duplicate extents that are smaller than (or just not aligned
to) the block size.
> Anyway, my dedup_hash struct is really huge...
> It's 120 bytes with all overhead for one btrfs_dedup_hash (and one page
> yet).
I'm told ZFS is closer to 320. ;)
> > At about 157 bytes per 512K
> >(the amount of RAM I have to run dedup) the dedup still covers 92%
> >of duplicate extents with length 64K, with less coverage for smaller
> >extents and more coverage for larger extents. The size distribution
> >of typical filesystems means I get 99% of all duplicate blocks.
> >
> Pretty impressive, but still a little worried about the code complexibility
> especially when it needs to go into kernel.
>
> At last, thanks for your impressive dedup ideas and test data.
>
> Thanks,
> Qu
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: Discuss on inband dedup implement (Original "strange data backref offset")
2015-07-21 22:14 ` Zygo Blaxell
@ 2015-07-22 1:49 ` Qu Wenruo
2015-07-22 3:49 ` Zygo Blaxell
0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2015-07-22 1:49 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Liu Bo, btrfs
Changing the subject to reflect the core of the conversation.
Zygo Blaxell wrote on 2015/07/21 18:14 -0400:
> On Tue, Jul 21, 2015 at 02:52:38PM +0800, Qu Wenruo wrote:
>> Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
>>> An in-band dedup can avoid some of these problems, especially if it
>>> intercepts writes before they make it to disk. There is no need for
>>> complicated on-disk-extent-splitting algorithms if we can avoid writing
>>> extents that needed to be split in the first place.
>> Yes, that would be best, but it can get very tricky when things comes to
>> detail.
>>
>> For example, if inband dedup finds the write is duplicated with one page
>> that is already written into disk, what it should do?
>
> Replace all the duplicate blocks with refs to existing data on disk
> (making partial references to existing extents) and write the rest to
> disk the usual way. I think that's what you're already doing. ;)
Yes, that's exactly what I'm doing.
And the first version of the patchset should be ready to post quite soon. :)
>
> In my dedup I divide extent match candidates into "perfect" and
> "imperfect" cases. A perfect match has the boundaries of duplicate and
> unique block sequences aligned to the boundaries of the extents, so no
> extent splitting is required and no waste accumulates. An imperfect match
> requires splitting one or more extents in order to create perfect matches.
>
> It's reasonable to run dedup in a mode where imperfect matches are
> disabled, so only perfect matches are used for dedup. This still gets
> a lot of duplicated data, it doesn't waste space, and it's _much_ faster
> when the data includes a lot of big uncompressed extents. It works well
> on a development machine with lots of similar source trees checked out,
> or many chroot build environments populated with similar binaries.
> Unfortunately it doesn't work at all when there are big defragmented
> VM images with copies of filesystem files. Sometimes this tradeoff is
> worth it, sometimes it's not. It depends on the data and it's just
> one if statement to turn it on or off.
>
> It's also possible to operate in a mode where the distinction between
> perfect and imperfect matches is ignored. This is also fast but it
> wastes space. How much space is wasted depends on the writing pattern
> on the disk. It doesn't seem to waste more space than btrfs already
> does when random writes occur over a large contiguous extent--and if
> that problem can be solved, it can be solved for deup the same way.
>
>> The already on disk one maybe written by old kernels without inband dedup
>> support.
>
> I don't think it matters if the blocks were written with or without
> dedup support before (I've never had to make that distinction in code).
> If there is a persistent hash table then that table might be missing data
> from an old kernel, but if we're just hashing blocks as they appear in
> RAM then no kernel other than the currently running one matters.
>
>>> I use a much smaller hash (40-50 bits, depending on the size of the
>>> filesystem and expected duplicate rate) that is intentionally sized
>>> to false-positive on about 0.1% of matching hashes--no more, no less.
>>> The cost is having to read all possibly duplicate blocks to determine if
>>> the data is really duplicate, and 0.1% of the time I get a false positive.
>>> The benefit is that I can get more hash table coverage per unit of RAM,
>>> so I can use a tiny machine to dedup a huge filesystem.
>> I tried to use CRC32 as hash other than SHA256, as it can reuse as checksum.
>>
>> But I found things get tricky if I needs to compare page by page to
>> determine if they are really the same.
>> Which means you may start a lot of new read just to dedup one new write,
>> not only impact the performance but also make the original read routine to
>> be a mess.
>> And that's more harder than my previous expect when it comes to coding.
>
> I looked at CRC32 because it would have been really nice to reuse the
> existing checksums...but it's too small. The maximum filesystem size for a
> 32-bit hash and a 0.1% FP rate is about 32MB.
>
>> So I choose SHA256 as that's will skip the hard to code compare parts.
>
> There's already a read-and-compare in the extent-same ioctl. :-/
Yes, but the extent-same ioctl needs two inodes (one is already provided
by the ioctl system call):
one for the source and one for the destination, which makes the code easy
to implement, as we can just use VFS calls to read the needed pages.
But the problem is that for dedup there is only a hash <-> extent mapping.
We only know the bytenr corresponding to a hash, not an inode.
Even if we used the read-and-compare method, the write routine would have
to reuse the read routine, creating a more complicated read/write path
combination to test, and later changes to the read routine would get quite
dirty trying to avoid impacting the write routine.
And what's worse, an extent can be referred to by several different
inodes, so even in the best case we need to find an inode via backrefs and
then use address-space calls to read the needed pages.
That leads to the hell of delayed refs.
Since backrefs are delayed at write time, we need to not only search the
extent tree for backref items but also walk through the delayed refs.
(Such things were the cause of the crazy quota numbers before, which I
tried to fix in 4.2-rc1.)
And that's what I feared facing: complicated read/write coupling plus the
backref search nightmare.
>
>>> I use random drop instead of LRU on the hash table. Hashes of duplicate
>>> blocks are ordered to expire after hashes of unique blocks, but hashes
>>> within both groups expire in random order. When I find a pair of
>>> duplicate blocks in two extents, I check the adjacent blocks to see if
>>> they are also identical. If they are, I expand the duplicate block range
>>> and repeat, resulting in a maximum-length sequence of duplicate blocks.
>>> Typical duplicate files contain long sequences of duplicate blocks where
>>> each sequence contains blocks that are mostly unique except when they
>>> appear in the sequence (e.g. a JPEG photo will contain thousands of
>>> blocks, each of which are only ever found in other copies of the same
>>> JPEG photo). It's only necessary to find any one of those blocks in the
>>> block hash table--the others will be found because they are adjacent
>>> to the block in the hash table. In other words, for typical cases we
>>> can just throw most of the block hash table away, keeping only a small
>>> sample of the blocks.
>> The adjust blocks theory is completely right and makes sense, but I think
>> this idea is trading IO for memory usage.
>>
>> The problem is still the same, when write, you still needs to read
>> unrelated(although potential duplicated) contents to compare or do
>> adjustment check.
>>
>> If that's all happens at user space, I'm completely OK with it.
>> User knows the idea/mechanism/behavior of it, so user choose to use it.
>>
>> But when it comes to kernel, any write will probably cause not only write
>> bio, but also read bio first, making original pure write sequence into rw
>> mixed one, impacting the performance.
>>
>> IMHO, any implement involved re-read at write time, is just trading I/O for
>> memory.
>> If I'm wrong, please point it out.
>
> That's precisely the tradeoff I made: up to 0.1% more I/O for unique
> blocks, and duplicate writes would become reads instead (assuming I moved
> my dedup daemon into the kernel). There is some extra seeking on writes
> to confirm duplicate pages--but when they are confirmed, the writes are
> no longer necessary and their I/O can be dropped.
>
> The payoff is that it requires 100 times less RAM, so it can coexist
> with other things on a medium-sized machine.
A quite convincing trade-off now.
But the extent reading problem above is still there.
Although I may also be wrong here, as I'm not very familiar with the VFS
and filemap code.
If there is a good kernel API to read pages directly by bytenr, without
going through the filemap/address-space infrastructure, I'll be quite
happy to try it.
>
>> In typical case, with your method, it can restore one example blocks.
>> But when a hash of first write hits the hash of example blocks, adjusted
>> blocks needs to be read out to test if they matches.
>> The result would be, you need to write 16 pages original, but dedup needs to
>> do 15(the example one is in memory already) pages read first.
>>
>> That would be OK for SSD as it reads are faster than write, but for HDD,
>> read speed is the same with write, so IO time is not really saved, only disk
>> space usage.
>>
>> So unfortunately, I'm not a big fan of the idea that needs read before
>> really write.
>
> If you insist on doing the ZFS trick where duplicate writes never touch
> the disk at all, you just need big hashes and absurd quantities of RAM.
> I'm trying to make live dedup practical on much smaller systems.
Or use LRU to further reduce the deduplication rate and the code size. :)
Another reason I'm a little worried about your near-perfect dedup method
is the code complexity and the policy question.
On the complexity side, more code means more bugs, and if a bug happens
(it would be a miracle if none did) it's a kernel bug...
Anything from a kernel warning to a crash to data corruption...
In fact, even though I tried my best to reuse every possible existing
routine, my patchset has already reached 800 lines (with about 50~100
lines of comments), even without the final mount-option part.
I'm quite happy to reduce lines of code, but not to add so many...
The other problem, although more political than practical, is that the
kernel should only provide mechanism, not policy; policy should be decided
in user space.
To me, the adjacency check is already a policy, as it assumes that most
duplicated pages will be adjacent to each other.
But this argument is not so convincing anyway, as the kernel already makes
a lot of decisions based on different policies, like memory watermarks,
swappiness and so on.
>
>>> These two techniques save a _lot_ of RAM compared to approaches based
>>> on large, collision-free hashes and LRU. I can intentionally undersize
>>> the block hash table's RAM allocation by multiple orders of magnitude
>>> and still find 99% of duplicate blocks on a filesystem.
>>>
>>> In my case I'm *really* starved for RAM--I push 25TB of data through a
>>> 10GB hash table with a 45% duplicate rate. That's only 10% of the RAM
>>> I would require for full hash table coverage coverage of the filesystem.
>>>
>> I'm a little interesting about the read I/O.
>> How much read I/O does it start?
>
> For each new block added to the filesystem, calculate its hash and
> (if known) its physical address. Usually no read is required because
> we're talking about a block already in memory.
>
> If the block hash is in the table, but the table has the same physical
> address for the block (in case of multiple entries, any matching entry
> will suffice) the block is already deduped and we are done (this
> would only be relevant to an in-kernel implementation if it wanted to
> opportunistically dedup on reads).
>
> Next we go through all the hash table entries at different physical
> addresses from the input block. We read the block listed in the hash
> table (on average, we do this twice 0.1% of the time, and 3 times 0.0001%
> of the time) and compare it to the new block. If all the blocks
> are different, our new block is unique, so we add it to the hash table
> (possibly forcing allocation so it has a known physical address) and
> we are done. By this point we've read an extra 4.004K on average for
> duplicate blocks, and just 0.004K for unique blocks.
>
> If we have a pair of identical blocks in different physical locations,
> we read forward and backward in from both until we hit some reason to
> stop (including: perfect match found (all the extent boundaries match
> exactly), different block data, end or beginning of one of the files,
> exceeding the 16MB work limit of extent-same, I/O errors, etc). This
> can read up to 16x2 = 32MB, although it is likely that half of those
> reads were already in the page cache. If we have a perfect match at
> this point we have all the arguments required for extent-same, so we
> can call that and we are done. Extent-same will do no I/O since the
> data was very recently read into cache.
Normally I don't count on the kernel page cache; I just assume that any
read will start real I/O.
The page cache is not something we can rely on controlling, especially
under memory pressure.
>
> Note that when we read ahead from a matching block, we are also processing
> later new blocks entering the filesystem. Typically when these blocks are
> duplicate it means that we have deleted them before the hash calculator
> gets to process them, so some of the I/O we spent above is reclaimed here.
>
> If we found only imperfect matches, we keep searching (up to some
> arbitrary limit, like 10) other files containing the physical block,
> repeating the steps above. This means we are now reading up to 16MB
> times whatever the limit is. If we find a perfect match at any point
> we handle it as above.
>
> If we found only imperfect matches, we do extent-splitting on the best
> imperfect match found. Extent-splitting requires up to 2GB of reads
> and 2GB of writes in the worst possible case. If this is too much I/O
> we can stop just before here and create a new block hash table entry
> as if the block was a non-duplicate.
>
> When extents are split by copying, they show up as new writes which
> are immediately fed back into dedup. This repeats until the extents
> (and all their references) are fully fragmented into runs of unique and
> duplicate blocks, and the duplicate blocks are deduped. The number
> of passes required depends on the history of previous deduplication,
> defrag, snapshots, and whatever optimizations were implemented in the
> dedup engine.
>
>> The persist hash map is just for special use case, like system admin knows
>> some files are easy to have duplication, then pin the hash map into memory
>> to improve dedup rate.
>>
>> And to move the responsibility of OOM or other problem to sysadmins. :)
>> Only use the simplest policy in kernel, if user not happy with it,
>> then implement a good policy in user space.
>>
>> But this is just a TODO, no goal or milestone yet.
>
> My target environment is a file server device where up to 75% of the
> system RAM might be devoted to dedup. It is not something that is
> intended to run unnoticed on a smartphone.
The original thought behind the persistent hash map was databases -- hard
to imagine, right?
From what I read in an Oracle datasheet, they claim their database can
take full advantage of ZFS deduplication, so I consider the persistent
hash map idea a supplement to the uncertainty of LRU drop.
But that's just a future idea for now...
>
>>> I have the same 4K page size, but even with a fully populated hash table
>>> (all blocks on disk indexed) I use only 1712 bytes per 512K--including
>>> the block pointers and indexing overhead.
>> Did you mean 1712 bytes for a dedup unit? Or 1712 is for several dedup unit
>> that covers the 512K? As 1712 can't be dived by 128...
>
> My hash table entries look like this:
>
> 64 bit hash
> AAAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>
> 64 bit physical address
> PPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPZZZZZZZZZZZZ
>
> The 21 "A" bits are all the same within a page--I have 2^21 pages in
> my test configuration (8GB). They don't need to be stored inside the
> hash table entry since they can be inferred from the address of the hash
> table entry. Similarly the last 12 "Z" bits are always zero in physical
> addresses because of the 4K block size, so I don't store those either.
>
> That means my hash table entries are actually 64 + 64 - 21 - 12 = 95
> bits long. On a 16TB filesystem with 4KB block size every single bit
> in a hash table entry requires another 512MB of RAM, so I pack them into
> pages without aligning them to byte boundaries.
>
> For some time I was experimenting with making the hash table entries
> even smaller (variable-length encodings and LZMA compressing each page)
> and I got as low as 41 bits per entry with a high CPU overhead. Later I
> made the observation about adjacent blocks and random drop in the block
> hash table which allowed much larger RAM savings without as much CPU
> overhead, so my hash table entries are now much larger than they were
> in earlier versions of my code.
Already quite awesome.
BTW, is your userspace dedup daemon open source?
It would be quite nice if we could refer to it to improve things further.
Thanks,
Qu
>
>> In fact, I can reduce the space usage by just(not so easy to call it just
>> anyway) change dedup size to 512K. So just one btrfs_dedup_hash for 512K
>> other than original 128 dedup_hash for 512K.
>> That's also why Liu Bo uses tunable dedup_size.
>
> Anyone can increase the block size to make their dedup faster. We lose
> 100% of all duplicate extents that are smaller than (or just not aligned
> to) the block size.
>
>> Anyway, my dedup_hash struct is really huge...
>> It's 120 bytes with all overhead for one btrfs_dedup_hash (and one page
>> yet).
>
> I'm told ZFS is closer to 320. ;)
>
>>> At about 157 bytes per 512K
>>> (the amount of RAM I have to run dedup) the dedup still covers 92%
>>> of duplicate extents with length 64K, with less coverage for smaller
>>> extents and more coverage for larger extents. The size distribution
>>> of typical filesystems means I get 99% of all duplicate blocks.
>>>
>> Pretty impressive, but still a little worried about the code complexibility
>> especially when it needs to go into kernel.
>>
>> At last, thanks for your impressive dedup ideas and test data.
>>
>> Thanks,
>> Qu
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Discuss on inband dedup implement (Original "strange data backref offset")
2015-07-22 1:49 ` Discuss on inband dedup implement (Original "strange data backref offset") Qu Wenruo
@ 2015-07-22 3:49 ` Zygo Blaxell
2015-07-22 6:03 ` Qu Wenruo
0 siblings, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2015-07-22 3:49 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Liu Bo, btrfs
[-- Attachment #1: Type: text/plain, Size: 8105 bytes --]
On Wed, Jul 22, 2015 at 09:49:52AM +0800, Qu Wenruo wrote:
> Change subject to reflect the core of the conversation.
>
> Zygo Blaxell wrote on 2015/07/21 18:14 -0400:
> >On Tue, Jul 21, 2015 at 02:52:38PM +0800, Qu Wenruo wrote:
> >>Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
> >There's already a read-and-compare in the extent-same ioctl. :-/
> Yes, but the extent same ioctl needs 2 inode. (one already provided by ioctl
> system call)
> One for the source and one for the dest, which makes codes easy to implement
> as we can just use VFS calls to read the needed pages.
>
> But the problem is, for dedup, there is only hash <-> extent mapping.
> We only know the corresponding bytenr of a hash, not a inode.
>
> If even using the read-compare method, which means at write routine, we will
> resuse read routine, causing more complicated read/write path combination to
> test, and later read routine change will be quite dirty to avoid impact on
> write routine.
>
> And what's worse is, an extent can be referred by several different inodes,
> so even for the best case, we need to find an inode using backref and then
> use address space calls to read the neede pages.
>
> That will leads to the hell of delayed ref.
> As at write time, backref is delayed, it needs not only search the extent
> tree for backref item but also go through delayed refs. (Such things were
> the reason causing crazy quota numbers before and I tried to fix in 4.2-rc1)
>
> And that's what I feared to face, complicated read/write coupling with
> backref search nightmare.
Yes, putting a small-memory dedup into the write path on Linux does
cause some challenging (some might say insurmountable ;) ) problems.
A natural result of having a userspace daemon is that it can keep up
with writes but it is not on the critical path for them, i.e. if we have
memory pressure we just throw the blocks at the disk to free up some RAM,
but remember the block addresses so they can be rescanned later when we're
more idle. A daemon with a persistent store and a find-new style scanner
can be paused and allowed to run overnight if it's too heavy to run live.
I could see such a facility living inside the kernel and being started and
stopped by an admin...but it might be a better idea to just adjust the
kernel's interface to communicate better with a dedup daemon.
e.g. I have a wishlist item to have something like fanotify which gives me
a file descriptor when a file is modified--but that also includes dirty
block numbers. I can connect those modified blocks with their duplicate
on-disk counterparts almost immediately, before delalloc even considers
writing them.
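For what it's worth, today's fanotify already gives me the "fd on
modification" half of that wish; a minimal sketch is below (needs
CAP_SYS_ADMIN, and /mnt/test is just an example mount point).  What it
can't give me is the dirty block numbers:
------
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
	/* Watch a whole mount for modifications; each event carries an
	 * open fd for the modified file, but no dirty block numbers. */
	int fan = fanotify_init(FAN_CLASS_NOTIF | FAN_CLOEXEC, O_RDONLY);

	if (fan < 0 || fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_MOUNT,
				     FAN_MODIFY | FAN_CLOSE_WRITE,
				     AT_FDCWD, "/mnt/test") < 0) {
		perror("fanotify");
		return 1;
	}

	for (;;) {
		char buf[4096];
		ssize_t len = read(fan, buf, sizeof(buf));
		struct fanotify_event_metadata *m = (void *)buf;

		for (; FAN_EVENT_OK(m, len); m = FAN_EVENT_NEXT(m, len)) {
			printf("modified: fd %d (pid %d)\n", m->fd, m->pid);
			if (m->fd >= 0)
				close(m->fd);   /* we must close event fds */
		}
	}
}
------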
> >That's precisely the tradeoff I made: up to 0.1% more I/O for unique
> >blocks, and duplicate writes would become reads instead (assuming I moved
> >my dedup daemon into the kernel). There is some extra seeking on writes
> >to confirm duplicate pages--but when they are confirmed, the writes are
> >no longer necessary and their I/O can be dropped.
> >
> >The payoff is that it requires 100 times less RAM, so it can coexist
> >with other things on a medium-sized machine.
>
> Quite convincing trade off now.
>
> But the above extent reading problem is still here.
> Although I may also be wrong as I'm not quite similar with VFS and filemap.
>
> If there is any good kernel API to read pages directly by bytenr without
> using filemap/address space infrastructure, I'll be quite happy to try it.
I had assumed (but not checked) that the kernel can do it more easily than
userspace. The ugliest pieces of code for userspace dedup are the pieces
that translate a block number into a path into a file descriptor, then have
to verify that they ended up with the same physical block that was requested.
So another wishlist item would be an ioctl for "give me a file descriptor
containing a ref to this blocknr without having to go through all this
root-inode-path resolution crap." ;)
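Concretely, the translation I'm complaining about starts with LOGICAL_INO;
a stripped-down sketch (minimal error handling, needs root; argv[1] is the
logical bytenr and argv[2] is any path on the filesystem):
------
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Ask btrfs which (root, inode, offset) triples reference a logical byte
 * number.  Turning those into paths and then open()able fds takes a
 * further BTRFS_IOC_INO_PATHS call per inode plus a subvolume lookup --
 * that's the ugly part. */
int main(int argc, char **argv)
{
	int fd = open(argv[2], O_RDONLY);        /* any fd on the fs */
	struct btrfs_data_container *out = calloc(1, 65536);
	struct btrfs_ioctl_logical_ino_args loi = {
		.logical = strtoull(argv[1], NULL, 0),
		.size    = 65536,
		.inodes  = (uintptr_t)out,
	};

	if (ioctl(fd, BTRFS_IOC_LOGICAL_INO, &loi) < 0) {
		perror("LOGICAL_INO");
		return 1;
	}
	for (unsigned i = 0; i < out->elem_cnt; i += 3)
		printf("root %llu inode %llu offset %llu\n",
		       (unsigned long long)out->val[i + 2],
		       (unsigned long long)out->val[i],
		       (unsigned long long)out->val[i + 1]);
	return 0;
}
------
(From the command line, "btrfs inspect-internal logical-resolve" does the
same dance all the way to paths.)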
> >My target environment is a file server device where up to 75% of the
> >system RAM might be devoted to dedup. It is not something that is
> >intended to run unnoticed on a smartphone.
> The original thought on persist hash map is for database, hard to imagine
> right?
>
> From what I read from Oracle datasheet, they claim their database can take
> full use of ZFS deduplication, so I consider such persist hash map idea as a
> supplement to the uncertain LRU drop.
> But that's just a future idea yet...
I looked at a number of databases, hash table structures, bloom filters,
cuckoo tables, exotic modern C++ hashtable libraries from Google, and
ancient articles on hash tables by Knuth. In the end I gave up and
rolled my own.
I have a lot of duplicate data--almost half--so I can't use algorithms
designed for a 10% hit rate (like bloom and cuckoo). I don't want to
scan the entire filesystem from scratch every time, so Bloom was out
(no way to delete items, and the variations on Bloom that allow deletion
require almost as much space as the hash table itself). Indexing overhead
for b-trees and most database engines would be larger than the hash table.
With a 50% duplicate hit rate, _any_ algorithm that spilled out to
disk would be hitting the disk with a random I/O for every 8K of data.
This meant that any data that wasn't going to fit into RAM might as well
be on the moon. If I tried to keep it, I'd be swapping all the time.
I found a way to work in constant space, because RAM was all the storage
I'd ever be able to use.
Persistence is trivial: just dump the RAM out to disk every few hours.
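If it helps, the random-drop side of the table (the part that makes
constant space workable) is roughly the toy below, for one fixed-size
bucket -- assumptions: 8-slot buckets, rand() as the RNG, and a "dup" flag
set once an entry has matched a new block; the real code differs:
------
#include <stdint.h>
#include <stdlib.h>

/* Fixed-size hash bucket with random replacement.  Entries that have
 * already produced a duplicate hit are preferred survivors: evict a random
 * not-yet-duplicate entry if one exists, otherwise a random duplicate
 * entry.  No LRU lists, no allocations, constant space. */
#define BUCKET 8

struct entry {
	uint64_t hash;
	uint64_t phys;
	uint8_t  used;
	uint8_t  dup;      /* has this entry ever matched a new block? */
};

static void insert(struct entry b[BUCKET], uint64_t hash, uint64_t phys)
{
	int victims[BUCKET], n = 0;

	for (int i = 0; i < BUCKET; i++)           /* free slot first */
		if (!b[i].used) {
			b[i] = (struct entry){ hash, phys, 1, 0 };
			return;
		}
	for (int i = 0; i < BUCKET; i++)           /* prefer evicting uniques */
		if (!b[i].dup)
			victims[n++] = i;
	if (n == 0)
		for (int i = 0; i < BUCKET; i++)   /* all dups: evict any */
			victims[n++] = i;
	b[victims[rand() % n]] = (struct entry){ hash, phys, 1, 0 };
}

int main(void)
{
	struct entry bucket[BUCKET] = { 0 };

	for (uint64_t i = 0; i < 100; i++)
		insert(bucket, i * 2654435761u, i * 4096);
	return 0;
}
------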
> BTW, is your userspace dedup daemon open-source?
> It would be quite nice if we can refer it to improve further.
It is open-source in as much as it is source at all, but it's not in
a usable state yet. Right now it is a pile of command-line utilities,
several experimental test programs which perform different parts of the
task, some ad-hoc Perl scripts to glue them together, and test-driven
rework that I am slowly converting into a coherent working program
that other human beings can read and use. The "persistent" on-disk
format changes every three weeks. I'm on my fourth rewrite of the
extent-matching loop because the first three ran into logical landmines
in my test data.
It was originally supposed to be a 1000-line replacement for
faster-dupemerge to be finished in a week. This is now month 14, it
has crossed the 2000 line mark, and I'm hoping this month will be
the first month that I rewrite _less_ than 30% of the code. :-P
> Thanks,
> Qu
> >
> >>In fact, I can reduce the space usage by just(not so easy to call it just
> >>anyway) change dedup size to 512K. So just one btrfs_dedup_hash for 512K
> >>other than original 128 dedup_hash for 512K.
> >>That's also why Liu Bo uses tunable dedup_size.
> >
> >Anyone can increase the block size to make their dedup faster. We lose
> >100% of all duplicate extents that are smaller than (or just not aligned
> >to) the block size.
> >
> >>Anyway, my dedup_hash struct is really huge...
> >>It's 120 bytes with all overhead for one btrfs_dedup_hash (and one page
> >>yet).
> >
> >I'm told ZFS is closer to 320. ;)
> >
> >>>At about 157 bytes per 512K
> >>>(the amount of RAM I have to run dedup) the dedup still covers 92%
> >>>of duplicate extents with length 64K, with less coverage for smaller
> >>>extents and more coverage for larger extents. The size distribution
> >>>of typical filesystems means I get 99% of all duplicate blocks.
> >>>
> >>Pretty impressive, but still a little worried about the code complexibility
> >>especially when it needs to go into kernel.
> >>
> >>At last, thanks for your impressive dedup ideas and test data.
> >>
> >>Thanks,
> >>Qu
>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Discuss on inband dedup implement (Original "strange data backref offset")
2015-07-22 3:49 ` Zygo Blaxell
@ 2015-07-22 6:03 ` Qu Wenruo
0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2015-07-22 6:03 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Liu Bo, btrfs
Zygo Blaxell wrote on 2015/07/21 23:49 -0400:
> On Wed, Jul 22, 2015 at 09:49:52AM +0800, Qu Wenruo wrote:
>> Change subject to reflect the core of the conversation.
>>
>> Zygo Blaxell wrote on 2015/07/21 18:14 -0400:
>>> On Tue, Jul 21, 2015 at 02:52:38PM +0800, Qu Wenruo wrote:
>>>> Zygo Blaxell wrote on 2015/07/21 00:55 -0400:
>
>>> There's already a read-and-compare in the extent-same ioctl. :-/
>> Yes, but the extent same ioctl needs 2 inode. (one already provided by ioctl
>> system call)
>> One for the source and one for the dest, which makes codes easy to implement
>> as we can just use VFS calls to read the needed pages.
>>
>> But the problem is, for dedup, there is only hash <-> extent mapping.
>> We only know the corresponding bytenr of a hash, not a inode.
>>
>> If even using the read-compare method, which means at write routine, we will
>> resuse read routine, causing more complicated read/write path combination to
>> test, and later read routine change will be quite dirty to avoid impact on
>> write routine.
>>
>> And what's worse is, an extent can be referred by several different inodes,
>> so even for the best case, we need to find an inode using backref and then
>> use address space calls to read the neede pages.
>>
>> That will leads to the hell of delayed ref.
>> As at write time, backref is delayed, it needs not only search the extent
>> tree for backref item but also go through delayed refs. (Such things were
>> the reason causing crazy quota numbers before and I tried to fix in 4.2-rc1)
>>
>> And that's what I feared to face, complicated read/write coupling with
>> backref search nightmare.
>
> Yes, putting a small-memory dedup into the write path on Linux does
> cause some challenging (some might say insurmountable ;) ) problems.
>
> A natural result of having a userspace daemon is that it can keep up
> with writes but it is not on the critical path for them, i.e. if we have
> memory pressure we just throw the blocks at the disk to free up some RAM,
> but remember the block addresses so they can be rescanned later when we're
> more idle. A daemon with a persistent store and a find-new style scanner
> can be paused and allowed to run overnight if it's too heavy to run live.
>
> I could see such a facility living inside the kernel and being started and
> stopped by an admin...but it might be a better idea to just adjust the
> kernel's interface to communicate better with a dedup daemon.
>
> e.g. I have a wishlist item to have something like fanotify which gives me
> a file descriptor when a file is modified--but that also includes dirty
> block numbers. I can connect those modified blocks with their duplicate
> on-disk counterparts almost immediately, before delalloc even considers
> writing them.
If the dedup daemon only runs in userspace, then I'm completely OK with
adding a better ioctl in the kernel just for the daemon.
(Not sure about the other guys though.)
But I don't have a good idea for the new ioctl interface yet.
I used to think an ioctl telling btrfs to read out a given range and pin
its hashes in memory would be good enough, but that's just another
supplement to ZFS-style dedup and will never be on par with your method
anyway.
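Just to make that idea concrete, the interface I had in mind is nothing
more than something like the struct below -- completely hypothetical, the
name and layout are made up for discussion only:
------
#include <linux/types.h>
#include <stdio.h>

/* Hypothetical, no such ioctl exists: tell btrfs to hash the byte range
 * [offset, offset + len) of the target fd and pin the resulting dedup
 * hashes in memory until they are unpinned or the fs is unmounted. */
struct btrfs_ioctl_dedup_pin_args {
	__u64 offset;         /* in: start of range, in bytes          */
	__u64 len;            /* in: length of range, in bytes         */
	__u64 flags;          /* in: e.g. a hypothetical UNPIN bit     */
	__u64 reserved[5];    /* room for later extensions             */
};

int main(void)
{
	printf("args size: %zu bytes\n",
	       sizeof(struct btrfs_ioctl_dedup_pin_args));    /* 64 */
	return 0;
}
------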
>
>>> That's precisely the tradeoff I made: up to 0.1% more I/O for unique
>>> blocks, and duplicate writes would become reads instead (assuming I moved
>>> my dedup daemon into the kernel). There is some extra seeking on writes
>>> to confirm duplicate pages--but when they are confirmed, the writes are
>>> no longer necessary and their I/O can be dropped.
>>>
>>> The payoff is that it requires 100 times less RAM, so it can coexist
>>> with other things on a medium-sized machine.
>>
>> Quite convincing trade off now.
>>
>> But the above extent reading problem is still here.
>> Although I may also be wrong as I'm not quite similar with VFS and filemap.
>>
>> If there is any good kernel API to read pages directly by bytenr without
>> using filemap/address space infrastructure, I'll be quite happy to try it.
>
> I had assumed (but not checked) that the kernel can do it more easily than
> userspace. The ugliest pieces of code for userspace dedup are the pieces
> that translate a block number into a path into a file descriptor, then have
> to verify that they ended up with the same physical block that was requested.
>
> So another wishlist item would be an ioctl for "give me a file descriptor
> containing a ref to this blocknr without having to go through all this
> root-inode-path resolution crap." ;)
>
>>> My target environment is a file server device where up to 75% of the
>>> system RAM might be devoted to dedup. It is not something that is
>>> intended to run unnoticed on a smartphone.
>> The original thought on persist hash map is for database, hard to imagine
>> right?
>>
>> From what I read from Oracle datasheet, they claim their database can take
>> full use of ZFS deduplication, so I consider such persist hash map idea as a
>> supplement to the uncertain LRU drop.
>> But that's just a future idea yet...
>
> I looked at a number of databases, hash table structures, bloom filters,
> cuckoo tables, exotic modern C++ hashtable libraries from Google, and
> ancient articles on hash tables by Knuth. In the end I gave up and
> rolled my own.
>
> I have a lot of duplicate data--almost half--so I can't use algorithms
> designed for a 10% hit rate (like bloom and cuckoo). I don't want to
> scan the entire filesystem from scratch every time, so Bloom was out
> (no way to delete items, and the variations on Bloom that allow deletion
> require almost as much space as the hash table itself). Indexing overhead
> for b-trees and most database engines would be larger than the hash table.
>
> With a 50% duplicate hit rate, _any_ algorithm that spilled out to
> disk would be hitting the disk with a random I/O for every 8K of data.
> This meant that any data that wasn't going to fit into RAM might as well
> be on the moon. If I tried to keep it, I'd be swapping all the time.
> I found a way to work in constant space, because RAM was all the storage
> I'd ever be able to use.
>
> Persistence is trivial: just dump the RAM out to disk every few hours.
>
>> BTW, is your userspace dedup daemon open-source?
>> It would be quite nice if we can refer it to improve further.
>
> It is open-source in as much as it is source at all, but it's not in
> a usable state yet. Right now it is a pile of command-line utilities,
> several experimental test programs which perform different parts of the
> task, some ad-hoc Perl scripts to glue them together, and test-driven
> rework that I am slowly converting into a coherent working program
> that other human beings can read and use. The "persistent" on-disk
> format changes every three weeks. I'm on my fourth rewrite of the
> extent-matching loop because the first three ran into logical landmines
> in my test data.
>
> It was originally supposed to be a 1000-line replacement for
> faster-dupemerge to be finished in a week. This is now month 14, it
> has crossed the 2000 line mark, and I'm hoping this month will be
> the first month that I rewrite _less_ than 30% of the code. :-P
>
I can't wait to see the first released version.
From what I've read so far it should be quite impressive, especially the
daemon idea.
Another impressive thing is the line count, which is much lower than I
expected.
I had supposed it would take over 3000 lines just for the adjacency check
logic.
Thanks,
Qu
>> Thanks,
>> Qu
>>>
>>>> In fact, I can reduce the space usage by just(not so easy to call it just
>>>> anyway) change dedup size to 512K. So just one btrfs_dedup_hash for 512K
>>>> other than original 128 dedup_hash for 512K.
>>>> That's also why Liu Bo uses tunable dedup_size.
>>>
>>> Anyone can increase the block size to make their dedup faster. We lose
>>> 100% of all duplicate extents that are smaller than (or just not aligned
>>> to) the block size.
>>>
>>>> Anyway, my dedup_hash struct is really huge...
>>>> It's 120 bytes with all overhead for one btrfs_dedup_hash (and one page
>>>> yet).
>>>
>>> I'm told ZFS is closer to 320. ;)
>>>
>>>>> At about 157 bytes per 512K
>>>>> (the amount of RAM I have to run dedup) the dedup still covers 92%
>>>>> of duplicate extents with length 64K, with less coverage for smaller
>>>>> extents and more coverage for larger extents. The size distribution
>>>>> of typical filesystems means I get 99% of all duplicate blocks.
>>>>>
>>>> Pretty impressive, but still a little worried about the code complexibility
>>>> especially when it needs to go into kernel.
>>>>
>>>> At last, thanks for your impressive dedup ideas and test data.
>>>>
>>>> Thanks,
>>>> Qu
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Strange data backref offset?
2015-07-17 2:38 Strange data backref offset? Qu Wenruo
2015-07-18 11:35 ` Liu Bo
@ 2015-07-29 16:24 ` Filipe David Manana
1 sibling, 0 replies; 11+ messages in thread
From: Filipe David Manana @ 2015-07-29 16:24 UTC (permalink / raw)
To: Qu Wenruo; +Cc: btrfs
On Fri, Jul 17, 2015 at 3:38 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Hi all,
>
> While I'm developing a new btrfs inband dedup mechanism, I found btrfsck and
> kernel doing strange behavior for clone.
>
> [Reproducer]
> # mount /dev/sdc -t btrfs /mnt/test
> # dd if=/dev/zero of=/mnt/test/file1 bs=4K count=4
> # sync
> # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2
> # sync
>
> Then btrfs-debug-tree gives quite strange result on the data backref:
> ------
> <EXTENT TREE>
> item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize 111
> extent refs 3 gen 6 flags DATA
> extent data backref root 5 objectid 257 offset 0 count 1
> extent data backref root 5 objectid 258 offset
> 18446744073709547520 count 1
>
> <FS TREE>
> item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 0 nr 16384 ram 16384
> extent compression 0
> item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53
> extent data disk byte 12845056 nr 16384
> extent data offset 4096 nr 4096 ram 16384
> extent compression 0
> ------
>
> The offset is file extent's key.offset - file exntent's offset,
> Which is 0 - 4096, causing the overflow result.
>
> Kernel and fsck all uses that behavior, so fsck can pass the strange thing.
>
> But shouldn't the offset in data backref matches with the key.offset of the
> file extent?
>
> And I'm quite sure the change of behavior can hugely break the fsck and
> kernel, but I'm wondering is this a known BUG or feature, and will it be
> handled?
Obviously a bug.
I was recently investigating incremental send failures after
cloning/deduping extents and that led me to this as well.
It's a bug, but it's not too bad as it affects only backref walking,
which can have a simple workaround (I just sent a patch for it). For
the purposes of incrementing/decrementing the data backref's count we
do the same calculation everywhere, always arriving at the same large
and unexpected value, so we don't get bogus backrefs added or left
around.
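(The 18446744073709547520 quoted above is just that same subtraction
wrapping around as an unsigned 64-bit value; a two-line check:)
------
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	/* file extent key.offset minus the file extent's data offset,
	 * computed in u64 as the backref code does: 0 - 4096 wraps. */
	uint64_t key_offset = 0, extent_offset = 4096;

	printf("%" PRIu64 "\n", key_offset - extent_offset);
	/* prints 18446744073709547520 */
	return 0;
}
------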
>
> Thanks,
> Qu
--
Filipe David Manana,
"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2015-07-29 16:24 UTC | newest]
Thread overview: 11+ messages
2015-07-17 2:38 Strange data backref offset? Qu Wenruo
2015-07-18 11:35 ` Liu Bo
2015-07-19 7:23 ` Zygo Blaxell
2015-07-20 2:24 ` Qu Wenruo
2015-07-21 4:55 ` Zygo Blaxell
2015-07-21 6:52 ` Qu Wenruo
2015-07-21 22:14 ` Zygo Blaxell
2015-07-22 1:49 ` Discuss on inband dedup implement (Original "strange data backref offset") Qu Wenruo
2015-07-22 3:49 ` Zygo Blaxell
2015-07-22 6:03 ` Qu Wenruo
2015-07-29 16:24 ` Strange data backref offset? Filipe David Manana