From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from james.kirk.hungrycats.org ([174.142.39.145]:36214 "EHLO james.kirk.hungrycats.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751026AbbGSHXq (ORCPT ); Sun, 19 Jul 2015 03:23:46 -0400 Date: Sun, 19 Jul 2015 03:23:45 -0400 From: Zygo Blaxell To: Liu Bo Cc: Qu Wenruo , btrfs Subject: Re: Strange data backref offset? Message-ID: <20150719072344.GA22025@hungrycats.org> References: <55A86AA8.6010404@cn.fujitsu.com> <20150718113412.GA11450@localhost.localdomain> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="vkogqOf2sHV7VnPd" In-Reply-To: <20150718113412.GA11450@localhost.localdomain> Sender: linux-btrfs-owner@vger.kernel.org List-ID: --vkogqOf2sHV7VnPd Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Sat, Jul 18, 2015 at 07:35:31PM +0800, Liu Bo wrote: > On Fri, Jul 17, 2015 at 10:38:32AM +0800, Qu Wenruo wrote: > > Hi all, > >=20 > > While I'm developing a new btrfs inband dedup mechanism, I found btrfsc= k and > > kernel doing strange behavior for clone. > >=20 > > [Reproducer] > > # mount /dev/sdc -t btrfs /mnt/test > > # dd if=3D/dev/zero of=3D/mnt/test/file1 bs=3D4K count=3D4 > > # sync > > # ~/xfstests/src/cloner -s 4096 -l 4096 /mnt/test/file1 /mnt/test/file2 > > # sync > >=20 > > Then btrfs-debug-tree gives quite strange result on the data backref: > > ------ > > > > item 4 key (12845056 EXTENT_ITEM 16384) itemoff 16047 itemsize = 111 > > extent refs 3 gen 6 flags DATA > > extent data backref root 5 objectid 257 offset 0 count 1 > > extent data backref root 5 objectid 258 offset > > 18446744073709547520 count 1 > >=20 > > > > item 8 key (257 EXTENT_DATA 0) itemoff 15743 itemsize 53 > > extent data disk byte 12845056 nr 16384 > > extent data offset 0 nr 16384 ram 16384 > > extent compression 0 > > item 9 key (257 EXTENT_DATA 16384) itemoff 15690 itemsize 53 > > extent data disk byte 12845056 nr 16384 > > extent data offset 4096 nr 4096 ram 16384 > > extent compression 0 > > ------ > >=20 > > The offset is file extent's key.offset - file exntent's offset, > > Which is 0 - 4096, causing the overflow result. > >=20 > > Kernel and fsck all uses that behavior, so fsck can pass the strange th= ing. > >=20 > > But shouldn't the offset in data backref matches with the key.offset of= the > > file extent? > >=20 > > And I'm quite sure the change of behavior can hugely break the fsck and > > kernel, but I'm wondering is this a known BUG or feature, and will it be > > handled? >=20 > Also found this before, one of the benefits is to save metadata in extent > tree, that is, if we overwrite inside an extent, the extent ref count > increases to 2 since both use (key.offset - extent_offset). >=20 > 0k 12k > |-------------------| > |--------| > 4k 8k >=20 > one EXTENT_DATA item is 0k and the other one is 8k, the corresponding > refs will be (0k - 0) and (8k - 8k). >=20 > It's a format change so it won't be easy to make a change. I'd prefer a > workaround on clone side. The workaround for dedup currently involves copying data. It could be done by manipulating the metadata directly, but btrfs doesn't seem to provide this. Suppose our dedup engine has identified duplicate blocks in distinct extents, and we want to replace them by cloning: 0k 12k |AAAAABBBBBBBBCCCCCC| <- file #1 extent #1 |BBBBBBBB| <- file #2 extent #2 4k 8k If we clone the "B" part of extent #1 over extent #2, we save one block by eliminating the last reference to extent #2; however, if file #1 is later deleted, we waste 8K. File #2 will refer to only 4K out of extent #1's 12K. Any further clones of file #2 will add references to this 12K extent. The 8K of wasted data cannot be accessed from userspace, so it's unavailable for allocation and can't be reused by future clones. Assuming dedup finds many further copies of the "B" data, and they're all replaced by clones referring to the "B" blocks of extent #1, those 8K are effectively wasted forever. If extent #1 is larger (up to 1GB), more space is wasted. The only way to get the space back after this failure is to locate every reference to extent #1 in the entire filesystem (i.e. every instance of the "B" data) and delete it. 0k 12k |AAAAABBBBBBBBCCCCCC| <- file #1 extent #1, "B" shared with file #2 aaaa|BBBBBBBB|ccccc <- file #2 "a" and "c" blocks referenced 4k 8k but not used (i.e. wasted) in file #2 If we clone extent #2 over extent #1, we avoid the above problem, but we save nothing. The refs to extent #1 from file #1 will keep the "B" blocks from extent #1 on disk even though they are not accessible: 0k /bbbbbbbb\ 12k <- file #1 unused blocks from extent #1 |AAAA|^wasted^|CCCCC| <- file #1 blocks "A" and "C" from extent #1 \BBBBBBBB/ <- file #1 blocks "B" from extent #2 |BBBBBBBB| <- file #2 extent #2 shared with file #1 4k 8k If extent #2 is larger than 4K, but only 4K of data is duplicate, then we get the same result as when we replaced extent #2 with a clone of part of extent #1: wasted space. A few popular block values on a filesystem can end up being duplicated thousands of times, so we can't afford to just ignore them. Such blocks are usually part of much larger extents where the extents as a whole are unique, so we can't get rid of the duplicate blocks without modifying the extent boundaries. Dedup has to replace extent #1 (and in some cases also extent #2) with up to three new extents so that all duplicate extents have matching boundaries before cloning: 0k 12k <- file #1 extent #1 deleted |AAAA|BBBBBBBB|CCCCC| <- file #1 extent #3, #4, #5 |BBBBBBBB| <- file #2 extent #2 4k 8k <- extent boundaries around "B" aligned Then we can clone extent #2 from file #2 over extent #4 in file #1: 0k 12k <- file #1 is now extents #3, #2, #5 |AAAA|BBBBBBBB|CCCCC| <- replace extent #4 with clone of #2 |BBBBBBBB| <- file #2 extent #2 shared with file #1 4k 8k <- no wasted space in either file Since we are just deleting extent #4 in this case, we didn't have to bother creating it. An optimized implementation could skip that step. A differently optimized implementation keeps #4 but deletes #2 because then file #1 can be read with fewer seeks: 0k 12k <- file #1 data physically contiguous |AAAA|BBBBBBBB|CCCCC| <- file #1 is now extents #3, #4, #5 |BBBBBBBB| <- file #2 extent #4 shared with file #1 4k 8k <- file #2 extent #2 replaced by extent #4 Both extents might have other shared references. When any extent is split up, all references to the extent have to be split the same way. How do we split extents into smaller pieces? Copy the contents to multiple new extents, then use those new extents to replace the original extent along the same boundaries as the duplicate data block ranges. Currently a user-space dedup agent has to rely on ioctls like FILE_EXTENT_SAME and DEFRAG_RANGE to provide the required atomic copy-and-replace semantics, and FIEMAP to discover where existing extent boundaries are. FIEMAP doesn't provide a complete picture of the underlying extent structure--in particular, FIEMAP does not tell us when a logical extent boundary does not match a btrfs physical extent boundary and a split is required. DEFRAG_RANGE refuses to split some contiguous extents into smaller, non-contiguous extents, and even when it does, there are other heuristics in btrfs that reassemble the carefully sized smaller extents back into useless larger ones. FILE_EXTENT_SAME doesn't split physical extents when it needs to--it just creates broken shared extent references with all the problems described above. Copying the data to multiple temporary files in user-space (so they can't be coalesced into larger extents) seems to create the split extents required, and FILE_EXTENT_SAME can be used to atomically splice the new extents back into the original files, replacing the original extents. Userspace is left to find all the refs itself--there are several ways to do that with various time/space/complexity trade-offs. It's _possible_ to implement a working dedup in userspace now, but it's not easy to get every corner case right. It's a little easier in the kernel. FILE_EXTENT_SAME could split extents as required, and use a non-broken internal implementation of DEFRAG_RANGE that doesn't refuse to copy data it was asked to. A flag can be added somewhere to disable the heuristics that coalesce consecutive extents together when those extents are intentionally split, but preserve the existing behavior of coalescing consecutive duplicate extents when they are the maximum length accepted by FILE_EXTENT_SAME. The kernel can chase all the extent backrefs more easily and completely than userspace can. An in-kernel dedup might even be able to make wasted blocks accessible again if their contents should reappear in new files. On a heavily defragmented filesystem the data copy required could be a little under 2GB for a single pair of cloned ranges (in the worst case of 4K cloned in two maximum-sized extents). Currently FILE_EXTENT_SAME refuses to do work involving more than 32MB of reads for a single extent pair. There's a pretty big gap between 32MB of reads and 4GB of reads _and_ writes. FILE_EXTENT_SAME also takes a vector of extents, which multiply the work. It would be better if the kernel could manipulate the metadata directly to split extents. Dedup does work now with copies, but quite inefficiently. > Thanks, >=20 > -liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html --vkogqOf2sHV7VnPd Content-Type: application/pgp-signature; name="signature.asc" Content-Description: Digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iEYEARECAAYFAlWrUIAACgkQgfmLGlazG5yVJACeLws+js7x1SciEwoSGO2i34Za KEwAn14nnvqEuX45Yd4dsrFc1CrN+T2M =G9hp -----END PGP SIGNATURE----- --vkogqOf2sHV7VnPd--