* [PATCH] btrfs: defrag: don't try to merge regular extents with preallocated extents
From: Qu Wenruo @ 2022-01-23 4:52 UTC
To: linux-btrfs

[BUG]
With older kernels (before v5.16), btrfs will defrag preallocated extents.
Newer kernels (v5.16 and later) will not defrag preallocated extents, but
they will still defrag the extent just before the preallocated one, even
if it's just a single sector.

This can be exposed by the following small script:

mkfs.btrfs -f $dev > /dev/null

mount $dev $mnt
xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
xfs_io -c "fiemap -v" $mnt/file
btrfs fi defrag $mnt/file
sync
xfs_io -c "fiemap -v" $mnt/file

The output looks like this on older kernels:

/mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..7]:          26624..26631         8   0x0
   1: [8..39]:         26632..26663        32 0x801
/mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..39]:         26664..26703        40   0x1

This defrags the single sector along with the preallocated extent and
replaces them with a regular extent at a new location (caused by data
COW).
This wastes most of the data IO just for the preallocated range.

On the other hand, v5.16 is slightly better:

/mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..7]:          26624..26631         8   0x0
   1: [8..39]:         26632..26663        32 0x801
/mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   0: [0..7]:          26664..26671         8   0x0
   1: [8..39]:         26632..26663        32 0x801

The preallocated range is not defragged, but the sector before it still
gets defragged, which is unnecessary.

[CAUSE]
One of the functions reused by both the old and the new code is
defrag_check_next_extent(), which determines whether we should defrag
the current extent by checking the next one.

It only checks if the next extent is a hole or inlined, but it doesn't
check if it's preallocated.

On the other hand, outside of this function, both old and new kernels
reject preallocated extents.

This inconsistency causes the behavior described above.

[FIX]
- Also check if the next extent is preallocated
  If so, don't defrag the current extent

- Add comments on each case where we don't defrag

This will reduce the IO caused by the defrag ioctl and autodefrag.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/ioctl.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 91ba2efe9792..dfa81b377e89 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1049,23 +1049,40 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
 	return em;
 }
 
+/*
+ * Return if current extent @em is a good candidate for defrag.
+ *
+ * This is done by checking against the next extent after @em.
+ */
 static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 				     bool locked)
 {
 	struct extent_map *next;
-	bool ret = true;
+	bool ret = false;
 
 	/* this is the last extent */
 	if (em->start + em->len >= i_size_read(inode))
-		return false;
+		return ret;
 
 	next = defrag_lookup_extent(inode, em->start + em->len, locked);
+	/* No next extent or a hole, no way to merge */
 	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
-		ret = false;
-	else if ((em->block_start + em->block_len == next->block_start) &&
-		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
-		ret = false;
+		goto out;
 
+	/* Next extent is preallocated, no sense to defrag current extent */
+	if (test_bit(EXTENT_FLAG_PREALLOC, &next->flags))
+		goto out;
+
+	/*
+	 * Next extent are not only mergable but also adjacent in their
+	 * logical address, normally an excellent candicate, but if they
+	 * are already large enough, then no need to defrag current extent.
+	 */
+	if ((em->block_start + em->block_len == next->block_start) &&
+	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
+		goto out;
+	ret = true;
+out:
 	free_extent_map(next);
 	return ret;
 }
-- 
2.34.1
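For reference when reading the fiemap output above: flag 0x800 is
FIEMAP_EXTENT_UNWRITTEN, i.e. a preallocated (fallocate'd) range, and 0x1
is FIEMAP_EXTENT_LAST. The following is a minimal user-space sketch, not
part of the patch, that reads the same information through the standard
FIEMAP ioctl; the file path and the extent count of 32 are arbitrary:

/*
 * Minimal FIEMAP query: list a file's extents and mark preallocated ones.
 * Build with: gcc -o fiemap-check fiemap-check.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Room for up to 32 extent records, plenty for the test file above. */
	size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);
	if (!fm)
		return 1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = 32;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("extent %u: logical %llu len %llu flags 0x%x%s\n", i,
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_length,
		       fe->fe_flags,
		       (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
				" (preallocated)" : "");
	}

	free(fm);
	close(fd);
	return 0;
}

Running it against $mnt/file before and after the defrag should show the
preallocated extent keeping the UNWRITTEN flag on v5.16+, matching the
xfs_io output above.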
* Re: [PATCH] btrfs: defrag: don't try to merge regular extents with preallocated extents
From: Filipe Manana @ 2022-01-24 12:19 UTC
To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Jan 23, 2022 at 12:52:42PM +0800, Qu Wenruo wrote:
> [BUG]
> With older kernels (before v5.16), btrfs will defrag preallocated extents.
> Newer kernels (v5.16 and later) will not defrag preallocated extents, but
> they will still defrag the extent just before the preallocated one, even
> if it's just a single sector.
>
> This can be exposed by the following small script:
>
> mkfs.btrfs -f $dev > /dev/null
>
> mount $dev $mnt
> xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
> xfs_io -c "fiemap -v" $mnt/file
> btrfs fi defrag $mnt/file
> sync
> xfs_io -c "fiemap -v" $mnt/file
>
> The output looks like this on older kernels:
>
> /mnt/btrfs/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..7]:          26624..26631         8   0x0
>    1: [8..39]:         26632..26663        32 0x801
> /mnt/btrfs/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..39]:         26664..26703        40   0x1
>
> This defrags the single sector along with the preallocated extent and
> replaces them with a regular extent at a new location (caused by data
> COW).
> This wastes most of the data IO just for the preallocated range.
>
> On the other hand, v5.16 is slightly better:
>
> /mnt/btrfs/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..7]:          26624..26631         8   0x0
>    1: [8..39]:         26632..26663        32 0x801
> /mnt/btrfs/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>    0: [0..7]:          26664..26671         8   0x0
>    1: [8..39]:         26632..26663        32 0x801
>
> The preallocated range is not defragged, but the sector before it still
> gets defragged, which is unnecessary.
>
> [CAUSE]
> One of the functions reused by both the old and the new code is
> defrag_check_next_extent(), which determines whether we should defrag
> the current extent by checking the next one.
>
> It only checks if the next extent is a hole or inlined, but it doesn't
> check if it's preallocated.
>
> On the other hand, outside of this function, both old and new kernels
> reject preallocated extents.
>
> This inconsistency causes the behavior described above.
>
> [FIX]
> - Also check if the next extent is preallocated
>   If so, don't defrag the current extent
>
> - Add comments on each case where we don't defrag
>
> This will reduce the IO caused by the defrag ioctl and autodefrag.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/ioctl.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 91ba2efe9792..dfa81b377e89 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -1049,23 +1049,40 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>  	return em;
>  }
>
> +/*
> + * Return if current extent @em is a good candidate for defrag.
> + *
> + * This is done by checking against the next extent after @em.
> + */
>  static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
>  				     bool locked)
>  {
>  	struct extent_map *next;
> -	bool ret = true;
> +	bool ret = false;
>
>  	/* this is the last extent */
>  	if (em->start + em->len >= i_size_read(inode))
> -		return false;
> +		return ret;
>
>  	next = defrag_lookup_extent(inode, em->start + em->len, locked);
> +	/* No next extent or a hole, no way to merge */
>  	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
> -		ret = false;
> -	else if ((em->block_start + em->block_len == next->block_start) &&
> -		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
> -		ret = false;
> +		goto out;
>
> +	/* Next extent is preallocated, no sense to defrag current extent */
> +	if (test_bit(EXTENT_FLAG_PREALLOC, &next->flags))
> +		goto out;
> +
> +	/*
> +	 * Next extent are not only mergable but also adjacent in their

are not -> is not
mergable -> mergeable
their -> its

> +	 * logical address, normally an excellent candicate, but if they

candicate -> candidate

> +	 * are already large enough, then no need to defrag current extent.
> +	 */

It still sounds a bit odd to me, maybe:

Next extent is mergeable and its logical address is contiguous with this
extent, so normally an excellent candidate, but if this extent or the next
one is already large enough, then we don't need to defrag. We use SZ_128K
because in case of enabled compression, extents can never be larger than
that.

Adding this comment is unrelated to this fix about prealloc extents, but I'm
fine with it.

Other than that it looks fine.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +	if ((em->block_start + em->block_len == next->block_start) &&
> +	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
> +		goto out;
> +	ret = true;
> +out:
>  	free_extent_map(next);
>  	return ret;
>  }
> --
> 2.34.1
>
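To make the suggested rationale concrete, the comment rewording applied to
the check might read roughly as follows. This is purely illustrative, not
text from the thread; the condition itself is unchanged from the patch:

	/*
	 * The next extent is mergeable and its logical address is contiguous
	 * with this extent, so it is normally an excellent candidate. But if
	 * this extent or the next one is already large enough, there is no
	 * need to defrag. We use SZ_128K because with compression enabled an
	 * extent can never be larger than that.
	 */
	if ((em->block_start + em->block_len == next->block_start) &&
	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
		goto out;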
* Re: [PATCH] btrfs: defrag: don't try to merge regular extents with preallocated extents
From: Qu Wenruo @ 2022-01-24 12:36 UTC
To: Filipe Manana; +Cc: linux-btrfs

On 2022/1/24 20:19, Filipe Manana wrote:
> On Sun, Jan 23, 2022 at 12:52:42PM +0800, Qu Wenruo wrote:
>> [BUG]
>> With older kernels (before v5.16), btrfs will defrag preallocated extents.
>> Newer kernels (v5.16 and later) will not defrag preallocated extents, but
>> they will still defrag the extent just before the preallocated one, even
>> if it's just a single sector.
>>
>> This can be exposed by the following small script:
>>
>> mkfs.btrfs -f $dev > /dev/null
>>
>> mount $dev $mnt
>> xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
>> xfs_io -c "fiemap -v" $mnt/file
>> btrfs fi defrag $mnt/file
>> sync
>> xfs_io -c "fiemap -v" $mnt/file
>>
>> The output looks like this on older kernels:
>>
>> /mnt/btrfs/file:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>>    0: [0..7]:          26624..26631         8   0x0
>>    1: [8..39]:         26632..26663        32 0x801
>> /mnt/btrfs/file:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>>    0: [0..39]:         26664..26703        40   0x1
>>
>> This defrags the single sector along with the preallocated extent and
>> replaces them with a regular extent at a new location (caused by data
>> COW).
>> This wastes most of the data IO just for the preallocated range.
>>
>> On the other hand, v5.16 is slightly better:
>>
>> /mnt/btrfs/file:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>>    0: [0..7]:          26624..26631         8   0x0
>>    1: [8..39]:         26632..26663        32 0x801
>> /mnt/btrfs/file:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>>    0: [0..7]:          26664..26671         8   0x0
>>    1: [8..39]:         26632..26663        32 0x801
>>
>> The preallocated range is not defragged, but the sector before it still
>> gets defragged, which is unnecessary.
>>
>> [CAUSE]
>> One of the functions reused by both the old and the new code is
>> defrag_check_next_extent(), which determines whether we should defrag
>> the current extent by checking the next one.
>>
>> It only checks if the next extent is a hole or inlined, but it doesn't
>> check if it's preallocated.
>>
>> On the other hand, outside of this function, both old and new kernels
>> reject preallocated extents.
>>
>> This inconsistency causes the behavior described above.
>>
>> [FIX]
>> - Also check if the next extent is preallocated
>>   If so, don't defrag the current extent
>>
>> - Add comments on each case where we don't defrag
>>
>> This will reduce the IO caused by the defrag ioctl and autodefrag.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/ioctl.c | 29 +++++++++++++++++++++++------
>>  1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 91ba2efe9792..dfa81b377e89 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -1049,23 +1049,40 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
>>  	return em;
>>  }
>>
>> +/*
>> + * Return if current extent @em is a good candidate for defrag.
>> + *
>> + * This is done by checking against the next extent after @em.
>> + */
>>  static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
>>  				     bool locked)
>>  {
>>  	struct extent_map *next;
>> -	bool ret = true;
>> +	bool ret = false;
>>
>>  	/* this is the last extent */
>>  	if (em->start + em->len >= i_size_read(inode))
>> -		return false;
>> +		return ret;
>>
>>  	next = defrag_lookup_extent(inode, em->start + em->len, locked);
>> +	/* No next extent or a hole, no way to merge */
>>  	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
>> -		ret = false;
>> -	else if ((em->block_start + em->block_len == next->block_start) &&
>> -		 (em->block_len > SZ_128K && next->block_len > SZ_128K))
>> -		ret = false;
>> +		goto out;
>>
>> +	/* Next extent is preallocated, no sense to defrag current extent */
>> +	if (test_bit(EXTENT_FLAG_PREALLOC, &next->flags))
>> +		goto out;
>> +
>> +	/*
>> +	 * Next extent are not only mergable but also adjacent in their
>
> are not -> is not
> mergable -> mergeable
> their -> its
>
>> +	 * logical address, normally an excellent candicate, but if they
>
> candicate -> candidate
>
>> +	 * are already large enough, then no need to defrag current extent.
>> +	 */
>
> It still sounds a bit odd to me, maybe:
>
> Next extent is mergeable and its logical address is contiguous with this
> extent, so normally an excellent candidate, but if this extent or the next
> one is already large enough, then we don't need to defrag. We use SZ_128K
> because in case of enabled compression, extents can never be larger than
> that.

In fact, I'm a little more concerned about the original condition now.

One thing is that the threshold here is hard-coded to 128K, while the
extent threshold for defrag can be specified by the ioctl caller.

Another thing is that the original condition checks block_start, and I'm
not sure we really need to check that.

As long as the next extent is not a hole or a preallocated one, we're
completely happy to defrag.

In fact, if the disk bytenr/num_bytes of @em is not adjacent to the next
extent, it's even better: we can merge them into one extent without an
extra seek.

So I tend to change the condition further, like this:

- Skip holes/preallocated extents
  That's already done in this patch

- Skip large extents, using the @threshold passed into this function
  No more hard-coded values; let the defrag caller have more control
  and get more consistent behavior.

- No more check on em::block_start
  There are some pros and cons to defragging file extents that are
  already physically adjacent:

  Pros:
  * Reduces the number of extents
    Which may be what defrag users want, e.g. to defrag extents caused
    by small but sequential direct IO. With a reduced number of
    extents, there is a slight chance of reducing mount time a little.

  Cons:
  * Extra IO with no saving in seek time

  So the existing check on em::block_start is already questionable, as
  it's not a clear win.

I know this sounds weird, especially after I have broken so much defrag
code, but I still want to remove these checks and replace them with more
reasonable ones, even if it means the behavior will change again.

Thanks,
Qu

>
> Adding this comment is unrelated to this fix about prealloc extents, but I'm
> fine with it.
>
> Other than that it looks fine.
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
>> +	if ((em->block_start + em->block_len == next->block_start) &&
>> +	    (em->block_len > SZ_128K && next->block_len > SZ_128K))
>> +		goto out;
>> +	ret = true;
>> +out:
>>  	free_extent_map(next);
>>  	return ret;
>>  }
>> --
>> 2.34.1
>>
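For illustration only, a rough sketch of the direction proposed above
might look like the following. This is hypothetical and was not part of
the posted patch; the extra @extent_thresh parameter, and plumbing it
from the defrag caller, are assumptions:

/*
 * Hypothetical sketch of the proposal: honor the caller's extent size
 * threshold instead of a hard-coded SZ_128K, skip holes and preallocated
 * extents, and drop the em->block_start adjacency check.
 */
static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
				     u32 extent_thresh, bool locked)
{
	struct extent_map *next;
	bool ret = false;

	/* This is the last extent */
	if (em->start + em->len >= i_size_read(inode))
		return false;

	next = defrag_lookup_extent(inode, em->start + em->len, locked);
	/* No next extent, or the next extent is a hole/inline extent */
	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
		goto out;
	/* Preallocated extents are never defragged, so don't merge with one */
	if (test_bit(EXTENT_FLAG_PREALLOC, &next->flags))
		goto out;
	/*
	 * If both extents already meet the caller-specified threshold,
	 * defragging them buys nothing.
	 */
	if (em->len >= extent_thresh && next->len >= extent_thresh)
		goto out;
	ret = true;
out:
	free_extent_map(next);
	return ret;
}

Compared to the patch above, this keeps the hole/preallocated checks but
replaces the block_start + SZ_128K test with a pure size test against the
caller's threshold, matching the three points listed in the reply.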
Thread overview: 3+ messages
  2022-01-23  4:52 [PATCH] btrfs: defrag: don't try to merge regular extents with preallocated extents Qu Wenruo
  2022-01-24 12:19 ` Filipe Manana
  2022-01-24 12:36   ` Qu Wenruo