* [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path
@ 2025-11-20 0:04 Qu Wenruo
2025-11-20 0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20 0:04 UTC (permalink / raw)
To: linux-btrfs
Although btrfs has had bs < ps support for a long time, and large data
folio support is also going to graduate from the experimental features
soon, the write path still iterates over each fs block and calls
btrfs_get_extent() on every one of them.
What makes the situation worse is that the write path does not have any
cached extent map, meaning that even with large folios, where we get a
contiguous range that can be submitted in one go, we still call
btrfs_get_extent() many times and fetch the same extent map again and
again.
This series reduces the duplicated btrfs_get_extent() calls by calling
it only once for each contiguous range, instead of once for each fs
block.
The first 3 patches are just minor cleanups/refactors; the last patch
is the real optimization.
I don't expect there will be much difference in the real world though.
Qu Wenruo (4):
btrfs: integrate the error handling of submit_one_sector()
btrfs: use bitmap_set() to replace set_bit() in a loop
btrfs: extract the io finishing code into a helper
btrfs: reduce extent map lookup during writes
fs/btrfs/extent_io.c | 243 ++++++++++++++++++++++++-------------------
1 file changed, 134 insertions(+), 109 deletions(-)
--
2.52.0
^ permalink raw reply	[flat|nested] 10+ messages in thread* [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() 2025-11-20 0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo @ 2025-11-20 0:04 ` Qu Wenruo 2025-11-20 0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo ` (2 subsequent siblings) 3 siblings, 0 replies; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 0:04 UTC (permalink / raw) To: linux-btrfs Currently submit_one_sector() has only one failure pattern from btrfs_get_extent(). However the error handling is split into two parts, one inside submit_one_sector(), which clears the dirty flag and finishes the writeback for the fs block. The other part of the error handling is to submit any remaining bio inside bio_ctrl and mark the ordered extent for the fs block as finished. There is no special reason to split the error handling, so let's just concentrate all the error handling inside submit_one_sector(). Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent_io.c | 29 +++++++++++++++-------------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 2d32dfc34ae3..a1dfc0902bfc 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1598,7 +1598,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, /* * Return 0 if we have submitted or queued the sector for submission. - * Return <0 for critical errors, and the sector will have its dirty flag cleared. + * Return <0 for critical errors, and the involved sector will be cleaned up. * * Caller should make sure filepos < i_size and handle filepos >= i_size case. */ @@ -1622,6 +1622,13 @@ static int submit_one_sector(struct btrfs_inode *inode, em = btrfs_get_extent(inode, NULL, filepos, sectorsize); if (IS_ERR(em)) { + /* + * bio_ctrl may contain a bio crossing several folios. 
+ * Submit it immediately so that the bio has a chance + * to finish normally, other than marked as error. + */ + submit_one_bio(bio_ctrl); + /* * When submission failed, we should still clear the folio dirty. * Or the folio will be written back again but without any @@ -1630,6 +1637,13 @@ static int submit_one_sector(struct btrfs_inode *inode, btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize); btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize); btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize); + + /* + * Since there is no bio submitted to finish the ordered + * extent, we have to manually finish this sector. + */ + btrfs_mark_ordered_io_finished(inode, folio, filepos, + fs_info->sectorsize, false); return PTR_ERR(em); } @@ -1756,19 +1770,6 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, } ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); if (unlikely(ret < 0)) { - /* - * bio_ctrl may contain a bio crossing several folios. - * Submit it immediately so that the bio has a chance - * to finish normally, other than marked as error. - */ - submit_one_bio(bio_ctrl); - /* - * Failed to grab the extent map which should be very rare. - * Since there is no bio submitted to finish the ordered - * extent, we have to manually finish this sector. - */ - btrfs_mark_ordered_io_finished(inode, folio, cur, - fs_info->sectorsize, false); if (!found_error) found_error = ret; continue; -- 2.52.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop 2025-11-20 0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo 2025-11-20 0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo @ 2025-11-20 0:04 ` Qu Wenruo 2025-11-20 0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo 2025-11-20 0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo 3 siblings, 0 replies; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 0:04 UTC (permalink / raw) To: linux-btrfs Inside extent_writepage_io(), we use the specified range to calculate @range_bitmap, which is later used to determine which blocks are going to be submitted. However we're calling set_bit() in a loop like the following: for (cur = start; cur < end; cur += fs_info->sectorsize) set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap); This is not optimal, as we know all bits for the range [start, start + len) can be set to 1 in one go. Replace the set_bit() loop with a proper bitmap_set() call. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent_io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index a1dfc0902bfc..4fc3b3d776ee 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1728,8 +1728,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, return ret; } - for (cur = start; cur < end; cur += fs_info->sectorsize) - set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap); + bitmap_set(&range_bitmap, (start - folio_pos(folio)) >> fs_info->sectorsize_bits, + len >> fs_info->sectorsize_bits); bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap, blocks_per_folio); -- 2.52.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/4] btrfs: extract the io finishing code into a helper 2025-11-20 0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo 2025-11-20 0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo 2025-11-20 0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo @ 2025-11-20 0:04 ` Qu Wenruo 2025-11-20 0:59 ` Boris Burkov 2025-11-20 0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo 3 siblings, 1 reply; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 0:04 UTC (permalink / raw) To: linux-btrfs Currently we have a code block that finishes IO for a range beyond i_size, deep inside the loop of extent_writepage_io(). Extract it into a helper, finish_io_beyond_eof(), to reduce the level of indentation. Furthermore, slightly change the parameter passed into the helper. Currently we fully finish the IO for the range beyond EOF, but that range may extend past the range [start, start + len), which means we may finish the IO for ranges we should not touch. So call finish_io_beyond_eof() only for the range we should touch. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++------------------- 1 file changed, 35 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 4fc3b3d776ee..cbee93a929f3 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode, return 0; } +static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, + u64 start, u32 len, loff_t i_size) +{ + struct btrfs_fs_info *fs_info = inode->root->fs_info; + struct btrfs_ordered_extent *ordered; + + ASSERT(start >= i_size); + + ordered = btrfs_lookup_first_ordered_range(inode, start, len); + + /* + * We have just run delalloc before getting here, so + * there must be an ordered extent. 
+ */ + ASSERT(ordered != NULL); + spin_lock(&inode->ordered_tree_lock); + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); + ordered->truncated_len = min(ordered->truncated_len, + start - ordered->file_offset); + spin_unlock(&inode->ordered_tree_lock); + btrfs_put_ordered_extent(ordered); + + btrfs_mark_ordered_io_finished(inode, folio, start, len, true); + /* + * This range is beyond i_size, thus we don't need to + * bother writing back. + * But we still need to clear the dirty subpage bit, or + * the next time the folio gets dirtied, we will try to + * writeback the sectors with subpage dirty bits, + * causing writeback without ordered extent. + */ + btrfs_folio_clear_dirty(fs_info, folio, start, len); +} + /* * Helper for extent_writepage(). This calls the writepage start hooks, * and does the loop to map the page into extents and bios. @@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits); if (cur >= i_size) { - struct btrfs_ordered_extent *ordered; - - ordered = btrfs_lookup_first_ordered_range(inode, cur, - folio_end - cur); - /* - * We have just run delalloc before getting here, so - * there must be an ordered extent. - */ - ASSERT(ordered != NULL); - spin_lock(&inode->ordered_tree_lock); - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); - ordered->truncated_len = min(ordered->truncated_len, - cur - ordered->file_offset); - spin_unlock(&inode->ordered_tree_lock); - btrfs_put_ordered_extent(ordered); - - btrfs_mark_ordered_io_finished(inode, folio, cur, - end - cur, true); - /* - * This range is beyond i_size, thus we don't need to - * bother writing back. - * But we still need to clear the dirty subpage bit, or - * the next time the folio gets dirtied, we will try to - * writeback the sectors with subpage dirty bits, - * causing writeback without ordered extent. 
- */ - btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur); + finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size); break; } ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); -- 2.52.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] btrfs: extract the io finishing code into a helper 2025-11-20 0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo @ 2025-11-20 0:59 ` Boris Burkov 2025-11-20 1:11 ` Qu Wenruo 0 siblings, 1 reply; 10+ messages in thread From: Boris Burkov @ 2025-11-20 0:59 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Thu, Nov 20, 2025 at 10:34:32AM +1030, Qu Wenruo wrote: > Currently we have a code block, which finishes IO for a range beyond > i_size, deep inside the loop of extent_writepage_io(). > > Extract it into a helper, finish_io_beyond_eof(), to reduce the level > of indents. > > Furthermore slightly change the parameter passed into the helper, > currently we fully finish the IO for the range beyond EOF, but that > range may be beyond the range [start, start + len), that means we may > finish the IO for ranges which we should not touch. I'm a little confused about this. I can't see any changes to the effective parameters to anything but btrfs_lookup_first_ordered_range() but that is getting the first ordered_extent So allowing more past [cur, end] via [cur, folio_end] shouldn't change what ordered_extent we get, as we are asserting that we get one with the narrower search, and we are getting the first one. the params to min(), btrfs_mark_ordered_io_finished(), and btrfs_clear_folio_dirty() seem unchanged. What am I missing? Thanks, Boris > > So call the finish_io_beyond_eof() only for the range we should touch. 
> > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++------------------- > 1 file changed, 35 insertions(+), 27 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index 4fc3b3d776ee..cbee93a929f3 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode, > return 0; > } > > +static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > + u64 start, u32 len, loff_t i_size) > +{ > + struct btrfs_fs_info *fs_info = inode->root->fs_info; > + struct btrfs_ordered_extent *ordered; > + > + ASSERT(start >= i_size); > + > + ordered = btrfs_lookup_first_ordered_range(inode, start, len); > + > + /* > + * We have just run delalloc before getting here, so > + * there must be an ordered extent. > + */ > + ASSERT(ordered != NULL); > + spin_lock(&inode->ordered_tree_lock); > + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); > + ordered->truncated_len = min(ordered->truncated_len, > + start - ordered->file_offset); > + spin_unlock(&inode->ordered_tree_lock); > + btrfs_put_ordered_extent(ordered); > + > + btrfs_mark_ordered_io_finished(inode, folio, start, len, true); > + /* > + * This range is beyond i_size, thus we don't need to > + * bother writing back. > + * But we still need to clear the dirty subpage bit, or > + * the next time the folio gets dirtied, we will try to > + * writeback the sectors with subpage dirty bits, > + * causing writeback without ordered extent. > + */ > + btrfs_folio_clear_dirty(fs_info, folio, start, len); > +} > + > /* > * Helper for extent_writepage(). This calls the writepage start hooks, > * and does the loop to map the page into extents and bios. 
> @@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits); > > if (cur >= i_size) { > - struct btrfs_ordered_extent *ordered; > - > - ordered = btrfs_lookup_first_ordered_range(inode, cur, > - folio_end - cur); > - /* > - * We have just run delalloc before getting here, so > - * there must be an ordered extent. > - */ > - ASSERT(ordered != NULL); > - spin_lock(&inode->ordered_tree_lock); > - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); > - ordered->truncated_len = min(ordered->truncated_len, > - cur - ordered->file_offset); > - spin_unlock(&inode->ordered_tree_lock); > - btrfs_put_ordered_extent(ordered); > - > - btrfs_mark_ordered_io_finished(inode, folio, cur, > - end - cur, true); > - /* > - * This range is beyond i_size, thus we don't need to > - * bother writing back. > - * But we still need to clear the dirty subpage bit, or > - * the next time the folio gets dirtied, we will try to > - * writeback the sectors with subpage dirty bits, > - * causing writeback without ordered extent. > - */ > - btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur); > + finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size); AFAICT, start + len is still end here, why not do end - cur? > break; > } > ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); > -- > 2.52.0 > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/4] btrfs: extract the io finishing code into a helper 2025-11-20 0:59 ` Boris Burkov @ 2025-11-20 1:11 ` Qu Wenruo 0 siblings, 0 replies; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 1:11 UTC (permalink / raw) To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs 在 2025/11/20 11:29, Boris Burkov 写道: > On Thu, Nov 20, 2025 at 10:34:32AM +1030, Qu Wenruo wrote: >> Currently we have a code block, which finishes IO for a range beyond >> i_size, deep inside the loop of extent_writepage_io(). >> >> Extract it into a helper, finish_io_beyond_eof(), to reduce the level >> of indents. >> >> Furthermore slightly change the parameter passed into the helper, >> currently we fully finish the IO for the range beyond EOF, but that >> range may be beyond the range [start, start + len), that means we may >> finish the IO for ranges which we should not touch. > > I'm a little confused about this. I can't see any changes to the > effective parameters to anything but btrfs_lookup_first_ordered_range() > but that is getting the first ordered_extent > > So allowing more past [cur, end] via [cur, folio_end] shouldn't change > what ordered_extent we get, as we are asserting that we get one with > the narrower search, and we are getting the first one. > > the params to min(), btrfs_mark_ordered_io_finished(), and > btrfs_clear_folio_dirty() seem unchanged. My bad, I was reading the original code wrong, assuming it's clearing the IO to @folio_end not @end. So this part is not changed. On the other hand, we can not rely on btrfs_lookup_first_ordered_range() to catch all OEs for a range. That part only works if there is only one OE for the range. If there are multiple ones, the remaining OEs are not truncated. Will update the commit message and the handling of finish_io_beyond_eof() to ensure all involved OEs are properly truncated. (This also affects the old code though). Thanks, Qu > > What am I missing? 
> > Thanks, > Boris > >> >> So call the finish_io_beyond_eof() only for the range we should touch. >> >> Signed-off-by: Qu Wenruo <wqu@suse.com> >> --- >> fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++------------------- >> 1 file changed, 35 insertions(+), 27 deletions(-) >> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c >> index 4fc3b3d776ee..cbee93a929f3 100644 >> --- a/fs/btrfs/extent_io.c >> +++ b/fs/btrfs/extent_io.c >> @@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode, >> return 0; >> } >> >> +static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > + u64 start, u32 len, loff_t i_size) >> +{ >> + struct btrfs_fs_info *fs_info = inode->root->fs_info; >> + struct btrfs_ordered_extent *ordered; >> + >> + ASSERT(start >= i_size); >> + >> + ordered = btrfs_lookup_first_ordered_range(inode, start, len); >> + >> + /* >> + * We have just run delalloc before getting here, so >> + * there must be an ordered extent. >> + */ >> + ASSERT(ordered != NULL); >> + spin_lock(&inode->ordered_tree_lock); >> + set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); >> + ordered->truncated_len = min(ordered->truncated_len, >> + start - ordered->file_offset); >> + spin_unlock(&inode->ordered_tree_lock); >> + btrfs_put_ordered_extent(ordered); >> + >> + btrfs_mark_ordered_io_finished(inode, folio, start, len, true); >> + /* >> + * This range is beyond i_size, thus we don't need to >> + * bother writing back. >> + * But we still need to clear the dirty subpage bit, or >> + * the next time the folio gets dirtied, we will try to >> + * writeback the sectors with subpage dirty bits, >> + * causing writeback without ordered extent. >> + */ >> + btrfs_folio_clear_dirty(fs_info, folio, start, len); >> +} >> + >> /* >> * Helper for extent_writepage(). This calls the writepage start hooks, >> * and does the loop to map the page into extents and bios. 
>> @@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, >> cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits); >> >> if (cur >= i_size) { >> - struct btrfs_ordered_extent *ordered; >> - >> - ordered = btrfs_lookup_first_ordered_range(inode, cur, >> - folio_end - cur); >> - /* >> - * We have just run delalloc before getting here, so >> - * there must be an ordered extent. >> - */ >> - ASSERT(ordered != NULL); >> - spin_lock(&inode->ordered_tree_lock); >> - set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags); >> - ordered->truncated_len = min(ordered->truncated_len, >> - cur - ordered->file_offset); >> - spin_unlock(&inode->ordered_tree_lock); >> - btrfs_put_ordered_extent(ordered); >> - >> - btrfs_mark_ordered_io_finished(inode, folio, cur, >> - end - cur, true); >> - /* >> - * This range is beyond i_size, thus we don't need to >> - * bother writing back. >> - * But we still need to clear the dirty subpage bit, or >> - * the next time the folio gets dirtied, we will try to >> - * writeback the sectors with subpage dirty bits, >> - * causing writeback without ordered extent. >> - */ >> - btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur); >> + finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size); > > AFAICT, start + len is still end here, why not do end - cur? > >> break; >> } >> ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); >> -- >> 2.52.0 >> > ^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 4/4] btrfs: reduce extent map lookup during writes 2025-11-20 0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo ` (2 preceding siblings ...) 2025-11-20 0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo @ 2025-11-20 0:04 ` Qu Wenruo 2025-11-20 1:16 ` Boris Burkov 3 siblings, 1 reply; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 0:04 UTC (permalink / raw) To: linux-btrfs With large data folio support, even on x86_64 we can hit a folio that contains several fs blocks. In that case, we still need to call btrfs_get_extent() for each block, as our submission path still iterates over each fs block and submits them one by one. This reduces the benefit of large folios. Change the behavior to submit the whole range when possible. This is done by: - Use for_each_set_bitrange() instead of for_each_set_bit() Now we can get a contiguous range to submit instead of a single fs block. - Handle blocks beyond EOF in one go This is pretty much the same as the old behavior, but for a range crossing i_size, we finish the range beyond i_size first, then submit the remaining part. - Submit the contiguous range in one go Although we still need to consider the extent map boundary. - Remove submit_one_sector() As it's no longer utilized. Signed-off-by: Qu Wenruo <wqu@suse.com> --- fs/btrfs/extent_io.c | 208 +++++++++++++++++++++++-------------------- 1 file changed, 112 insertions(+), 96 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cbee93a929f3..152eed265a3c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1596,94 +1596,6 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, return 0; } -/* - * Return 0 if we have submitted or queued the sector for submission. - * Return <0 for critical errors, and the involved sector will be cleaned up. - * - * Caller should make sure filepos < i_size and handle filepos >= i_size case. 
- */ -static int submit_one_sector(struct btrfs_inode *inode, - struct folio *folio, - u64 filepos, struct btrfs_bio_ctrl *bio_ctrl, - loff_t i_size) -{ - struct btrfs_fs_info *fs_info = inode->root->fs_info; - struct extent_map *em; - u64 block_start; - u64 disk_bytenr; - u64 extent_offset; - u64 em_end; - const u32 sectorsize = fs_info->sectorsize; - - ASSERT(IS_ALIGNED(filepos, sectorsize)); - - /* @filepos >= i_size case should be handled by the caller. */ - ASSERT(filepos < i_size); - - em = btrfs_get_extent(inode, NULL, filepos, sectorsize); - if (IS_ERR(em)) { - /* - * bio_ctrl may contain a bio crossing several folios. - * Submit it immediately so that the bio has a chance - * to finish normally, other than marked as error. - */ - submit_one_bio(bio_ctrl); - - /* - * When submission failed, we should still clear the folio dirty. - * Or the folio will be written back again but without any - * ordered extent. - */ - btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize); - btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize); - btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize); - - /* - * Since there is no bio submitted to finish the ordered - * extent, we have to manually finish this sector. - */ - btrfs_mark_ordered_io_finished(inode, folio, filepos, - fs_info->sectorsize, false); - return PTR_ERR(em); - } - - extent_offset = filepos - em->start; - em_end = btrfs_extent_map_end(em); - ASSERT(filepos <= em_end); - ASSERT(IS_ALIGNED(em->start, sectorsize)); - ASSERT(IS_ALIGNED(em->len, sectorsize)); - - block_start = btrfs_extent_map_block_start(em); - disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset; - - ASSERT(!btrfs_extent_map_is_compressed(em)); - ASSERT(block_start != EXTENT_MAP_HOLE); - ASSERT(block_start != EXTENT_MAP_INLINE); - - btrfs_free_extent_map(em); - em = NULL; - - /* - * Although the PageDirty bit is cleared before entering this - * function, subpage dirty bit is not cleared. 
- * So clear subpage dirty bit here so next time we won't submit - * a folio for a range already written to disk. - */ - btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize); - btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize); - /* - * Above call should set the whole folio with writeback flag, even - * just for a single subpage sector. - * As long as the folio is properly locked and the range is correct, - * we should always get the folio with writeback flag. - */ - ASSERT(folio_test_writeback(folio)); - - submit_extent_folio(bio_ctrl, disk_bytenr, folio, - sectorsize, filepos - folio_pos(folio), 0); - return 0; -} - static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, u64 start, u32 len, loff_t i_size) { @@ -1718,6 +1630,96 @@ static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, btrfs_folio_clear_dirty(fs_info, folio, start, len); } +/* + * Return 0 if we have submitted or queued the range for submission. + * Return <0 for critical errors, and the involved blocks will be cleaned up. + * + * Caller should make sure the range doesn't go beyond the last block of the inode. + */ +static int submit_range(struct btrfs_inode *inode, struct folio *folio, + u64 start, u32 len, struct btrfs_bio_ctrl *bio_ctrl) +{ + struct btrfs_fs_info *fs_info = inode->root->fs_info; + const u32 sectorsize = fs_info->sectorsize; + u64 cur = start; + + ASSERT(IS_ALIGNED(start, sectorsize)); + ASSERT(IS_ALIGNED(len, sectorsize)); + ASSERT(start + len <= folio_end(folio)); + + while (cur < start + len) { + struct extent_map *em; + u64 block_start; + u64 disk_bytenr; + u64 extent_offset; + u64 em_end; + u32 cur_len = start + len - cur; + + em = btrfs_get_extent(inode, NULL, cur, sectorsize); + if (IS_ERR(em)) { + /* + * bio_ctrl may contain a bio crossing several folios. + * Submit it immediately so that the bio has a chance + * to finish normally, other than marked as error. 
+ */ + submit_one_bio(bio_ctrl); + + /* + * When submission failed, we should still clear the folio dirty. + * Or the folio will be written back again but without any + * ordered extent. + */ + btrfs_folio_clear_dirty(fs_info, folio, cur, len); + btrfs_folio_set_writeback(fs_info, folio, cur, len); + btrfs_folio_clear_writeback(fs_info, folio, cur, len); + + /* + * Since there is no bio submitted to finish the ordered + * extent, we have to manually finish this sector. + */ + btrfs_mark_ordered_io_finished(inode, folio, cur, len, false); + return PTR_ERR(em); + } + extent_offset = cur - em->start; + em_end = btrfs_extent_map_end(em); + ASSERT(cur <= em_end); + ASSERT(IS_ALIGNED(em->start, sectorsize)); + ASSERT(IS_ALIGNED(em->len, sectorsize)); + + block_start = btrfs_extent_map_block_start(em); + disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset; + + ASSERT(!btrfs_extent_map_is_compressed(em)); + ASSERT(block_start != EXTENT_MAP_HOLE); + ASSERT(block_start != EXTENT_MAP_INLINE); + + cur_len = min(cur_len, em_end - cur); + btrfs_free_extent_map(em); + em = NULL; + + /* + * Although the PageDirty bit is cleared before entering this + * function, subpage dirty bit is not cleared. + * So clear subpage dirty bit here so next time we won't submit + * a folio for a range already written to disk. + */ + btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len); + btrfs_folio_set_writeback(fs_info, folio, cur, cur_len); + /* + * Above call should set the whole folio with writeback flag, even + * just for a single subpage block. + * As long as the folio is properly locked and the range is correct, + * we should always get the folio with writeback flag. + */ + ASSERT(folio_test_writeback(folio)); + + submit_extent_folio(bio_ctrl, disk_bytenr, folio, + cur_len, cur - folio_pos(folio), 0); + cur += cur_len; + } + return 0; +} + /* * Helper for extent_writepage(). This calls the writepage start hooks, * and does the loop to map the page into extents and bios. 
@@ -1740,8 +1742,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, const u64 folio_start = folio_pos(folio); const u64 folio_end = folio_start + folio_size(folio); const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); - u64 cur; - int bit; + unsigned int start_bit; + unsigned int end_bit; + const u64 rounded_isize = round_up(i_size, fs_info->sectorsize); int ret = 0; ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start); @@ -1769,14 +1772,27 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, bio_ctrl->end_io_func = end_bbio_data_write; - for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { - cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits); + for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { + const u64 cur_start = folio_pos(folio) + (start_bit << fs_info->sectorsize_bits); + u32 cur_len = (end_bit - start_bit) << fs_info->sectorsize_bits; - if (cur >= i_size) { - finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size); - break; + if (cur_start > rounded_isize) { + /* + * The whole range is beyond EOF. + * + * Just finish the IO and skip to the next range. + */ + finish_io_beyond_eof(inode, folio, cur_start, cur_len, i_size); + continue; } - ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); + if (cur_start + cur_len > rounded_isize) { + /* The trailing part of the range is beyond EOF. */ + finish_io_beyond_eof(inode, folio, rounded_isize, + cur_start + cur_len - rounded_isize, i_size); + /* Shrink the range to stay inside EOF. */ + cur_len = rounded_isize - cur_start; + } + ret = submit_range(inode, folio, cur_start, cur_len, bio_ctrl); if (unlikely(ret < 0)) { if (!found_error) found_error = ret; -- 2.52.0 ^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes 2025-11-20 0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo @ 2025-11-20 1:16 ` Boris Burkov 2025-11-20 5:28 ` Qu Wenruo 0 siblings, 1 reply; 10+ messages in thread From: Boris Burkov @ 2025-11-20 1:16 UTC (permalink / raw) To: Qu Wenruo; +Cc: linux-btrfs On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote: > With large data folios supports, even on x86_64 we can hit a folio that > contains several fs blocks. > > In that case, we still need to call btrfs_get_extent() for each block, > as our submission path is still iterating each fs block and submit them > one by one. This reduces the benefit of large folios. > > Change the behavior to submit the whole range when possible, this is > done by: > > - Use for_each_set_bitrange() instead of for_each_set_bit() > Now we can get a contiguous range to submit instead of a single fs > block. > > - Handle blocks beyond EOF in one go > This is pretty much the same as the old behavior, but for a range > crossing i_size, we finish the range beyond i_size first, then submit > the remaining. > > - Submit the contiguous range in one go > Although we still need to consider the extent map boundary. > > - Remove submit_one_sector() > As it's no longer utilized. > This is very cool, quite excited about it. A couple questions inline but the approach looks good to me overall. Also, I was curious and looked at writepage_delalloc and that also uses for_each_set_bit(...) though with simpler stuff in the loop (btrfs_folio_set_lock) though I wonder if that can be simplified as well. 
Thanks, Boris > Signed-off-by: Qu Wenruo <wqu@suse.com> > --- > fs/btrfs/extent_io.c | 208 +++++++++++++++++++++++-------------------- > 1 file changed, 112 insertions(+), 96 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index cbee93a929f3..152eed265a3c 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -1596,94 +1596,6 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode, > return 0; > } > > -/* > - * Return 0 if we have submitted or queued the sector for submission. > - * Return <0 for critical errors, and the involved sector will be cleaned up. > - * > - * Caller should make sure filepos < i_size and handle filepos >= i_size case. > - */ > -static int submit_one_sector(struct btrfs_inode *inode, > - struct folio *folio, > - u64 filepos, struct btrfs_bio_ctrl *bio_ctrl, > - loff_t i_size) > -{ > - struct btrfs_fs_info *fs_info = inode->root->fs_info; > - struct extent_map *em; > - u64 block_start; > - u64 disk_bytenr; > - u64 extent_offset; > - u64 em_end; > - const u32 sectorsize = fs_info->sectorsize; > - > - ASSERT(IS_ALIGNED(filepos, sectorsize)); > - > - /* @filepos >= i_size case should be handled by the caller. */ > - ASSERT(filepos < i_size); > - > - em = btrfs_get_extent(inode, NULL, filepos, sectorsize); > - if (IS_ERR(em)) { > - /* > - * bio_ctrl may contain a bio crossing several folios. > - * Submit it immediately so that the bio has a chance > - * to finish normally, other than marked as error. > - */ > - submit_one_bio(bio_ctrl); > - > - /* > - * When submission failed, we should still clear the folio dirty. > - * Or the folio will be written back again but without any > - * ordered extent. 
> - */ > - btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize); > - btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize); > - btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize); > - > - /* > - * Since there is no bio submitted to finish the ordered > - * extent, we have to manually finish this sector. > - */ > - btrfs_mark_ordered_io_finished(inode, folio, filepos, > - fs_info->sectorsize, false); > - return PTR_ERR(em); > - } > - > - extent_offset = filepos - em->start; > - em_end = btrfs_extent_map_end(em); > - ASSERT(filepos <= em_end); > - ASSERT(IS_ALIGNED(em->start, sectorsize)); > - ASSERT(IS_ALIGNED(em->len, sectorsize)); > - > - block_start = btrfs_extent_map_block_start(em); > - disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset; > - > - ASSERT(!btrfs_extent_map_is_compressed(em)); > - ASSERT(block_start != EXTENT_MAP_HOLE); > - ASSERT(block_start != EXTENT_MAP_INLINE); > - > - btrfs_free_extent_map(em); > - em = NULL; > - > - /* > - * Although the PageDirty bit is cleared before entering this > - * function, subpage dirty bit is not cleared. > - * So clear subpage dirty bit here so next time we won't submit > - * a folio for a range already written to disk. > - */ > - btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize); > - btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize); > - /* > - * Above call should set the whole folio with writeback flag, even > - * just for a single subpage sector. > - * As long as the folio is properly locked and the range is correct, > - * we should always get the folio with writeback flag. 
> - */ > - ASSERT(folio_test_writeback(folio)); > - > - submit_extent_folio(bio_ctrl, disk_bytenr, folio, > - sectorsize, filepos - folio_pos(folio), 0); > - return 0; > -} > - > static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > u64 start, u32 len, loff_t i_size) > { > @@ -1718,6 +1630,96 @@ static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > btrfs_folio_clear_dirty(fs_info, folio, start, len); > } > > +/* > + * Return 0 if we have submitted or queued the range for submission. > + * Return <0 for critical errors, and the involved blocks will be cleaned up. > + * > + * Caller should make sure the range doesn't go beyond the last block of the inode. > + */ > +static int submit_range(struct btrfs_inode *inode, struct folio *folio, > + u64 start, u32 len, struct btrfs_bio_ctrl *bio_ctrl) > +{ > + struct btrfs_fs_info *fs_info = inode->root->fs_info; > + const u32 sectorsize = fs_info->sectorsize; > + u64 cur = start; > + > + ASSERT(IS_ALIGNED(start, sectorsize)); > + ASSERT(IS_ALIGNED(len, sectorsize)); > + ASSERT(start + len <= folio_end(folio)); > + > + while (cur < start + len) { > + struct extent_map *em; > + u64 block_start; > + u64 disk_bytenr; > + u64 extent_offset; > + u64 em_end; > + u32 cur_len = start + len - cur; > + > + em = btrfs_get_extent(inode, NULL, cur, sectorsize); why is sectorsize still relevant? Just a small size to grab an overlapping em or something more meaningful? > + if (IS_ERR(em)) { > + /* > + * bio_ctrl may contain a bio crossing several folios. > + * Submit it immediately so that the bio has a chance > + * to finish normally, other than marked as error. > + */ > + submit_one_bio(bio_ctrl); > + > + /* > + * When submission failed, we should still clear the folio dirty. > + * Or the folio will be written back again but without any > + * ordered extent. 
> + */ > + btrfs_folio_clear_dirty(fs_info, folio, cur, len); > + btrfs_folio_set_writeback(fs_info, folio, cur, len); > + btrfs_folio_clear_writeback(fs_info, folio, cur, len); cur, cur_len? or start, len? cur, len feels wrong. > + > + /* > + * Since there is no bio submitted to finish the ordered > + * extent, we have to manually finish this sector. > + */ > + btrfs_mark_ordered_io_finished(inode, folio, cur, len, false); Same question about cur, len; also I don't know if the comment referring to "this sector" still makes sense. Manually finish this ordered_extent or something? > + return PTR_ERR(em); > + } > + extent_offset = cur - em->start; > + em_end = btrfs_extent_map_end(em); > + ASSERT(cur <= em_end); > + ASSERT(IS_ALIGNED(em->start, sectorsize)); > + ASSERT(IS_ALIGNED(em->len, sectorsize)); > + > + block_start = btrfs_extent_map_block_start(em); > + disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset; > + > + ASSERT(!btrfs_extent_map_is_compressed(em)); > + ASSERT(block_start != EXTENT_MAP_HOLE); > + ASSERT(block_start != EXTENT_MAP_INLINE); > + > + cur_len = min(cur_len, em_end - cur); > + btrfs_free_extent_map(em); > + em = NULL; > + > + /* > + * Although the PageDirty bit is cleared before entering this > + * function, subpage dirty bit is not cleared. > + * So clear subpage dirty bit here so next time we won't submit > + * a folio for a range already written to disk. > + */ > + btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len); > + btrfs_folio_set_writeback(fs_info, folio, cur, cur_len); > + /* > + * Above call should set the whole folio with writeback flag, even > + * just for a single subpage block. > + * As long as the folio is properly locked and the range is correct, > + * we should always get the folio with writeback flag. 
> + */ > + ASSERT(folio_test_writeback(folio)); > + > + submit_extent_folio(bio_ctrl, disk_bytenr, folio, > + cur_len, cur - folio_pos(folio), 0); > + cur += cur_len; > + } > + return 0; > +} > + > /* > * Helper for extent_writepage(). This calls the writepage start hooks, > * and does the loop to map the page into extents and bios. > @@ -1740,8 +1742,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > const u64 folio_start = folio_pos(folio); > const u64 folio_end = folio_start + folio_size(folio); > const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio); > - u64 cur; > - int bit; > + unsigned int start_bit; > + unsigned int end_bit; > + const u64 rounded_isize = round_up(i_size, fs_info->sectorsize); > int ret = 0; > > ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start); > @@ -1769,14 +1772,27 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode, > > bio_ctrl->end_io_func = end_bbio_data_write; > > - for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { > - cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits); > + for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap, blocks_per_folio) { > + const u64 cur_start = folio_pos(folio) + (start_bit << fs_info->sectorsize_bits); > + u32 cur_len = (end_bit - start_bit) << fs_info->sectorsize_bits; > > - if (cur >= i_size) { > - finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size); > - break; > + if (cur_start > rounded_isize) { > + /* > + * The whole range is byoned EOF. sic: beyond > + * > + * Just finish the IO and skip to the next range. > + */ > + finish_io_beyond_eof(inode, folio, cur_start, cur_len, i_size); > + continue; > } > - ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size); > + if (cur_start + cur_len > rounded_isize) { > + /* The tailing part of the range is beyond EOF. 
*/ > + finish_io_beyond_eof(inode, folio, rounded_isize, > + cur_start + cur_len - rounded_isize, i_size); > + /* Shrink the range inside the EOF. */ > + cur_len = rounded_isize - cur_start; > + } > + ret = submit_range(inode, folio, cur_start, cur_len, bio_ctrl); > if (unlikely(ret < 0)) { > if (!found_error) > found_error = ret; > -- > 2.52.0 > ^ permalink raw reply [flat|nested] 10+ messages in thread
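The i_size handling reviewed above splits each bitrange at the block-aligned EOF: anything past round_up(i_size, sectorsize) only gets its IO finished, the rest is submitted. A minimal userspace sketch of that split follows; the types and names are illustrative, not btrfs code, and it treats a range starting exactly at the rounded EOF as fully beyond it.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t round_up_u64(uint64_t val, uint64_t align)
{
	return (val + align - 1) / align * align;
}

/*
 * Return how many bytes of [start, start + len) fall inside the
 * block-aligned EOF. The remainder would only have its IO finished,
 * mirroring the finish_io_beyond_eof() calls in the patch.
 */
static uint32_t submit_len_inside_eof(uint64_t start, uint32_t len,
				      uint64_t i_size, uint64_t sectorsize)
{
	uint64_t rounded_isize = round_up_u64(i_size, sectorsize);

	if (start >= rounded_isize)
		return 0;			/* whole range beyond EOF */
	if (start + len > rounded_isize)
		return rounded_isize - start;	/* trailing part beyond EOF */
	return len;				/* fully inside EOF */
}
```

For example, with 4K blocks and i_size = 5000, a dirty range covering [4K, 12K) submits only its first block, while a range starting at 8K is finished without any submission at all.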
* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes 2025-11-20 1:16 ` Boris Burkov @ 2025-11-20 5:28 ` Qu Wenruo 2025-11-20 9:00 ` Qu Wenruo 0 siblings, 1 reply; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 5:28 UTC (permalink / raw) To: Boris Burkov; +Cc: linux-btrfs On 2025/11/20 11:46, Boris Burkov wrote: > On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote: >> With large data folios supports, even on x86_64 we can hit a folio that >> contains several fs blocks. >> >> In that case, we still need to call btrfs_get_extent() for each block, >> as our submission path is still iterating each fs block and submit them >> one by one. This reduces the benefit of large folios. >> >> Change the behavior to submit the whole range when possible, this is >> done by: >> >> - Use for_each_set_bitrange() instead of for_each_set_bit() >> Now we can get a contiguous range to submit instead of a single fs >> block. >> >> - Handle blocks beyond EOF in one go >> This is pretty much the same as the old behavior, but for a range >> crossing i_size, we finish the range beyond i_size first, then submit >> the remaining. >> >> - Submit the contiguous range in one go >> Although we still need to consider the extent map boundary. >> >> - Remove submit_one_sector() >> As it's no longer utilized. >> > > This is very cool, quite excited about it. A couple questions inline but > the approach looks good to me overall. > > Also, I was curious and looked at writepage_delalloc and that also uses > for_each_set_bit(...) though with simpler stuff in the loop > (btrfs_folio_set_lock) though I wonder if that can be simplified as well. Thanks for catching that one. I'll check for any similar call sites and do the conversion in the next update. > > Thanks, > Boris > [...] 
>> + while (cur < start + len) { >> + struct extent_map *em; >> + u64 block_start; >> + u64 disk_bytenr; >> + u64 extent_offset; >> + u64 em_end; >> + u32 cur_len = start + len - cur; >> + >> + em = btrfs_get_extent(inode, NULL, cur, sectorsize); > > why is sectorsize still relevant? Just a small size to grab an > overlapping em or something more meaningful? This is in fact a pretty important piece of error handling for some corner cases. Normally we should always have an extent map for the dirty ranges, but we can hit a situation like the following: 0 16K 32K Dirty range: |//////////////////////| Em range: | |///////////| If we call btrfs_get_extent() with cur == 0, len == 32K, we can get an em at 16K, and that can screw up our calculation. So here we should keep the sectorsize as len to catch the above situation. I'll add a comment about the hidden error handling situation. > >> + if (IS_ERR(em)) { >> + /* >> + * bio_ctrl may contain a bio crossing several folios. >> + * Submit it immediately so that the bio has a chance >> + * to finish normally, other than marked as error. >> + */ >> + submit_one_bio(bio_ctrl); >> + >> + /* >> + * When submission failed, we should still clear the folio dirty. >> + * Or the folio will be written back again but without any >> + * ordered extent. >> + */ >> + btrfs_folio_clear_dirty(fs_info, folio, cur, len); >> + btrfs_folio_set_writeback(fs_info, folio, cur, len); >> + btrfs_folio_clear_writeback(fs_info, folio, cur, len); > > cur, cur_len? or start, len? cur, len feels wrong Oh, thanks for catching this one, it should be @cur, @cur_len. > >> + >> + /* >> + * Since there is no bio submitted to finish the ordered >> + * extent, we have to manually finish this sector. >> + */ >> + btrfs_mark_ordered_io_finished(inode, folio, cur, len, false); > > Same question about cur, len; also I don't know if the comment referring > to "this sector" still makes sense. Manually finish this ordered_extent > or something? I'll fix the comment too. 
Thanks, Qu >> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes 2025-11-20 5:28 ` Qu Wenruo @ 2025-11-20 9:00 ` Qu Wenruo 0 siblings, 0 replies; 10+ messages in thread From: Qu Wenruo @ 2025-11-20 9:00 UTC (permalink / raw) To: Boris Burkov; +Cc: linux-btrfs On 2025/11/20 15:58, Qu Wenruo wrote: > > > On 2025/11/20 11:46, Boris Burkov wrote: >> On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote: >>> With large data folios supports, even on x86_64 we can hit a folio that >>> contains several fs blocks. >>> >>> In that case, we still need to call btrfs_get_extent() for each block, >>> as our submission path is still iterating each fs block and submit them >>> one by one. This reduces the benefit of large folios. >>> >>> Change the behavior to submit the whole range when possible, this is >>> done by: >>> >>> - Use for_each_set_bitrange() instead of for_each_set_bit() >>> Now we can get a contiguous range to submit instead of a single fs >>> block. >>> >>> - Handle blocks beyond EOF in one go >>> This is pretty much the same as the old behavior, but for a range >>> crossing i_size, we finish the range beyond i_size first, then submit >>> the remaining. >>> >>> - Submit the contiguous range in one go >>> Although we still need to consider the extent map boundary. >>> >>> - Remove submit_one_sector() >>> As it's no longer utilized. >>> >> >> This is very cool, quite excited about it. A couple questions inline but >> the approach looks good to me overall. >> >> Also, I was curious and looked at writepage_delalloc and that also uses >> for_each_set_bit(...) though with simpler stuff in the loop >> (btrfs_folio_set_lock) though I wonder if that can be simplified as well. > > Thanks for catching that one. > > I'll check for any similar call sites and do the conversion in the next > update. > >> >> Thanks, >> Boris >> > [...] 
>>> + while (cur < start + len) { >>> + struct extent_map *em; >>> + u64 block_start; >>> + u64 disk_bytenr; >>> + u64 extent_offset; >>> + u64 em_end; >>> + u32 cur_len = start + len - cur; >>> + >>> + em = btrfs_get_extent(inode, NULL, cur, sectorsize); >> >> why is sectorsize still relevant? Just a small size to grab an >> overlapping em or something more meaningful? > > This is in fact a pretty important piece of error handling for some corner cases. > > Normally we should always have an extent map for the dirty ranges, but > we can hit a situation like the following: > > 0 16K 32K > Dirty range: |//////////////////////| > Em range: | |///////////| > > If we call btrfs_get_extent() with cur == 0, len == 32K, we can get an > em at 16K, and that can screw up our calculation. > > So here we should keep the sectorsize as len to catch the above situation. > > I'll add a comment about the hidden error handling situation. I'm an idiot, the existing code can handle it by properly returning a HOLE em. To be honest, I think the length parameter is only making the function more confusing than needed. I'll look into whether we can remove the parameter completely. Thanks, Qu > >> >>> + if (IS_ERR(em)) { >>> + /* >>> + * bio_ctrl may contain a bio crossing several folios. >>> + * Submit it immediately so that the bio has a chance >>> + * to finish normally, other than marked as error. >>> + */ >>> + submit_one_bio(bio_ctrl); >>> + >>> + /* >>> + * When submission failed, we should still clear the >>> folio dirty. >>> + * Or the folio will be written back again but without any >>> + * ordered extent. >>> + */ >>> + btrfs_folio_clear_dirty(fs_info, folio, cur, len); >>> + btrfs_folio_set_writeback(fs_info, folio, cur, len); >>> + btrfs_folio_clear_writeback(fs_info, folio, cur, len); >> >> cur, cur_len? or start, len? cur, len feels wrong > > Oh, thanks for catching this one, it should be @cur, @cur_len. > > . 
>> >>> + >>> + /* >>> + * Since there is no bio submitted to finish the ordered >>> + * extent, we have to manually finish this sector. >>> + */ >>> + btrfs_mark_ordered_io_finished(inode, folio, cur, len, >>> false); >> >> Same question about cur, len; also I don't know if the comment referring >> to "this sector" still makes sense. Manually finish this ordered_extent >> or something? > > I'll fix the comment too. > > Thanks, > Qu > >>> > ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, newest: 2025-11-20 9:00 UTC

Thread overview: 10+ messages
2025-11-20 0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
2025-11-20 0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
2025-11-20 0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo
2025-11-20 0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo
2025-11-20 0:59 ` Boris Burkov
2025-11-20 1:11 ` Qu Wenruo
2025-11-20 0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
2025-11-20 1:16 ` Boris Burkov
2025-11-20 5:28 ` Qu Wenruo
2025-11-20 9:00 ` Qu Wenruo