[PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path
@ 2025-11-20  0:04 Qu Wenruo
  2025-11-20  0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  0:04 UTC (permalink / raw)
  To: linux-btrfs

Although btrfs has bs < ps support for a long time, and the larger data
folios support is also going to be graduate from experimental features
soon, the write path is still iterating each fs block and call
btrfs_get_extent() on each fs block.

What makes the situation worse is that, for the write path we do not
have any cached extent map, meaning even with large folios and we got a
continuous range that can be submitted in one go, we still call
btrfs_get_extent() many times and get the same range extent map again
and again.

This series will reduce the duplicated btrfs_get_extent() calls by only
call it once for each range, other than for each fs block.

The first 3 patches are just minor cleanups/refactors, the last patch
is the real optimization.

I don't expect there will be much difference in the real world though.

Qu Wenruo (4):
  btrfs: integrate the error handling of submit_one_sector()
  btrfs: use bitmap_set() to replace set_bit() in a loop
  btrfs: extract the io finishing code into a helper
  btrfs: reduce extent map lookup during writes

 fs/btrfs/extent_io.c | 243 ++++++++++++++++++++++++-------------------
 1 file changed, 134 insertions(+), 109 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector()
  2025-11-20  0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
@ 2025-11-20  0:04 ` Qu Wenruo
  2025-11-20  0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  0:04 UTC (permalink / raw)
  To: linux-btrfs

Currently submit_one_sector() has only one failure pattern from
btrfs_get_extent().

However the error handling is split into two part, one inside
submit_one_sector(), which clear the dirty flag and finish the writeback
for the fs block.

The other part of the error handling is to submit any remaining bio
inside bio_ctrl and mark the ordered extent for the fs block as
finished.

There is no special reason that we must split the error handling, let's
just concentrace all the error handling inside submit_one_sector().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d32dfc34ae3..a1dfc0902bfc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1598,7 +1598,7 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 
 /*
  * Return 0 if we have submitted or queued the sector for submission.
- * Return <0 for critical errors, and the sector will have its dirty flag cleared.
+ * Return <0 for critical errors, and the involved sector will be cleaned up.
  *
  * Caller should make sure filepos < i_size and handle filepos >= i_size case.
  */
@@ -1622,6 +1622,13 @@ static int submit_one_sector(struct btrfs_inode *inode,
 
 	em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
 	if (IS_ERR(em)) {
+		/*
+		 * bio_ctrl may contain a bio crossing several folios.
+		 * Submit it immediately so that the bio has a chance
+		 * to finish normally, other than marked as error.
+		 */
+		submit_one_bio(bio_ctrl);
+
 		/*
 		 * When submission failed, we should still clear the folio dirty.
 		 * Or the folio will be written back again but without any
@@ -1630,6 +1637,13 @@ static int submit_one_sector(struct btrfs_inode *inode,
 		btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
 		btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
 		btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
+
+		/*
+		 * Since there is no bio submitted to finish the ordered
+		 * extent, we have to manually finish this sector.
+		 */
+		btrfs_mark_ordered_io_finished(inode, folio, filepos,
+					       fs_info->sectorsize, false);
 		return PTR_ERR(em);
 	}
 
@@ -1756,19 +1770,6 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
 		if (unlikely(ret < 0)) {
-			/*
-			 * bio_ctrl may contain a bio crossing several folios.
-			 * Submit it immediately so that the bio has a chance
-			 * to finish normally, other than marked as error.
-			 */
-			submit_one_bio(bio_ctrl);
-			/*
-			 * Failed to grab the extent map which should be very rare.
-			 * Since there is no bio submitted to finish the ordered
-			 * extent, we have to manually finish this sector.
-			 */
-			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       fs_info->sectorsize, false);
 			if (!found_error)
 				found_error = ret;
 			continue;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop
  2025-11-20  0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
  2025-11-20  0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
@ 2025-11-20  0:04 ` Qu Wenruo
  2025-11-20  0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo
  2025-11-20  0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
  3 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  0:04 UTC (permalink / raw)
  To: linux-btrfs

Inside extent_writepage_io(), we use the specified range to calculate
@range_bitmap, which is later to determine which blocks are going to be
submitted.

However we're calling set_bit() in a loop like the following:

	for (cur = start; cur < end; cur += fs_info->sectorsize)
		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);

Which is not optimal, as we know all bits for the range [start, start +
len) can be set to 1 in one go.

Replace the set_bit() in a loop with a proper bitmap_set() call.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a1dfc0902bfc..4fc3b3d776ee 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1728,8 +1728,8 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		return ret;
 	}
 
-	for (cur = start; cur < end; cur += fs_info->sectorsize)
-		set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
+	bitmap_set(&range_bitmap, (start - folio_pos(folio)) >> fs_info->sectorsize_bits,
+		   len >> fs_info->sectorsize_bits);
 	bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
 		   blocks_per_folio);
 
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/4] btrfs: extract the io finishing code into a helper
  2025-11-20  0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
  2025-11-20  0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
  2025-11-20  0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo
@ 2025-11-20  0:04 ` Qu Wenruo
  2025-11-20  0:59   ` Boris Burkov
  2025-11-20  0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
  3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  0:04 UTC (permalink / raw)
  To: linux-btrfs

Currently we have a code block, which finishes IO for a range beyond
i_size, deep inside the loop of extent_writepage_io().

Extract it into a helper, finish_io_beyond_eof(), to reduce the level
of indents.

Furthermore slightly change the parameter passed into the helper,
currently we fully finish the IO for the range beyond EOF, but that
range may be beyond the range [start, start + len), that means we may
finish the IO for ranges which we should not touch.

So call the finish_io_beyond_eof() only for the range we should touch.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4fc3b3d776ee..cbee93a929f3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode,
 	return 0;
 }
 
+static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio,
+				 u64 start, u32 len, loff_t i_size)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	struct btrfs_ordered_extent *ordered;
+
+	ASSERT(start >= i_size);
+
+	ordered = btrfs_lookup_first_ordered_range(inode, start, len);
+
+	/*
+	 * We have just run delalloc before getting here, so
+	 * there must be an ordered extent.
+	 */
+	ASSERT(ordered != NULL);
+	spin_lock(&inode->ordered_tree_lock);
+	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
+	ordered->truncated_len = min(ordered->truncated_len,
+				     start - ordered->file_offset);
+	spin_unlock(&inode->ordered_tree_lock);
+	btrfs_put_ordered_extent(ordered);
+
+	btrfs_mark_ordered_io_finished(inode, folio, start, len, true);
+	/*
+	 * This range is beyond i_size, thus we don't need to
+	 * bother writing back.
+	 * But we still need to clear the dirty subpage bit, or
+	 * the next time the folio gets dirtied, we will try to
+	 * writeback the sectors with subpage dirty bits,
+	 * causing writeback without ordered extent.
+	 */
+	btrfs_folio_clear_dirty(fs_info, folio, start, len);
+}
+
 /*
  * Helper for extent_writepage().  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
 
 		if (cur >= i_size) {
-			struct btrfs_ordered_extent *ordered;
-
-			ordered = btrfs_lookup_first_ordered_range(inode, cur,
-								   folio_end - cur);
-			/*
-			 * We have just run delalloc before getting here, so
-			 * there must be an ordered extent.
-			 */
-			ASSERT(ordered != NULL);
-			spin_lock(&inode->ordered_tree_lock);
-			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-			ordered->truncated_len = min(ordered->truncated_len,
-						     cur - ordered->file_offset);
-			spin_unlock(&inode->ordered_tree_lock);
-			btrfs_put_ordered_extent(ordered);
-
-			btrfs_mark_ordered_io_finished(inode, folio, cur,
-						       end - cur, true);
-			/*
-			 * This range is beyond i_size, thus we don't need to
-			 * bother writing back.
-			 * But we still need to clear the dirty subpage bit, or
-			 * the next time the folio gets dirtied, we will try to
-			 * writeback the sectors with subpage dirty bits,
-			 * causing writeback without ordered extent.
-			 */
-			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
+			finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size);
 			break;
 		}
 		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/4] btrfs: extract the io finishing code into a helper
  2025-11-20  0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo
@ 2025-11-20  0:59   ` Boris Burkov
  2025-11-20  1:11     ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2025-11-20  0:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Nov 20, 2025 at 10:34:32AM +1030, Qu Wenruo wrote:
> Currently we have a code block, which finishes IO for a range beyond
> i_size, deep inside the loop of extent_writepage_io().
> 
> Extract it into a helper, finish_io_beyond_eof(), to reduce the level
> of indents.
> 
> Furthermore slightly change the parameter passed into the helper,
> currently we fully finish the IO for the range beyond EOF, but that
> range may be beyond the range [start, start + len), that means we may
> finish the IO for ranges which we should not touch.

I'm a little confused about this. I can't see any changes to the
effective parameters to anything but btrfs_lookup_first_ordered_range()
but that is getting the first ordered_extent

So allowing more past [cur, end] via [cur, folio_end] shouldn't change
what ordered_extent we get, as we are asserting that we get one with
the narrower search, and we are getting the first one.

the params to min(), btrfs_mark_ordered_io_finished(), and
btrfs_clear_folio_dirty() seem unchanged.

What am I missing?

Thanks,
Boris

> 
> So call the finish_io_beyond_eof() only for the range we should touch.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++-------------------
>  1 file changed, 35 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 4fc3b3d776ee..cbee93a929f3 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode,
>  	return 0;
>  }
>  
> +static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > +				 u64 start, u32 len, loff_t i_size)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	struct btrfs_ordered_extent *ordered;
> +
> +	ASSERT(start >= i_size);
> +
> +	ordered = btrfs_lookup_first_ordered_range(inode, start, len);
> +
> +	/*
> +	 * We have just run delalloc before getting here, so
> +	 * there must be an ordered extent.
> +	 */
> +	ASSERT(ordered != NULL);
> +	spin_lock(&inode->ordered_tree_lock);
> +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> +	ordered->truncated_len = min(ordered->truncated_len,
> +				     start - ordered->file_offset);
> +	spin_unlock(&inode->ordered_tree_lock);
> +	btrfs_put_ordered_extent(ordered);
> +
> +	btrfs_mark_ordered_io_finished(inode, folio, start, len, true);
> +	/*
> +	 * This range is beyond i_size, thus we don't need to
> +	 * bother writing back.
> +	 * But we still need to clear the dirty subpage bit, or
> +	 * the next time the folio gets dirtied, we will try to
> +	 * writeback the sectors with subpage dirty bits,
> +	 * causing writeback without ordered extent.
> +	 */
> +	btrfs_folio_clear_dirty(fs_info, folio, start, len);
> +}
> +
>  /*
>   * Helper for extent_writepage().  This calls the writepage start hooks,
>   * and does the loop to map the page into extents and bios.
> @@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
>  
>  		if (cur >= i_size) {
> -			struct btrfs_ordered_extent *ordered;
> -
> -			ordered = btrfs_lookup_first_ordered_range(inode, cur,
> -								   folio_end - cur);
> -			/*
> -			 * We have just run delalloc before getting here, so
> -			 * there must be an ordered extent.
> -			 */
> -			ASSERT(ordered != NULL);
> -			spin_lock(&inode->ordered_tree_lock);
> -			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
> -			ordered->truncated_len = min(ordered->truncated_len,
> -						     cur - ordered->file_offset);
> -			spin_unlock(&inode->ordered_tree_lock);
> -			btrfs_put_ordered_extent(ordered);
> -
> -			btrfs_mark_ordered_io_finished(inode, folio, cur,
> -						       end - cur, true);
> -			/*
> -			 * This range is beyond i_size, thus we don't need to
> -			 * bother writing back.
> -			 * But we still need to clear the dirty subpage bit, or
> -			 * the next time the folio gets dirtied, we will try to
> -			 * writeback the sectors with subpage dirty bits,
> -			 * causing writeback without ordered extent.
> -			 */
> -			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
> +			finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size);

AFAICT, start + len is still end here, why not do end - cur?

>  			break;
>  		}
>  		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/4] btrfs: extract the io finishing code into a helper
  2025-11-20  0:59   ` Boris Burkov
@ 2025-11-20  1:11     ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  1:11 UTC (permalink / raw)
  To: Boris Burkov, Qu Wenruo; +Cc: linux-btrfs



在 2025/11/20 11:29, Boris Burkov 写道:
> On Thu, Nov 20, 2025 at 10:34:32AM +1030, Qu Wenruo wrote:
>> Currently we have a code block, which finishes IO for a range beyond
>> i_size, deep inside the loop of extent_writepage_io().
>>
>> Extract it into a helper, finish_io_beyond_eof(), to reduce the level
>> of indents.
>>
>> Furthermore slightly change the parameter passed into the helper,
>> currently we fully finish the IO for the range beyond EOF, but that
>> range may be beyond the range [start, start + len), that means we may
>> finish the IO for ranges which we should not touch.
> 
> I'm a little confused about this. I can't see any changes to the
> effective parameters to anything but btrfs_lookup_first_ordered_range()
> but that is getting the first ordered_extent
> 
> So allowing more past [cur, end] via [cur, folio_end] shouldn't change
> what ordered_extent we get, as we are asserting that we get one with
> the narrower search, and we are getting the first one.
> 
> the params to min(), btrfs_mark_ordered_io_finished(), and
> btrfs_clear_folio_dirty() seem unchanged.

My bad, I was reading the original code wrong, assuming it's clearing 
the IO to @folio_end not @end.

So this part is not changed.


On the other hand, we can not rely on btrfs_lookup_first_ordered_range() 
to catch all OEs for a range.
That part only works if there is only one OE for the range.
If there are multiple ones, the remaining OEs are not truncated.

Will update the commit message and the handling of 
finish_io_beyond_eof() to ensure all involved OEs are properly truncated.
(This also affects  the old code though).

Thanks,
Qu

> 
> What am I missing?
> 
> Thanks,
> Boris
> 
>>
>> So call the finish_io_beyond_eof() only for the range we should touch.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/extent_io.c | 62 +++++++++++++++++++++++++-------------------
>>   1 file changed, 35 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 4fc3b3d776ee..cbee93a929f3 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1684,6 +1684,40 @@ static int submit_one_sector(struct btrfs_inode *inode,
>>   	return 0;
>>   }
>>   
>> +static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio, > +				 u64 start, u32 len, loff_t i_size)
>> +{
>> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
>> +	struct btrfs_ordered_extent *ordered;
>> +
>> +	ASSERT(start >= i_size);
>> +
>> +	ordered = btrfs_lookup_first_ordered_range(inode, start, len);
>> +
>> +	/*
>> +	 * We have just run delalloc before getting here, so
>> +	 * there must be an ordered extent.
>> +	 */
>> +	ASSERT(ordered != NULL);
>> +	spin_lock(&inode->ordered_tree_lock);
>> +	set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> +	ordered->truncated_len = min(ordered->truncated_len,
>> +				     start - ordered->file_offset);
>> +	spin_unlock(&inode->ordered_tree_lock);
>> +	btrfs_put_ordered_extent(ordered);
>> +
>> +	btrfs_mark_ordered_io_finished(inode, folio, start, len, true);
>> +	/*
>> +	 * This range is beyond i_size, thus we don't need to
>> +	 * bother writing back.
>> +	 * But we still need to clear the dirty subpage bit, or
>> +	 * the next time the folio gets dirtied, we will try to
>> +	 * writeback the sectors with subpage dirty bits,
>> +	 * causing writeback without ordered extent.
>> +	 */
>> +	btrfs_folio_clear_dirty(fs_info, folio, start, len);
>> +}
>> +
>>   /*
>>    * Helper for extent_writepage().  This calls the writepage start hooks,
>>    * and does the loop to map the page into extents and bios.
>> @@ -1739,33 +1773,7 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>>   		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
>>   
>>   		if (cur >= i_size) {
>> -			struct btrfs_ordered_extent *ordered;
>> -
>> -			ordered = btrfs_lookup_first_ordered_range(inode, cur,
>> -								   folio_end - cur);
>> -			/*
>> -			 * We have just run delalloc before getting here, so
>> -			 * there must be an ordered extent.
>> -			 */
>> -			ASSERT(ordered != NULL);
>> -			spin_lock(&inode->ordered_tree_lock);
>> -			set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
>> -			ordered->truncated_len = min(ordered->truncated_len,
>> -						     cur - ordered->file_offset);
>> -			spin_unlock(&inode->ordered_tree_lock);
>> -			btrfs_put_ordered_extent(ordered);
>> -
>> -			btrfs_mark_ordered_io_finished(inode, folio, cur,
>> -						       end - cur, true);
>> -			/*
>> -			 * This range is beyond i_size, thus we don't need to
>> -			 * bother writing back.
>> -			 * But we still need to clear the dirty subpage bit, or
>> -			 * the next time the folio gets dirtied, we will try to
>> -			 * writeback the sectors with subpage dirty bits,
>> -			 * causing writeback without ordered extent.
>> -			 */
>> -			btrfs_folio_clear_dirty(fs_info, folio, cur, end - cur);
>> +			finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size);
> 
> AFAICT, start + len is still end here, why not do end - cur?
> 
>>   			break;
>>   		}
>>   		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
>> -- 
>> 2.52.0
>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/4] btrfs: reduce extent map lookup during writes
  2025-11-20  0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
                   ` (2 preceding siblings ...)
  2025-11-20  0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo
@ 2025-11-20  0:04 ` Qu Wenruo
  2025-11-20  1:16   ` Boris Burkov
  3 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  0:04 UTC (permalink / raw)
  To: linux-btrfs

With large data folios supports, even on x86_64 we can hit a folio that
contains several fs blocks.

In that case, we still need to call btrfs_get_extent() for each block,
as our submission path is still iterating each fs block and submit them
one by one. This reduces the benefit of large folios.

Change the behavior to submit the whole range when possible, this is
done by:

- Use for_each_set_bitrange() instead of for_each_set_bit()
  Now we can get a contiguous range to submit instead of a single fs
  block.

- Handle blocks beyond EOF in one go
  This is pretty much the same as the old behavior, but for a range
  crossing i_size, we finish the range beyond i_size first, then submit
  the remaining.

- Submit the contiguous range in one go
  Although we still need to consider the extent map boundary.

- Remove submit_one_sector()
  As it's no longer utilized.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 208 +++++++++++++++++++++++--------------------
 1 file changed, 112 insertions(+), 96 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cbee93a929f3..152eed265a3c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1596,94 +1596,6 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 	return 0;
 }
 
-/*
- * Return 0 if we have submitted or queued the sector for submission.
- * Return <0 for critical errors, and the involved sector will be cleaned up.
- *
- * Caller should make sure filepos < i_size and handle filepos >= i_size case.
- */
-static int submit_one_sector(struct btrfs_inode *inode,
-			     struct folio *folio,
-			     u64 filepos, struct btrfs_bio_ctrl *bio_ctrl,
-			     loff_t i_size)
-{
-	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	struct extent_map *em;
-	u64 block_start;
-	u64 disk_bytenr;
-	u64 extent_offset;
-	u64 em_end;
-	const u32 sectorsize = fs_info->sectorsize;
-
-	ASSERT(IS_ALIGNED(filepos, sectorsize));
-
-	/* @filepos >= i_size case should be handled by the caller. */
-	ASSERT(filepos < i_size);
-
-	em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
-	if (IS_ERR(em)) {
-		/*
-		 * bio_ctrl may contain a bio crossing several folios.
-		 * Submit it immediately so that the bio has a chance
-		 * to finish normally, other than marked as error.
-		 */
-		submit_one_bio(bio_ctrl);
-
-		/*
-		 * When submission failed, we should still clear the folio dirty.
-		 * Or the folio will be written back again but without any
-		 * ordered extent.
-		 */
-		btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
-		btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
-		btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
-
-		/*
-		 * Since there is no bio submitted to finish the ordered
-		 * extent, we have to manually finish this sector.
-		 */
-		btrfs_mark_ordered_io_finished(inode, folio, filepos,
-					       fs_info->sectorsize, false);
-		return PTR_ERR(em);
-	}
-
-	extent_offset = filepos - em->start;
-	em_end = btrfs_extent_map_end(em);
-	ASSERT(filepos <= em_end);
-	ASSERT(IS_ALIGNED(em->start, sectorsize));
-	ASSERT(IS_ALIGNED(em->len, sectorsize));
-
-	block_start = btrfs_extent_map_block_start(em);
-	disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
-
-	ASSERT(!btrfs_extent_map_is_compressed(em));
-	ASSERT(block_start != EXTENT_MAP_HOLE);
-	ASSERT(block_start != EXTENT_MAP_INLINE);
-
-	btrfs_free_extent_map(em);
-	em = NULL;
-
-	/*
-	 * Although the PageDirty bit is cleared before entering this
-	 * function, subpage dirty bit is not cleared.
-	 * So clear subpage dirty bit here so next time we won't submit
-	 * a folio for a range already written to disk.
-	 */
-	btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
-	btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
-	/*
-	 * Above call should set the whole folio with writeback flag, even
-	 * just for a single subpage sector.
-	 * As long as the folio is properly locked and the range is correct,
-	 * we should always get the folio with writeback flag.
-	 */
-	ASSERT(folio_test_writeback(folio));
-
-	submit_extent_folio(bio_ctrl, disk_bytenr, folio,
-			    sectorsize, filepos - folio_pos(folio), 0);
-	return 0;
-}
-
 static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio,
 				 u64 start, u32 len, loff_t i_size)
 {
@@ -1718,6 +1630,96 @@ static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio,
 	btrfs_folio_clear_dirty(fs_info, folio, start, len);
 }
 
+/*
+ * Return 0 if we have submitted or queued the range for submission.
+ * Return <0 for critical errors, and the involved blocks will be cleaned up.
+ *
+ * Caller should make sure the range doesn't go beyond the last block of the inode.
+ */
+static int submit_range(struct btrfs_inode *inode, struct folio *folio,
+			u64 start, u32 len, struct btrfs_bio_ctrl *bio_ctrl)
+{
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
+	const u32 sectorsize = fs_info->sectorsize;
+	u64 cur = start;
+
+	ASSERT(IS_ALIGNED(start, sectorsize));
+	ASSERT(IS_ALIGNED(len, sectorsize));
+	ASSERT(start + len <= folio_end(folio));
+
+	while (cur < start + len) {
+		struct extent_map *em;
+		u64 block_start;
+		u64 disk_bytenr;
+		u64 extent_offset;
+		u64 em_end;
+		u32 cur_len = start + len - cur;
+
+		em = btrfs_get_extent(inode, NULL, cur, sectorsize);
+		if (IS_ERR(em)) {
+			/*
+			 * bio_ctrl may contain a bio crossing several folios.
+			 * Submit it immediately so that the bio has a chance
+			 * to finish normally, other than marked as error.
+			 */
+			submit_one_bio(bio_ctrl);
+
+			/*
+			 * When submission failed, we should still clear the folio dirty.
+			 * Or the folio will be written back again but without any
+			 * ordered extent.
+			 */
+			btrfs_folio_clear_dirty(fs_info, folio, cur, len);
+			btrfs_folio_set_writeback(fs_info, folio, cur, len);
+			btrfs_folio_clear_writeback(fs_info, folio, cur, len);
+
+			/*
+			 * Since there is no bio submitted to finish the ordered
+			 * extent, we have to manually finish this sector.
+			 */
+			btrfs_mark_ordered_io_finished(inode, folio, cur, len, false);
+			return PTR_ERR(em);
+		}
+		extent_offset = cur - em->start;
+		em_end = btrfs_extent_map_end(em);
+		ASSERT(cur <= em_end);
+		ASSERT(IS_ALIGNED(em->start, sectorsize));
+		ASSERT(IS_ALIGNED(em->len, sectorsize));
+
+		block_start = btrfs_extent_map_block_start(em);
+		disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
+
+		ASSERT(!btrfs_extent_map_is_compressed(em));
+		ASSERT(block_start != EXTENT_MAP_HOLE);
+		ASSERT(block_start != EXTENT_MAP_INLINE);
+
+		cur_len = min(cur_len, em_end - cur);
+		btrfs_free_extent_map(em);
+		em = NULL;
+
+		/*
+		 * Although the PageDirty bit is cleared before entering this
+		 * function, subpage dirty bit is not cleared.
+		 * So clear subpage dirty bit here so next time we won't submit
+		 * a folio for a range already written to disk.
+		 */
+		btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
+		btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+		/*
+		 * Above call should set the whole folio with writeback flag, even
+		 * just for a single subpage block.
+		 * As long as the folio is properly locked and the range is correct,
+		 * we should always get the folio with writeback flag.
+		 */
+		ASSERT(folio_test_writeback(folio));
+
+		submit_extent_folio(bio_ctrl, disk_bytenr, folio,
+				    cur_len, cur - folio_pos(folio), 0);
+		cur += cur_len;
+	}
+	return 0;
+}
+
 /*
  * Helper for extent_writepage().  This calls the writepage start hooks,
  * and does the loop to map the page into extents and bios.
@@ -1740,8 +1742,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 	const u64 folio_start = folio_pos(folio);
 	const u64 folio_end = folio_start + folio_size(folio);
 	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
-	u64 cur;
-	int bit;
+	unsigned int start_bit;
+	unsigned int end_bit;
+	const u64 rounded_isize = round_up(i_size, fs_info->sectorsize);
 	int ret = 0;
 
 	ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start);
@@ -1769,14 +1772,27 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
 
 	bio_ctrl->end_io_func = end_bbio_data_write;
 
-	for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
-		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
+	for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
+		const u64 cur_start = folio_pos(folio) + (start_bit << fs_info->sectorsize_bits);
+		u32 cur_len = (end_bit - start_bit) << fs_info->sectorsize_bits;
 
-		if (cur >= i_size) {
-			finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size);
-			break;
+		if (cur_start > rounded_isize) {
+			/*
+			 * The whole range is byoned EOF.
+			 *
+			 * Just finish the IO and skip to the next range.
+			 */
+			finish_io_beyond_eof(inode, folio, cur_start, cur_len, i_size);
+			continue;
 		}
-		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
+		if (cur_start + cur_len > rounded_isize) {
+			/* The tailing part of the range is beyond EOF. */
+			finish_io_beyond_eof(inode, folio, rounded_isize,
+					     cur_start + cur_len - rounded_isize, i_size);
+			/* Shrink the range inside the EOF. */
+			cur_len = rounded_isize - cur_start;
+		}
+		ret = submit_range(inode, folio, cur_start, cur_len, bio_ctrl);
 		if (unlikely(ret < 0)) {
 			if (!found_error)
 				found_error = ret;
-- 
2.52.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes
  2025-11-20  0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
@ 2025-11-20  1:16   ` Boris Burkov
  2025-11-20  5:28     ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Boris Burkov @ 2025-11-20  1:16 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote:
> With large data folios supports, even on x86_64 we can hit a folio that
> contains several fs blocks.
> 
> In that case, we still need to call btrfs_get_extent() for each block,
> as our submission path is still iterating each fs block and submit them
> one by one. This reduces the benefit of large folios.
> 
> Change the behavior to submit the whole range when possible, this is
> done by:
> 
> - Use for_each_set_bitrange() instead of for_each_set_bit()
>   Now we can get a contiguous range to submit instead of a single fs
>   block.
> 
> - Handle blocks beyond EOF in one go
>   This is pretty much the same as the old behavior, but for a range
>   crossing i_size, we finish the range beyond i_size first, then submit
>   the remaining.
> 
> - Submit the contiguous range in one go
>   Although we still need to consider the extent map boundary.
> 
> - Remove submit_one_sector()
>   As it's no longer utilized.
> 

This is very cool, quite excited about it. A couple questions inline but
the approach looks good to me overall.

Also, I was curious and looked at writepage_delalloc and that also uses
for_each_set_bit(...) though with simpler stuff in the loop
(btrfs_folio_set_lock) though I wonder if that can be simplified as well.

Thanks,
Boris

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 208 +++++++++++++++++++++++--------------------
>  1 file changed, 112 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cbee93a929f3..152eed265a3c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1596,94 +1596,6 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  	return 0;
>  }
>  
> -/*
> - * Return 0 if we have submitted or queued the sector for submission.
> - * Return <0 for critical errors, and the involved sector will be cleaned up.
> - *
> - * Caller should make sure filepos < i_size and handle filepos >= i_size case.
> - */
> -static int submit_one_sector(struct btrfs_inode *inode,
> -			     struct folio *folio,
> -			     u64 filepos, struct btrfs_bio_ctrl *bio_ctrl,
> -			     loff_t i_size)
> -{
> -	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	struct extent_map *em;
> -	u64 block_start;
> -	u64 disk_bytenr;
> -	u64 extent_offset;
> -	u64 em_end;
> -	const u32 sectorsize = fs_info->sectorsize;
> -
> -	ASSERT(IS_ALIGNED(filepos, sectorsize));
> -
> -	/* @filepos >= i_size case should be handled by the caller. */
> -	ASSERT(filepos < i_size);
> -
> -	em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
> -	if (IS_ERR(em)) {
> -		/*
> -		 * bio_ctrl may contain a bio crossing several folios.
> -		 * Submit it immediately so that the bio has a chance
> -		 * to finish normally, other than marked as error.
> -		 */
> -		submit_one_bio(bio_ctrl);
> -
> -		/*
> -		 * When submission failed, we should still clear the folio dirty.
> -		 * Or the folio will be written back again but without any
> -		 * ordered extent.
> -		 */
> -		btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
> -		btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
> -		btrfs_folio_clear_writeback(fs_info, folio, filepos, sectorsize);
> -
> -		/*
> -		 * Since there is no bio submitted to finish the ordered
> -		 * extent, we have to manually finish this sector.
> -		 */
> -		btrfs_mark_ordered_io_finished(inode, folio, filepos,
> -					       fs_info->sectorsize, false);
> -		return PTR_ERR(em);
> -	}
> -
> -	extent_offset = filepos - em->start;
> -	em_end = btrfs_extent_map_end(em);
> -	ASSERT(filepos <= em_end);
> -	ASSERT(IS_ALIGNED(em->start, sectorsize));
> -	ASSERT(IS_ALIGNED(em->len, sectorsize));
> -
> -	block_start = btrfs_extent_map_block_start(em);
> -	disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
> -
> -	ASSERT(!btrfs_extent_map_is_compressed(em));
> -	ASSERT(block_start != EXTENT_MAP_HOLE);
> -	ASSERT(block_start != EXTENT_MAP_INLINE);
> -
> -	btrfs_free_extent_map(em);
> -	em = NULL;
> -
> -	/*
> -	 * Although the PageDirty bit is cleared before entering this
> -	 * function, subpage dirty bit is not cleared.
> -	 * So clear subpage dirty bit here so next time we won't submit
> -	 * a folio for a range already written to disk.
> -	 */
> -	btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
> -	btrfs_folio_set_writeback(fs_info, folio, filepos, sectorsize);
> -	/*
> -	 * Above call should set the whole folio with writeback flag, even
> -	 * just for a single subpage sector.
> -	 * As long as the folio is properly locked and the range is correct,
> -	 * we should always get the folio with writeback flag.
> -	 */
> -	ASSERT(folio_test_writeback(folio));
> -
> -	submit_extent_folio(bio_ctrl, disk_bytenr, folio,
> -			    sectorsize, filepos - folio_pos(folio), 0);
> -	return 0;
> -}
> -
>  static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio,
>  				 u64 start, u32 len, loff_t i_size)
>  {
> @@ -1718,6 +1630,96 @@ static void finish_io_beyond_eof(struct btrfs_inode *inode, struct folio *folio,
>  	btrfs_folio_clear_dirty(fs_info, folio, start, len);
>  }
>  
> +/*
> + * Return 0 if we have submitted or queued the range for submission.
> + * Return <0 for critical errors, and the involved blocks will be cleaned up.
> + *
> + * Caller should make sure the range doesn't go beyond the last block of the inode.
> + */
> +static int submit_range(struct btrfs_inode *inode, struct folio *folio,
> +			u64 start, u32 len, struct btrfs_bio_ctrl *bio_ctrl)
> +{
> +	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> +	const u32 sectorsize = fs_info->sectorsize;
> +	u64 cur = start;
> +
> +	ASSERT(IS_ALIGNED(start, sectorsize));
> +	ASSERT(IS_ALIGNED(len, sectorsize));
> +	ASSERT(start + len <= folio_end(folio));
> +
> +	while (cur < start + len) {
> +		struct extent_map *em;
> +		u64 block_start;
> +		u64 disk_bytenr;
> +		u64 extent_offset;
> +		u64 em_end;
> +		u32 cur_len = start + len - cur;
> +
> +		em = btrfs_get_extent(inode, NULL, cur, sectorsize);

why is sectorsize still relevant? Just a small size to grab an
overlapping em or something more meaningful?

> +		if (IS_ERR(em)) {
> +			/*
> +			 * bio_ctrl may contain a bio crossing several folios.
> +			 * Submit it immediately so that the bio has a chance
> +			 * to finish normally, other than marked as error.
> +			 */
> +			submit_one_bio(bio_ctrl);
> +
> +			/*
> +			 * When submission failed, we should still clear the folio dirty.
> +			 * Or the folio will be written back again but without any
> +			 * ordered extent.
> +			 */
> +			btrfs_folio_clear_dirty(fs_info, folio, cur, len);
> +			btrfs_folio_set_writeback(fs_info, folio, cur, len);
> +			btrfs_folio_clear_writeback(fs_info, folio, cur, len);

cur, cur_len? or start, len? cur, len feels wrong.

> +
> +			/*
> +			 * Since there is no bio submitted to finish the ordered
> +			 * extent, we have to manually finish this sector.
> +			 */
> +			btrfs_mark_ordered_io_finished(inode, folio, cur, len, false);

Same question about cur, len; also I don't know if the comment referring
to "this sector" still makes sense. Manually finish this ordered_extent
or something?

> +			return PTR_ERR(em);
> +		}
> +		extent_offset = cur - em->start;
> +		em_end = btrfs_extent_map_end(em);
> +		ASSERT(cur <= em_end);
> +		ASSERT(IS_ALIGNED(em->start, sectorsize));
> +		ASSERT(IS_ALIGNED(em->len, sectorsize));
> +
> +		block_start = btrfs_extent_map_block_start(em);
> +		disk_bytenr = btrfs_extent_map_block_start(em) + extent_offset;
> +
> +		ASSERT(!btrfs_extent_map_is_compressed(em));
> +		ASSERT(block_start != EXTENT_MAP_HOLE);
> +		ASSERT(block_start != EXTENT_MAP_INLINE);
> +
> +		cur_len = min(cur_len, em_end - cur);
> +		btrfs_free_extent_map(em);
> +		em = NULL;
> +
> +		/*
> +		 * Although the PageDirty bit is cleared before entering this
> +		 * function, subpage dirty bit is not cleared.
> +		 * So clear subpage dirty bit here so next time we won't submit
> +		 * a folio for a range already written to disk.
> +		 */
> +		btrfs_folio_clear_dirty(fs_info, folio, cur, cur_len);
> +		btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
> +		/*
> +		 * Above call should set the whole folio with writeback flag, even
> +		 * just for a single subpage block.
> +		 * As long as the folio is properly locked and the range is correct,
> +		 * we should always get the folio with writeback flag.
> +		 */
> +		ASSERT(folio_test_writeback(folio));
> +
> +		submit_extent_folio(bio_ctrl, disk_bytenr, folio,
> +				    cur_len, cur - folio_pos(folio), 0);
> +		cur += cur_len;
> +	}
> +	return 0;
> +}
> +
>  /*
>   * Helper for extent_writepage().  This calls the writepage start hooks,
>   * and does the loop to map the page into extents and bios.
> @@ -1740,8 +1742,9 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  	const u64 folio_start = folio_pos(folio);
>  	const u64 folio_end = folio_start + folio_size(folio);
>  	const unsigned int blocks_per_folio = btrfs_blocks_per_folio(fs_info, folio);
> -	u64 cur;
> -	int bit;
> +	unsigned int start_bit;
> +	unsigned int end_bit;
> +	const u64 rounded_isize = round_up(i_size, fs_info->sectorsize);
>  	int ret = 0;
>  
>  	ASSERT(start >= folio_start, "start=%llu folio_start=%llu", start, folio_start);
> @@ -1769,14 +1772,27 @@ static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
>  
>  	bio_ctrl->end_io_func = end_bbio_data_write;
>  
> -	for_each_set_bit(bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
> -		cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
> +	for_each_set_bitrange(start_bit, end_bit, &bio_ctrl->submit_bitmap, blocks_per_folio) {
> +		const u64 cur_start = folio_pos(folio) + (start_bit << fs_info->sectorsize_bits);
> +		u32 cur_len = (end_bit - start_bit) << fs_info->sectorsize_bits;
>  
> -		if (cur >= i_size) {
> -			finish_io_beyond_eof(inode, folio, cur, start + len - cur, i_size);
> -			break;
> +		if (cur_start > rounded_isize) {
> +			/*
> +			 * The whole range is byoned EOF.
sic: beyond
> +			 *
> +			 * Just finish the IO and skip to the next range.
> +			 */
> +			finish_io_beyond_eof(inode, folio, cur_start, cur_len, i_size);
> +			continue;
>  		}
> -		ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
> +		if (cur_start + cur_len > rounded_isize) {
> +			/* The tailing part of the range is beyond EOF. */
> +			finish_io_beyond_eof(inode, folio, rounded_isize,
> +					     cur_start + cur_len - rounded_isize, i_size);
> +			/* Shrink the range inside the EOF. */
> +			cur_len = rounded_isize - cur_start;
> +		}
> +		ret = submit_range(inode, folio, cur_start, cur_len, bio_ctrl);
>  		if (unlikely(ret < 0)) {
>  			if (!found_error)
>  				found_error = ret;
> -- 
> 2.52.0
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes
  2025-11-20  1:16   ` Boris Burkov
@ 2025-11-20  5:28     ` Qu Wenruo
  2025-11-20  9:00       ` Qu Wenruo
  0 siblings, 1 reply; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  5:28 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



在 2025/11/20 11:46, Boris Burkov 写道:
> On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote:
>> With large data folios supports, even on x86_64 we can hit a folio that
>> contains several fs blocks.
>>
>> In that case, we still need to call btrfs_get_extent() for each block,
>> as our submission path is still iterating each fs block and submit them
>> one by one. This reduces the benefit of large folios.
>>
>> Change the behavior to submit the whole range when possible, this is
>> done by:
>>
>> - Use for_each_set_bitrange() instead of for_each_set_bit()
>>    Now we can get a contiguous range to submit instead of a single fs
>>    block.
>>
>> - Handle blocks beyond EOF in one go
>>    This is pretty much the same as the old behavior, but for a range
>>    crossing i_size, we finish the range beyond i_size first, then submit
>>    the remaining.
>>
>> - Submit the contiguous range in one go
>>    Although we still need to consider the extent map boundary.
>>
>> - Remove submit_one_sector()
>>    As it's no longer utilized.
>>
> 
> This is very cool, quite excited about it. A couple questions inline but
> the approach looks good to me overall.
> 
> Also, I was curious and looked at writepage_delalloc and that also uses
> for_each_set_bit(...) though with simpler stuff in the loop
> (btrfs_folio_set_lock) though I wonder if that can be simplified as well.

Thanks for catching that one.

I'll check for any similar call sites and do the conversion in the next 
update.

> 
> Thanks,
> Boris
> 
[...]
>> +	while (cur < start + len) {
>> +		struct extent_map *em;
>> +		u64 block_start;
>> +		u64 disk_bytenr;
>> +		u64 extent_offset;
>> +		u64 em_end;
>> +		u32 cur_len = start + len - cur;
>> +
>> +		em = btrfs_get_extent(inode, NULL, cur, sectorsize);
> 
> why is sectorsize still relevant? Just a small size to grab an
> overlapping em or something more meaningful?

This in fact is a pretty important error handling for some corner cases.

Normally we should always have an extent map for the dirty ranges, but 
if we hit some situation like the following:

                          0          16K         32K
Dirty range:             |//////////////////////|
Em range:                |          |///////////|

If we call btrfs_get_extent() with cur == 0, len == 32K, we can get an 
em at 16K, and that can screw up our calculation.

So here we should keep the sectorsize as len to catch above situation.

I'll add an comment about the hidden error handling situation.

> 
>> +		if (IS_ERR(em)) {
>> +			/*
>> +			 * bio_ctrl may contain a bio crossing several folios.
>> +			 * Submit it immediately so that the bio has a chance
>> +			 * to finish normally, other than marked as error.
>> +			 */
>> +			submit_one_bio(bio_ctrl);
>> +
>> +			/*
>> +			 * When submission failed, we should still clear the folio dirty.
>> +			 * Or the folio will be written back again but without any
>> +			 * ordered extent.
>> +			 */
>> +			btrfs_folio_clear_dirty(fs_info, folio, cur, len);
>> +			btrfs_folio_set_writeback(fs_info, folio, cur, len);
>> +			btrfs_folio_clear_writeback(fs_info, folio, cur, len);
> 
> cur, cur_len? or start, len? cur, len feels wrong

Oh, thanks for catching this one, it should be @cur, @cur_len.

.
> 
>> +
>> +			/*
>> +			 * Since there is no bio submitted to finish the ordered
>> +			 * extent, we have to manually finish this sector.
>> +			 */
>> +			btrfs_mark_ordered_io_finished(inode, folio, cur, len, false);
> 
> Same question about cur, len; also I don't know if the comment referring
> to "this sector" still makes sense. Manually finish this ordered_extent
> or something?

I'll fix the the comment too.

Thanks,
Qu

>>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] btrfs: reduce extent map lookup during writes
  2025-11-20  5:28     ` Qu Wenruo
@ 2025-11-20  9:00       ` Qu Wenruo
  0 siblings, 0 replies; 10+ messages in thread
From: Qu Wenruo @ 2025-11-20  9:00 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



在 2025/11/20 15:58, Qu Wenruo 写道:
> 
> 
> 在 2025/11/20 11:46, Boris Burkov 写道:
>> On Thu, Nov 20, 2025 at 10:34:33AM +1030, Qu Wenruo wrote:
>>> With large data folios supports, even on x86_64 we can hit a folio that
>>> contains several fs blocks.
>>>
>>> In that case, we still need to call btrfs_get_extent() for each block,
>>> as our submission path is still iterating each fs block and submit them
>>> one by one. This reduces the benefit of large folios.
>>>
>>> Change the behavior to submit the whole range when possible, this is
>>> done by:
>>>
>>> - Use for_each_set_bitrange() instead of for_each_set_bit()
>>>    Now we can get a contiguous range to submit instead of a single fs
>>>    block.
>>>
>>> - Handle blocks beyond EOF in one go
>>>    This is pretty much the same as the old behavior, but for a range
>>>    crossing i_size, we finish the range beyond i_size first, then submit
>>>    the remaining.
>>>
>>> - Submit the contiguous range in one go
>>>    Although we still need to consider the extent map boundary.
>>>
>>> - Remove submit_one_sector()
>>>    As it's no longer utilized.
>>>
>>
>> This is very cool, quite excited about it. A couple questions inline but
>> the approach looks good to me overall.
>>
>> Also, I was curious and looked at writepage_delalloc and that also uses
>> for_each_set_bit(...) though with simpler stuff in the loop
>> (btrfs_folio_set_lock) though I wonder if that can be simplified as well.
> 
> Thanks for catching that one.
> 
> I'll check for any similar call sites and do the conversion in the next 
> update.
> 
>>
>> Thanks,
>> Boris
>>
> [...]
>>> +    while (cur < start + len) {
>>> +        struct extent_map *em;
>>> +        u64 block_start;
>>> +        u64 disk_bytenr;
>>> +        u64 extent_offset;
>>> +        u64 em_end;
>>> +        u32 cur_len = start + len - cur;
>>> +
>>> +        em = btrfs_get_extent(inode, NULL, cur, sectorsize);
>>
>> why is sectorsize still relevant? Just a small size to grab an
>> overlapping em or something more meaningful?
> 
> This in fact is a pretty important error handling for some corner cases.
> 
> Normally we should always have an extent map for the dirty ranges, but 
> if we hit some situation like the following:
> 
>                           0          16K         32K
> Dirty range:             |//////////////////////|
> Em range:                |          |///////////|
> 
> If we call btrfs_get_extent() with cur == 0, len == 32K, we can get an 
> em at 16K, and that can screw up our calculation.
> 
> So here we should keep the sectorsize as len to catch above situation.
> 
> I'll add an comment about the hidden error handling situation.

I'm an idiot, the existing code can handle it by properly return a HOLE em.

To be honest, I think the length parameter is only making the function 
more confusing than needed.

I'll look into if we can remove the parameter completely.

Thanks,
Qu

> 
>>
>>> +        if (IS_ERR(em)) {
>>> +            /*
>>> +             * bio_ctrl may contain a bio crossing several folios.
>>> +             * Submit it immediately so that the bio has a chance
>>> +             * to finish normally, other than marked as error.
>>> +             */
>>> +            submit_one_bio(bio_ctrl);
>>> +
>>> +            /*
>>> +             * When submission failed, we should still clear the 
>>> folio dirty.
>>> +             * Or the folio will be written back again but without any
>>> +             * ordered extent.
>>> +             */
>>> +            btrfs_folio_clear_dirty(fs_info, folio, cur, len);
>>> +            btrfs_folio_set_writeback(fs_info, folio, cur, len);
>>> +            btrfs_folio_clear_writeback(fs_info, folio, cur, len);
>>
>> cur, cur_len? or start, len? cur, len feels wrong
> 
> Oh, thanks for catching this one, it should be @cur, @cur_len.
> 
> .
>>
>>> +
>>> +            /*
>>> +             * Since there is no bio submitted to finish the ordered
>>> +             * extent, we have to manually finish this sector.
>>> +             */
>>> +            btrfs_mark_ordered_io_finished(inode, folio, cur, len, 
>>> false);
>>
>> Same question about cur, len; also I don't know if the comment referring
>> to "this sector" still makes sense. Manually finish this ordered_extent
>> or something?
> 
> I'll fix the the comment too.
> 
> Thanks,
> Qu
> 
>>>
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-11-20  9:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-20  0:04 [PATCH 0/4] btrfs: reduce btrfs_get_extent() calls for buffered write path Qu Wenruo
2025-11-20  0:04 ` [PATCH 1/4] btrfs: integrate the error handling of submit_one_sector() Qu Wenruo
2025-11-20  0:04 ` [PATCH 2/4] btrfs: use bitmap_set() to replace set_bit() in a loop Qu Wenruo
2025-11-20  0:04 ` [PATCH 3/4] btrfs: extract the io finishing code into a helper Qu Wenruo
2025-11-20  0:59   ` Boris Burkov
2025-11-20  1:11     ` Qu Wenruo
2025-11-20  0:04 ` [PATCH 4/4] btrfs: reduce extent map lookup during writes Qu Wenruo
2025-11-20  1:16   ` Boris Burkov
2025-11-20  5:28     ` Qu Wenruo
2025-11-20  9:00       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox