public inbox for linux-btrfs@vger.kernel.org
* [PATCH 0/3] btrfs: improvements to the release_folio callback
@ 2025-03-27 16:13 fdmanana
  2025-03-27 16:13 ` [PATCH 1/3] btrfs: update comment for try_release_extent_state() fdmanana
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: fdmanana @ 2025-03-27 16:13 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

A couple of improvements to the release_folio callback and an update to a
related function comment that still refers to pages even though the callback
is now folio aware. Details in the change logs.

Filipe Manana (3):
  btrfs: update comment for try_release_extent_state()
  btrfs: allow folios to be released while ordered extent is finishing
  btrfs: pass a pointer to get_range_bits() to cache first search result

 fs/btrfs/extent-io-tree.c | 34 ++++++++++++++++++++++
 fs/btrfs/extent-io-tree.h |  7 +++++
 fs/btrfs/extent_io.c      | 60 ++++++++++++++++++++++-----------------
 fs/btrfs/inode.c          |  6 ++--
 4 files changed, 79 insertions(+), 28 deletions(-)

-- 
2.45.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/3] btrfs: update comment for try_release_extent_state()
  2025-03-27 16:13 [PATCH 0/3] btrfs: improvements to the release_folio callback fdmanana
@ 2025-03-27 16:13 ` fdmanana
  2025-03-27 16:13 ` [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing fdmanana
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: fdmanana @ 2025-03-27 16:13 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Drop the reference to pages from the comment, since the function is fully
folio aware and works regardless of how many pages are in the folio. Also,
while at it, capitalize the first word and make it more explicit that
release_folio is a callback from struct address_space_operations.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b168dc354f20..a11b22fcd154 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2618,9 +2618,9 @@ int extent_invalidate_folio(struct extent_io_tree *tree,
 }
 
 /*
- * a helper for release_folio, this tests for areas of the page that
- * are locked or under IO and drops the related state bits if it is safe
- * to drop the page.
+ * A helper for struct address_space_operations::release_folio, this tests for
+ * areas of the folio that are locked or under IO and drops the related state
+ * bits if it is safe to drop the folio.
  */
 static bool try_release_extent_state(struct extent_io_tree *tree,
 				     struct folio *folio)
-- 
2.45.2



* [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing
  2025-03-27 16:13 [PATCH 0/3] btrfs: improvements to the release_folio callback fdmanana
  2025-03-27 16:13 ` [PATCH 1/3] btrfs: update comment for try_release_extent_state() fdmanana
@ 2025-03-27 16:13 ` fdmanana
  2025-03-31 22:39   ` David Sterba
  2025-03-27 16:13 ` [PATCH 3/3] btrfs: pass a pointer to get_range_bits() to cache first search result fdmanana
  2025-03-31 22:48 ` [PATCH 0/3] btrfs: improvements to the release_folio callback David Sterba
  3 siblings, 1 reply; 7+ messages in thread
From: fdmanana @ 2025-03-27 16:13 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When the release_folio callback (from struct address_space_operations) is
invoked we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as the lock may indicate that the folio is
needed by the task that locked the range.

However, if the range is locked only because an ordered extent is finishing,
we can safely allow the folio to be released, because ordered extent
completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes, however this often is not possible on btrfs
because:

  * Once the writeback completes we queue the ordered extent completion;

  * Once the ordered extent completion starts, we lock the range in the
    inode's io_tree (at btrfs_finish_one_ordered());

  * If the release_folio callback is called while the folio's range is
    locked in the inode's io_tree, we don't allow the folio to be
    released, so the kernel has to try to release memory elsewhere,
    which may result in triggering more writeback or releasing other
    pages from the page cache which may be more useful to have around
    for applications.

In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.

Improve on this by detecting if the range is locked for ordered extent
completion and if it is, allow the folio to be released. This detection
is achieved by adding a new extent flag in the io_tree that is set when
the range is locked during ordered extent completion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-io-tree.c | 22 +++++++++++++++++
 fs/btrfs/extent-io-tree.h |  6 +++++
 fs/btrfs/extent_io.c      | 52 +++++++++++++++++++++------------------
 fs/btrfs/inode.c          |  6 +++--
 4 files changed, 60 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 13de6af279e5..14510a71a8fd 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
 	return bitset;
 }
 
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
+{
+	struct extent_state *state;
+
+	*bits = 0;
+
+	spin_lock(&tree->lock);
+	state = tree_search(tree, start);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		*bits |= state->state;
+
+		if (state->end >= end)
+			break;
+
+		state = next_state(state);
+	}
+	spin_unlock(&tree->lock);
+}
+
 /*
  * Check if the whole range [@start,@end) contains the single @bit set.
  */
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 6ffef1cd37c1..e49f24151167 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -38,6 +38,11 @@ enum {
 	 * that is left for the ordered extent completion.
 	 */
 	ENUM_BIT(EXTENT_DELALLOC_NEW),
+	/*
+	 * Mark that a range is being locked for finishing an ordered extent.
+	 * Used together with EXTENT_LOCKED.
+	 */
+	ENUM_BIT(EXTENT_FINISHING_ORDERED),
 	/*
 	 * When an ordered extent successfully completes for a region marked as
 	 * a new delalloc range, use this flag when clearing a new delalloc
@@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
 bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
 		    struct extent_state *cached_state);
 bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			     u32 bits, struct extent_changeset *changeset);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a11b22fcd154..6b9a80f9e0f5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
 {
 	u64 start = folio_pos(folio);
 	u64 end = start + folio_size(folio) - 1;
-	bool ret;
+	u32 range_bits;
+	u32 clear_bits;
+	int ret;
 
-	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
-		ret = false;
-	} else {
-		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
-				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
-				   EXTENT_QGROUP_RESERVED);
-		int ret2;
+	get_range_bits(tree, start, end, &range_bits);
 
-		/*
-		 * At this point we can safely clear everything except the
-		 * locked bit, the nodatasum bit and the delalloc new bit.
-		 * The delalloc new bit will be cleared by ordered extent
-		 * completion.
-		 */
-		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+	/*
+	 * We can release the folio if it's locked only for ordered extent
+	 * completion, since that doesn't require using the folio.
+	 */
+	if ((range_bits & EXTENT_LOCKED) &&
+	    !(range_bits & EXTENT_FINISHING_ORDERED))
+		return false;
 
-		/* if clear_extent_bit failed for enomem reasons,
-		 * we can't allow the release to continue.
-		 */
-		if (ret2 < 0)
-			ret = false;
-		else
-			ret = true;
-	}
-	return ret;
+	clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM | EXTENT_DELALLOC_NEW |
+		       EXTENT_CTLBITS | EXTENT_QGROUP_RESERVED |
+		       EXTENT_FINISHING_ORDERED);
+	/*
+	 * At this point we can safely clear everything except the locked,
+	 * nodatasum, delalloc new and finishing ordered bits. The delalloc new
+	 * bit will be cleared by ordered extent completion.
+	 */
+	ret = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+	/*
+	 * If clear_extent_bit failed for enomem reasons, we can't allow the
+	 * release to continue.
+	 */
+	if (ret < 0)
+		return false;
+
+	return true;
 }
 
 /*
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e283627c087d..469b3fd64f17 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3132,8 +3132,10 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 	 * depending on their current state).
 	 */
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags)) {
-		clear_bits |= EXTENT_LOCKED;
-		lock_extent(io_tree, start, end, &cached_state);
+		clear_bits |= EXTENT_LOCKED | EXTENT_FINISHING_ORDERED;
+		__lock_extent(io_tree, start, end,
+			      EXTENT_LOCKED | EXTENT_FINISHING_ORDERED,
+			      &cached_state);
 	}
 
 	if (freespace_inode)
-- 
2.45.2



* [PATCH 3/3] btrfs: pass a pointer to get_range_bits() to cache first search result
  2025-03-27 16:13 [PATCH 0/3] btrfs: improvements to the release_folio callback fdmanana
  2025-03-27 16:13 ` [PATCH 1/3] btrfs: update comment for try_release_extent_state() fdmanana
  2025-03-27 16:13 ` [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing fdmanana
@ 2025-03-27 16:13 ` fdmanana
  2025-03-31 22:48 ` [PATCH 0/3] btrfs: improvements to the release_folio callback David Sterba
  3 siblings, 0 replies; 7+ messages in thread
From: fdmanana @ 2025-03-27 16:13 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Allow get_range_bits() to take an extent state pointer-to-pointer argument
so that we can cache the first extent state record found in the target
range, allowing a caller to reuse it for subsequent operations without
doing a full tree search. Currently the only user is
try_release_extent_state(), which then calls __clear_extent_bit(), which
can make use of such a cached state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-io-tree.c | 14 +++++++++++++-
 fs/btrfs/extent-io-tree.h |  3 ++-
 fs/btrfs/extent_io.c      | 18 +++++++++++-------
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 14510a71a8fd..7ae24a533404 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1752,14 +1752,26 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
 	return bitset;
 }
 
-void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits,
+		    struct extent_state **cached_state)
 {
 	struct extent_state *state;
 
+	/*
+	 * The cached state is currently mandatory and not used to start the
+	 * search, only to cache the first state record found in the range.
+	 */
+	ASSERT(cached_state != NULL);
+	ASSERT(*cached_state == NULL);
+
 	*bits = 0;
 
 	spin_lock(&tree->lock);
 	state = tree_search(tree, start);
+	if (state && state->start < end) {
+		*cached_state = state;
+		refcount_inc(&state->refs);
+	}
 	while (state) {
 		if (state->start > end)
 			break;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index e49f24151167..cf83e094b00e 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -171,7 +171,8 @@ void free_extent_state(struct extent_state *state);
 bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
 		    struct extent_state *cached_state);
 bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
-void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
+void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits,
+		    struct extent_state **cached_state);
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			     u32 bits, struct extent_changeset *changeset);
 int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 6b9a80f9e0f5..6a6f9ded00e3 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2625,13 +2625,15 @@ int extent_invalidate_folio(struct extent_io_tree *tree,
 static bool try_release_extent_state(struct extent_io_tree *tree,
 				     struct folio *folio)
 {
+	struct extent_state *cached_state = NULL;
 	u64 start = folio_pos(folio);
 	u64 end = start + folio_size(folio) - 1;
 	u32 range_bits;
 	u32 clear_bits;
-	int ret;
+	bool ret = false;
+	int ret2;
 
-	get_range_bits(tree, start, end, &range_bits);
+	get_range_bits(tree, start, end, &range_bits, &cached_state);
 
 	/*
 	 * We can release the folio if it's locked only for ordered extent
@@ -2639,7 +2641,7 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
 	 */
 	if ((range_bits & EXTENT_LOCKED) &&
 	    !(range_bits & EXTENT_FINISHING_ORDERED))
-		return false;
+		goto out;
 
 	clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM | EXTENT_DELALLOC_NEW |
 		       EXTENT_CTLBITS | EXTENT_QGROUP_RESERVED |
@@ -2649,15 +2651,17 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
 	 * nodatasum, delalloc new and finishing ordered bits. The delalloc new
 	 * bit will be cleared by ordered extent completion.
 	 */
-	ret = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
+	ret2 = __clear_extent_bit(tree, start, end, clear_bits, &cached_state, NULL);
 	/*
 	 * If clear_extent_bit failed for enomem reasons, we can't allow the
 	 * release to continue.
 	 */
-	if (ret < 0)
-		return false;
+	if (ret2 == 0)
+		ret = true;
+out:
+	free_extent_state(cached_state);
 
-	return true;
+	return ret;
 }
 
 /*
-- 
2.45.2



* Re: [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing
  2025-03-27 16:13 ` [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing fdmanana
@ 2025-03-31 22:39   ` David Sterba
  2025-04-01 10:13     ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: David Sterba @ 2025-03-31 22:39 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Mar 27, 2025 at 04:13:51PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When the release_folio callback (from struct address_space_operations) is
> invoked we don't allow the folio to be released if its range is currently
> locked in the inode's io_tree, as the lock may indicate that the folio is
> needed by the task that locked the range.
> 
> However, if the range is locked only because an ordered extent is finishing,
> we can safely allow the folio to be released, because ordered extent
> completion doesn't need to use the folio at all.
> 
> When we are under memory pressure, the kernel starts writeback of dirty
> pages (folios) with the goal of releasing the pages from the page cache
> after writeback completes, however this often is not possible on btrfs
> because:
> 
>   * Once the writeback completes we queue the ordered extent completion;
> 
>   * Once the ordered extent completion starts, we lock the range in the
>     inode's io_tree (at btrfs_finish_one_ordered());
> 
>   * If the release_folio callback is called while the folio's range is
>     locked in the inode's io_tree, we don't allow the folio to be
>     released, so the kernel has to try to release memory elsewhere,
>     which may result in triggering more writeback or releasing other
>     pages from the page cache which may be more useful to have around
>     for applications.
> 
> In contrast, when the release_folio callback is invoked after writeback
> finishes and before ordered extent completion starts or locks the range,
> we allow the folio to be released, as well as when the release_folio
> callback is invoked after ordered extent completion unlocks the range.
> 
> Improve on this by detecting if the range is locked for ordered extent
> completion and if it is, allow the folio to be released. This detection
> is achieved by adding a new extent flag in the io_tree that is set when
> the range is locked during ordered extent completion.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent-io-tree.c | 22 +++++++++++++++++
>  fs/btrfs/extent-io-tree.h |  6 +++++
>  fs/btrfs/extent_io.c      | 52 +++++++++++++++++++++------------------
>  fs/btrfs/inode.c          |  6 +++--
>  4 files changed, 60 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> index 13de6af279e5..14510a71a8fd 100644
> --- a/fs/btrfs/extent-io-tree.c
> +++ b/fs/btrfs/extent-io-tree.c
> @@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
>  	return bitset;
>  }
>  
> +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
> +{
> +	struct extent_state *state;
> +
> +	*bits = 0;
> +
> +	spin_lock(&tree->lock);
> +	state = tree_search(tree, start);
> +	while (state) {
> +		if (state->start > end)
> +			break;
> +
> +		*bits |= state->state;
> +
> +		if (state->end >= end)
> +			break;
> +
> +		state = next_state(state);
> +	}
> +	spin_unlock(&tree->lock);
> +}
> +
>  /*
>   * Check if the whole range [@start,@end) contains the single @bit set.
>   */
> diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> index 6ffef1cd37c1..e49f24151167 100644
> --- a/fs/btrfs/extent-io-tree.h
> +++ b/fs/btrfs/extent-io-tree.h
> @@ -38,6 +38,11 @@ enum {
>  	 * that is left for the ordered extent completion.
>  	 */
>  	ENUM_BIT(EXTENT_DELALLOC_NEW),
> +	/*
> +	 * Mark that a range is being locked for finishing an ordered extent.
> +	 * Used together with EXTENT_LOCKED.
> +	 */
> +	ENUM_BIT(EXTENT_FINISHING_ORDERED),
>  	/*
>  	 * When an ordered extent successfully completes for a region marked as
>  	 * a new delalloc range, use this flag when clearing a new delalloc
> @@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
>  bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
>  		    struct extent_state *cached_state);
>  bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
> +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
>  int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
>  			     u32 bits, struct extent_changeset *changeset);
>  int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a11b22fcd154..6b9a80f9e0f5 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
>  {
>  	u64 start = folio_pos(folio);
>  	u64 end = start + folio_size(folio) - 1;
> -	bool ret;
> +	u32 range_bits;
> +	u32 clear_bits;
> +	int ret;
>  
> -	if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
> -		ret = false;
> -	} else {
> -		u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
> -				   EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
> -				   EXTENT_QGROUP_RESERVED);
> -		int ret2;
> +	get_range_bits(tree, start, end, &range_bits);

There's a difference in how much of the tree is traversed:
test_range_bit_exists() stops on the first occurrence of EXTENT_LOCKED (a
single bit), while get_range_bits() unconditionally explores the whole tree.

>  
> -		/*
> -		 * At this point we can safely clear everything except the
> -		 * locked bit, the nodatasum bit and the delalloc new bit.
> -		 * The delalloc new bit will be cleared by ordered extent
> -		 * completion.
> -		 */
> -		ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
> +	/*
> +	 * We can release the folio if it's locked only for ordered extent
> +	 * completion, since that doesn't require using the folio.
> +	 */
> +	if ((range_bits & EXTENT_LOCKED) &&
> +	    !(range_bits & EXTENT_FINISHING_ORDERED))

Here we need to know that LOCKED exists and FINISHING_ORDERED does not
exist in the range. This can be proven when the whole tree is traversed,
but could in some cases be reduced to

	if (test_range_bit_exists(..., LOCKED) &&
	    !test_range_bit_exists(..., FINISHING_ORDERED))

where in some percentage of cases the whole tree won't be traversed (and
the lock is held for a shorter time). This depends on what combinations of
the locks exist at runtime; it's possible that in the average case the
whole tree would be traversed anyway, and get_range_bits() is OK.


* Re: [PATCH 0/3] btrfs: improvements to the release_folio callback
  2025-03-27 16:13 [PATCH 0/3] btrfs: improvements to the release_folio callback fdmanana
                   ` (2 preceding siblings ...)
  2025-03-27 16:13 ` [PATCH 3/3] btrfs: pass a pointer to get_range_bits() to cache first search result fdmanana
@ 2025-03-31 22:48 ` David Sterba
  3 siblings, 0 replies; 7+ messages in thread
From: David Sterba @ 2025-03-31 22:48 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Thu, Mar 27, 2025 at 04:13:49PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> A couple improvements to the release_folio callback and an update to a
> related function comment that's still referring to pages when the callback
> is now folio aware. Details in the change logs.
> 
> Filipe Manana (3):
>   btrfs: update comment for try_release_extent_state()
>   btrfs: allow folios to be released while ordered extent is finishing
>   btrfs: pass a pointer to get_range_bits() to cache first search result

Reviewed-by: David Sterba <dsterba@suse.com>


* Re: [PATCH 2/3] btrfs: allow folios to be released while ordered extent is finishing
  2025-03-31 22:39   ` David Sterba
@ 2025-04-01 10:13     ` Filipe Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2025-04-01 10:13 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

On Mon, Mar 31, 2025 at 11:39 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Mar 27, 2025 at 04:13:51PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When the release_folio callback (from struct address_space_operations) is
> > invoked we don't allow the folio to be released if its range is currently
> > locked in the inode's io_tree, as the lock may indicate that the folio is
> > needed by the task that locked the range.
> >
> > However, if the range is locked only because an ordered extent is finishing,
> > we can safely allow the folio to be released, because ordered extent
> > completion doesn't need to use the folio at all.
> >
> > When we are under memory pressure, the kernel starts writeback of dirty
> > pages (folios) with the goal of releasing the pages from the page cache
> > after writeback completes, however this often is not possible on btrfs
> > because:
> >
> >   * Once the writeback completes we queue the ordered extent completion;
> >
> >   * Once the ordered extent completion starts, we lock the range in the
> >     inode's io_tree (at btrfs_finish_one_ordered());
> >
> >   * If the release_folio callback is called while the folio's range is
> >     locked in the inode's io_tree, we don't allow the folio to be
> >     released, so the kernel has to try to release memory elsewhere,
> >     which may result in triggering more writeback or releasing other
> >     pages from the page cache which may be more useful to have around
> >     for applications.
> >
> > In contrast, when the release_folio callback is invoked after writeback
> > finishes and before ordered extent completion starts or locks the range,
> > we allow the folio to be released, as well as when the release_folio
> > callback is invoked after ordered extent completion unlocks the range.
> >
> > Improve on this by detecting if the range is locked for ordered extent
> > completion and if it is, allow the folio to be released. This detection
> > is achieved by adding a new extent flag in the io_tree that is set when
> > the range is locked during ordered extent completion.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/extent-io-tree.c | 22 +++++++++++++++++
> >  fs/btrfs/extent-io-tree.h |  6 +++++
> >  fs/btrfs/extent_io.c      | 52 +++++++++++++++++++++------------------
> >  fs/btrfs/inode.c          |  6 +++--
> >  4 files changed, 60 insertions(+), 26 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> > index 13de6af279e5..14510a71a8fd 100644
> > --- a/fs/btrfs/extent-io-tree.c
> > +++ b/fs/btrfs/extent-io-tree.c
> > @@ -1752,6 +1752,28 @@ bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32
> >       return bitset;
> >  }
> >
> > +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits)
> > +{
> > +     struct extent_state *state;
> > +
> > +     *bits = 0;
> > +
> > +     spin_lock(&tree->lock);
> > +     state = tree_search(tree, start);
> > +     while (state) {
> > +             if (state->start > end)
> > +                     break;
> > +
> > +             *bits |= state->state;
> > +
> > +             if (state->end >= end)
> > +                     break;
> > +
> > +             state = next_state(state);
> > +     }
> > +     spin_unlock(&tree->lock);
> > +}
> > +
> >  /*
> >   * Check if the whole range [@start,@end) contains the single @bit set.
> >   */
> > diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
> > index 6ffef1cd37c1..e49f24151167 100644
> > --- a/fs/btrfs/extent-io-tree.h
> > +++ b/fs/btrfs/extent-io-tree.h
> > @@ -38,6 +38,11 @@ enum {
> >        * that is left for the ordered extent completion.
> >        */
> >       ENUM_BIT(EXTENT_DELALLOC_NEW),
> > +     /*
> > +      * Mark that a range is being locked for finishing an ordered extent.
> > +      * Used together with EXTENT_LOCKED.
> > +      */
> > +     ENUM_BIT(EXTENT_FINISHING_ORDERED),
> >       /*
> >        * When an ordered extent successfully completes for a region marked as
> >        * a new delalloc range, use this flag when clearing a new delalloc
> > @@ -166,6 +171,7 @@ void free_extent_state(struct extent_state *state);
> >  bool test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bit,
> >                   struct extent_state *cached_state);
> >  bool test_range_bit_exists(struct extent_io_tree *tree, u64 start, u64 end, u32 bit);
> > +void get_range_bits(struct extent_io_tree *tree, u64 start, u64 end, u32 *bits);
> >  int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
> >                            u32 bits, struct extent_changeset *changeset);
> >  int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index a11b22fcd154..6b9a80f9e0f5 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2627,33 +2627,37 @@ static bool try_release_extent_state(struct extent_io_tree *tree,
> >  {
> >       u64 start = folio_pos(folio);
> >       u64 end = start + folio_size(folio) - 1;
> > -     bool ret;
> > +     u32 range_bits;
> > +     u32 clear_bits;
> > +     int ret;
> >
> > -     if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
> > -             ret = false;
> > -     } else {
> > -             u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
> > -                                EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
> > -                                EXTENT_QGROUP_RESERVED);
> > -             int ret2;
> > +     get_range_bits(tree, start, end, &range_bits);
>
> There's a difference how much of the tree is traversed,
> test_range_bit_exists() stops on first occurence of EXTENT_LOCKED (a
> single bit), get_range_bits() unconditionally explores the whole tree.

The whole tree?
The folio's range, unless there are no records in the tree outside the
folio's range, in which case it is a very small tree.

But it's not a problem for several reasons:

1) When a range is not under IO or locked, there's typically no state
record in the tree;

2) If a range is under delalloc or writeback, we don't even get to
this function - we exit early in btrfs_release_folio() if the folio is
dirty or under writeback;

3) The IO is folio size based (whether single page or multi page in
the near future), so it's unlikely to find more than one state record
for the range - correct me if you know of cases where we'll get more
than one.

4) For bit ranges other than EXTENT_LOCKED, extent state records are
merged, so even for very uncommon scenarios it's unlikely to find more
than 1 state record for the folio's range.

So traversing the whole range is far from being a bottleneck or
causing more lock contention, even if there are multiple state records
- we already had to traverse the whole range with
test_range_bit_exists() when the folio's range is not locked.

>
> >
> > -             /*
> > -              * At this point we can safely clear everything except the
> > -              * locked bit, the nodatasum bit and the delalloc new bit.
> > -              * The delalloc new bit will be cleared by ordered extent
> > -              * completion.
> > -              */
> > -             ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
> > +     /*
> > +      * We can release the folio if it's locked only for ordered extent
> > +      * completion, since that doesn't require using the folio.
> > +      */
> > +     if ((range_bits & EXTENT_LOCKED) &&
> > +         !(range_bits & EXTENT_FINISHING_ORDERED))
>
> Here we need to know that LOCKED exists and FINISHING_ORDERED does not
> exist in the range. This can be proven when the whole tree is traversed,

I still don't get why you say the whole tree.

> but could be in some cases be reduced to
>
>         if (test_range_bit_exists(..., LOCKED) &&
>             !test_range_bit_exists(, FINISHING_ORDERED))
>
> where in some percent of cases the whole tree won't be traversed (and
> the lock held for a shorter time). This depends on the runtime what
> combinations of the locks exist, it's possible than in the average case
> the whole tree would be traversed anyway, and get_range_bits() is OK.

I don't see what specific cases you are talking about; correct me if any of
the things I said above is wrong and we can in fact have a lot of extent
state records in the folio's range.
But even if we do, it's the same behaviour as for
test_range_bit_exists() when the range is not locked.

Thanks.

