[PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation
@ 2025-05-07 17:23 fdmanana
  2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Some fixes and cleanups around ordered extent allocation. Details in the
change logs.

Filipe Manana (5):
  btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  btrfs: check we grabbed inode reference when allocating an ordered extent
  btrfs: fold error checks when allocating ordered extent and update comments
  btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
  btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()

 fs/btrfs/block-group.c  | 10 ++++----
 fs/btrfs/block-group.h  |  2 +-
 fs/btrfs/direct-io.c    |  3 +--
 fs/btrfs/extent-tree.c  |  8 +++----
 fs/btrfs/extent-tree.h  |  2 +-
 fs/btrfs/inode.c        | 10 ++++----
 fs/btrfs/ordered-data.c | 51 ++++++++++++++++++++++++++---------------
 7 files changed, 50 insertions(+), 36 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
@ 2025-05-07 17:23 ` fdmanana
  2025-05-07 22:33   ` Boris Burkov
  2025-05-07 17:23 ` [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an " fdmanana
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If we fail to allocate an ordered extent for a COW write we end up leaking
a qgroup data reservation since we called btrfs_qgroup_release_data() but
we didn't call btrfs_qgroup_free_refroot() (which would happen when
running the respective data delayed ref created by ordered extent
completion or when finishing the ordered extent in case an error happened).

So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
ordered extent for a COW write.

Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ae49f87b27e8..e44d3dd17caf 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	struct btrfs_ordered_extent *entry;
 	int ret;
 	u64 qgroup_rsv = 0;
+	const bool is_nocow = (flags &
+	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
 
-	if (flags &
-	    ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
+	if (is_nocow) {
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
 		if (ret < 0)
@@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 			return ERR_PTR(ret);
 	}
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
-	if (!entry)
+	if (!entry) {
+		if (!is_nocow)
+			btrfs_qgroup_free_refroot(inode->root->fs_info,
+						  btrfs_root_id(inode->root),
+						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	entry->file_offset = file_offset;
 	entry->num_bytes = num_bytes;
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an ordered extent
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
  2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
@ 2025-05-07 17:23 ` fdmanana
  2025-05-07 17:23 ` [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments fdmanana
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When allocating an ordered extent we call igrab() to get a reference on
the inode and attach it to the ordered extent. For an ordered extent we
always must have an inode reference since we during its life cycle we
need to access the inode for several things like for example:

* Inserting the ordered extent right after allocating it, when calling
  insert_ordered_extent() - we need to lock the inode's ordered_tree_lock;

* In the bio submission path we need to add checksums to the ordered
  extent and we end up at btrfs_add_ordered_sum(), where again we need
  to grab the inode from the ordered extent to lock the inode's
  ordered_tree_lock;

* When finishing an ordered extent, at btrfs_finish_ordered_extent(), we
  need again to access its inode in order to lock the inode's
  ordered_tree_lock;

* Etc etc etc.

Every where we deal with an ordered extent we always expect its inode to
be not NULL, the only exception being btrfs_put_ordered_extent() where
we check if it's NULL before calling btrfs_add_delayed_iput(), even though
we have already assumed it's not NULL when calling the tracepoint
trace_btrfs_ordered_extent_put() since the tracepoint dereferences the
inode to extract its number and root without ever checking it's NULL.

The igrab() call can return NULL if the inode is about to be freed or is
being freed (its state has I_FREEING or I_WILL_FREE set), and that's why
there's such check at btrfs_put_ordered_extent(). The igrab() and NULL
check were introduced in commit 5fd02043553b ("Btrfs: finish ordered
extents in their own thread") but even back then we always needed and
assumed igrab() returned a non-NULL pointer, since for example when
removing an ordered extent, at btrfs_remove_ordered_extent(), we assumed
the inode pointer was not NULL in order to access the inode's ordered
extent tree.

In fact whenever we allocate an ordered extent we are holding an inode
reference and the inode is not being freed or going to be freed (which
happens in the final iput), and since we depend on the inode for the
life cycle of the ordered extent, just make ordered extent allocation
to fail in case igrab() returns NULL and trigger a warning, to make it
clear it's not expected. This allows to remove the confusing NULL inode
check at btrfs_put_ordered_extent().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e44d3dd17caf..a6c5fc78bc4f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -172,11 +172,8 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	}
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry) {
-		if (!is_nocow)
-			btrfs_qgroup_free_refroot(inode->root->fs_info,
-						  btrfs_root_id(inode->root),
-						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
-		return ERR_PTR(-ENOMEM);
+		entry = ERR_PTR(-ENOMEM);
+		goto out;
 	}

 	entry->file_offset = file_offset;
@@ -186,7 +183,12 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	entry->disk_num_bytes = disk_num_bytes;
 	entry->offset = offset;
 	entry->bytes_left = num_bytes;
-	entry->inode = BTRFS_I(igrab(&inode->vfs_inode));
+	if (WARN_ON_ONCE(!igrab(&inode->vfs_inode))) {
+		kmem_cache_free(btrfs_ordered_extent_cache, entry);
+		entry = ERR_PTR(-ESTALE);
+		goto out;
+	}
+	entry->inode = inode;
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
 	entry->qgroup_rsv = qgroup_rsv;
@@ -209,6 +211,12 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	btrfs_mod_outstanding_extents(inode, 1);
 	spin_unlock(&inode->lock);

+out:
+	if (IS_ERR(entry) && !is_nocow)
+		btrfs_qgroup_free_refroot(inode->root->fs_info,
+					  btrfs_root_id(inode->root),
+					  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
+
 	return entry;
 }

@@ -622,8 +630,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 		ASSERT(list_empty(&entry->root_extent_list));
 		ASSERT(list_empty(&entry->log_list));
 		ASSERT(RB_EMPTY_NODE(&entry->rb_node));
-		if (entry->inode)
-			btrfs_add_delayed_iput(entry->inode);
+		btrfs_add_delayed_iput(entry->inode);
 		list_for_each_entry_safe(sum, tmp, &entry->list, list)
 			kvfree(sum);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
-- 
2.47.2

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
  2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
  2025-05-07 17:23 ` [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an " fdmanana
@ 2025-05-07 17:23 ` fdmanana
  2025-05-07 17:23 ` [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() fdmanana
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of having an error check and return on each branch of the if
statement, move the error check to happen after that if branch, reducing
source code and object code sizes.

Before this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1840174	 163742	  16136	2020052	 1ed2d4	fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1840138	 163742	  16136	2020016	 1ed2b0	fs/btrfs/btrfs.ko

While at it and moving the comments, update the comments to be more clear
about how qgroup reserved space is released and the intricacies of how
it's managed for COW writes.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ordered-data.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index a6c5fc78bc4f..477366eac164 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -156,20 +156,22 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
 	const bool is_nocow = (flags &
 	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
 
-	if (is_nocow) {
-		/* For nocow write, we can release the qgroup rsv right now */
+	/*
+	 * For a nocow write we can free the qgroup reserve right now. For a cow
+	 * one we transfer the reserved space from the inode's iotree into the
+	 * ordered extent by calling btrfs_qgroup_release_data() and tracking
+	 * the qgroup reserved amount in the ordered extent, so that later after
+	 * completing the ordered extent, when running the data delayed ref it
+	 * creates, we free the reserved data with btrfs_qgroup_free_refroot().
+	 */
+	if (is_nocow)
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
-		if (ret < 0)
-			return ERR_PTR(ret);
-	} else {
-		/*
-		 * The ordered extent has reserved qgroup space, release now
-		 * and pass the reserved number for qgroup_record to free.
-		 */
+	else
 		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes, &qgroup_rsv);
-		if (ret < 0)
-			return ERR_PTR(ret);
-	}
+
+	if (ret < 0)
+		return ERR_PTR(ret);
+
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry) {
 		entry = ERR_PTR(-ENOMEM);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
                   ` (2 preceding siblings ...)
  2025-05-07 17:23 ` [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments fdmanana
@ 2025-05-07 17:23 ` fdmanana
  2025-05-07 17:23 ` [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() fdmanana
  2025-05-08 16:20 ` [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation Boris Burkov
  5 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/block-group.c | 10 +++++-----
 fs/btrfs/block-group.h |  2 +-
 fs/btrfs/extent-tree.c |  2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 034182fd9b91..20f238dd8d96 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3829,9 +3829,9 @@ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
 /*
  * Update the block_group and space info counters.
  *
- * @cache:      The cache we are manipulating
- * @num_bytes:  The number of bytes in question
- * @delalloc:   The blocks are allocated for the delalloc write
+ * @cache:       The cache we are manipulating.
+ * @num_bytes:   The number of bytes in question.
+ * @is_delalloc: Whether the blocks are allocated for a delalloc write.
  *
  * This is called by somebody who is freeing space that was never actually used
  * on disk.  For example if you reserve some space for a new leaf in transaction
@@ -3839,7 +3839,7 @@ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
  * reserve set to 0 in order to clear the reservation.
  */
 void btrfs_free_reserved_bytes(struct btrfs_block_group *cache,
-			       u64 num_bytes, int delalloc)
+			       u64 num_bytes, bool is_delalloc)
 {
 	struct btrfs_space_info *space_info = cache->space_info;
 
@@ -3853,7 +3853,7 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache,
 	space_info->bytes_reserved -= num_bytes;
 	space_info->max_extent_size = 0;
 
-	if (delalloc)
+	if (is_delalloc)
 		cache->delalloc_bytes -= num_bytes;
 	spin_unlock(&cache->lock);
 
diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 35309b690d6f..f1cbd6643787 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -341,7 +341,7 @@ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache,
 			     u64 ram_bytes, u64 num_bytes, int delalloc,
 			     bool force_wrong_size_class);
 void btrfs_free_reserved_bytes(struct btrfs_block_group *cache,
-			       u64 num_bytes, int delalloc);
+			       u64 num_bytes, bool is_delalloc);
 int btrfs_chunk_alloc(struct btrfs_trans_handle *trans,
 		      struct btrfs_space_info *space_info, u64 flags,
 		      enum btrfs_chunk_alloc_enum force);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 64e8c653ae8f..ca229381d448 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3508,7 +3508,7 @@ int btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 	btrfs_add_free_space(bg, buf->start, buf->len);
-	btrfs_free_reserved_bytes(bg, buf->len, 0);
+	btrfs_free_reserved_bytes(bg, buf->len, false);
 	btrfs_put_block_group(bg);
 	trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
 
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
                   ` (3 preceding siblings ...)
  2025-05-07 17:23 ` [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() fdmanana
@ 2025-05-07 17:23 ` fdmanana
  2025-05-08 16:20 ` [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation Boris Burkov
  5 siblings, 0 replies; 16+ messages in thread
From: fdmanana @ 2025-05-07 17:23 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We are using an integer for the 'delalloc' argument but all we need is a
boolean, so switch the type to 'bool' and rename the parameter to
'is_delalloc' to better match the fact that it's a boolean.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/direct-io.c   |  3 +--
 fs/btrfs/extent-tree.c |  6 +++---
 fs/btrfs/extent-tree.h |  2 +-
 fs/btrfs/inode.c       | 10 +++++-----
 4 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index fde612d9b077..546410c8fac0 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -204,8 +204,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 				     BTRFS_ORDERED_REGULAR);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	if (IS_ERR(em))
-		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
-					   1);
+		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
 
 	return em;
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ca229381d448..14589d1a5f49 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4734,7 +4734,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 }
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
-			       u64 start, u64 len, int delalloc)
+			       u64 start, u64 len, bool is_delalloc)
 {
 	struct btrfs_block_group *cache;
 
@@ -4746,7 +4746,7 @@ int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
 	}
 
 	btrfs_add_free_space(cache, start, len);
-	btrfs_free_reserved_bytes(cache, len, delalloc);
+	btrfs_free_reserved_bytes(cache, len, is_delalloc);
 	trace_btrfs_reserved_extent_free(fs_info, start, len);
 
 	btrfs_put_block_group(cache);
@@ -5220,7 +5220,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 	btrfs_tree_unlock(buf);
 	free_extent_buffer(buf);
 out_free_reserved:
-	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 0);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, false);
 out_unuse:
 	btrfs_unuse_block_rsv(fs_info, block_rsv, blocksize);
 	return ERR_PTR(ret);
diff --git a/fs/btrfs/extent-tree.h b/fs/btrfs/extent-tree.h
index 0ed682d9ed7b..88e19f9b87e2 100644
--- a/fs/btrfs/extent-tree.h
+++ b/fs/btrfs/extent-tree.h
@@ -150,7 +150,7 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 u64 btrfs_get_extent_owner_root(struct btrfs_fs_info *fs_info,
 				struct extent_buffer *leaf, int slot);
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
-			       u64 start, u64 len, int delalloc);
+			       u64 start, u64 len, bool is_delalloc);
 int btrfs_pin_reserved_extent(struct btrfs_trans_handle *trans,
 			      const struct extent_buffer *eb);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2666b0f73452..f0b31691e766 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1174,7 +1174,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 
 out_free_reserve:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
 	mapping_set_error(inode->vfs_inode.i_mapping, -EIO);
 	extent_clear_unlock_delalloc(inode, start, end,
 				     NULL, &cached,
@@ -1452,7 +1452,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	btrfs_drop_extent_map_range(inode, start, start + cur_alloc_size - 1, false);
 out_reserve:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
 out_unlock:
 	/*
 	 * Now, we have three regions to clean up:
@@ -3282,7 +3282,7 @@ int btrfs_finish_one_ordered(struct btrfs_ordered_extent *ordered_extent)
 						NULL);
 			btrfs_free_reserved_extent(fs_info,
 					ordered_extent->disk_bytenr,
-					ordered_extent->disk_num_bytes, 1);
+					ordered_extent->disk_num_bytes, true);
 			/*
 			 * Actually free the qgroup rsv which was released when
 			 * the ordered extent was created.
@@ -8886,7 +8886,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
 			btrfs_free_reserved_extent(fs_info, ins.objectid,
-						   ins.offset, 0);
+						   ins.offset, false);
 			break;
 		}
 
@@ -9714,7 +9714,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 
 out_free_reserved:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
-	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, true);
 out_delalloc_release:
 	btrfs_delalloc_release_extents(inode, num_bytes);
 	btrfs_delalloc_release_metadata(inode, disk_num_bytes, ret < 0);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
@ 2025-05-07 22:33   ` Boris Burkov
  2025-05-07 23:11     ` Qu Wenruo
  2025-05-08 13:40     ` Filipe Manana
  0 siblings, 2 replies; 16+ messages in thread
From: Boris Burkov @ 2025-05-07 22:33 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If we fail to allocate an ordered extent for a COW write we end up leaking
> a qgroup data reservation since we called btrfs_qgroup_release_data() but
> we didn't call btrfs_qgroup_free_refroot() (which would happen when
> running the respective data delayed ref created by ordered extent
> completion or when finishing the ordered extent in case an error happened).
> 
> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> ordered extent for a COW write.

I haven't tried it myself yet, but I believe that this patch will double
free reservation from the qgroup when this case occurs.

Can you share the context where you saw this bug? Have you run fstests
with qgroups or squotas enabled? I think this should show pretty quickly
in generic/475 with qgroups on.

Consider, for example, the following execution of the dio case:

btrfs_dio_iomap_begin
  btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
  btrfs_get_blocks_direct_write
    btrfs_create_dio_extent
      btrfs_alloc_ordered_extent
        alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
      // error propagates up
    // error propagates up via PTR_ERR

which brings us to the code:
if (ret < 0)
        goto unlock_err;
...
unlock_err:
...
if (dio_data->data_space_reserved) {
        btrfs_free_reserved_data_space()
}

so the execution continues...

btrfs_free_reserved_data_space
  btrfs_qgroup_free_data
    __btrfs_qgroup_release_data
      qgroup_free_reserved_data
        btrfs_qgroup_free_refroot

This will result in a underflow of the reservation once everything
outstanding gets released.

Furthermore, raw calls to free_refroot in cases where we have a reserved
changeset make me worried, because they will run afoul of races with
multiple threads touching the various bits. I don't see the bugs here,
but the reservation lifetime is really tricky so I wouldn't be surprised
if something like that was wrong too.

As of the last time I looked at this, I think cow_file_range handles
this correctly as well. Looking really quickly now, it looks like maybe
submit_one_async_extent() might not do a qgroup free, but I'm not sure
where the corresponding reservation is coming from.

I think if you have indeed found a different codepath that makes a data
reservation but doesn't release the qgroup part when allocating the
ordered extent fails, then the fastest path to a fix is to do that at
the same level as where it calls btrfs_check_data_free_space or (however
it gets the reservation), as is currently done by the main
ordered_extent allocation paths. With async_extent, we might need to
plumb through the reserved extent changeset through the async chunk to
do it perfectly...

Thanks,
Boris

> 
> Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ordered-data.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index ae49f87b27e8..e44d3dd17caf 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>  	struct btrfs_ordered_extent *entry;
>  	int ret;
>  	u64 qgroup_rsv = 0;
> +	const bool is_nocow = (flags &
> +	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
>  
> -	if (flags &
> -	    ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> +	if (is_nocow) {
>  		/* For nocow write, we can release the qgroup rsv right now */
>  		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
>  		if (ret < 0)
> @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>  			return ERR_PTR(ret);
>  	}
>  	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> -	if (!entry)
> +	if (!entry) {
> +		if (!is_nocow)
> +			btrfs_qgroup_free_refroot(inode->root->fs_info,
> +						  btrfs_root_id(inode->root),
> +						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
>  		return ERR_PTR(-ENOMEM);
> +	}
>  
>  	entry->file_offset = file_offset;
>  	entry->num_bytes = num_bytes;
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 22:33   ` Boris Burkov
@ 2025-05-07 23:11     ` Qu Wenruo
  2025-05-07 23:39       ` Boris Burkov
  2025-05-08 13:43       ` Filipe Manana
  2025-05-08 13:40     ` Filipe Manana
  1 sibling, 2 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-05-07 23:11 UTC (permalink / raw)
  To: Boris Burkov, fdmanana; +Cc: linux-btrfs



在 2025/5/8 08:03, Boris Burkov 写道:
> On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> If we fail to allocate an ordered extent for a COW write we end up leaking
>> a qgroup data reservation since we called btrfs_qgroup_release_data() but
>> we didn't call btrfs_qgroup_free_refroot() (which would happen when
>> running the respective data delayed ref created by ordered extent
>> completion or when finishing the ordered extent in case an error happened).
>>
>> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
>> ordered extent for a COW write.
> 
> I haven't tried it myself yet, but I believe that this patch will double
> free reservation from the qgroup when this case occurs.
> 
> Can you share the context where you saw this bug? Have you run fstests
> with qgroups or squotas enabled? I think this should show pretty quickly
> in generic/475 with qgroups on.
> 
> Consider, for example, the following execution of the dio case:
> 
> btrfs_dio_iomap_begin
>    btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
>    btrfs_get_blocks_direct_write
>      btrfs_create_dio_extent
>        btrfs_alloc_ordered_extent
>          alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
>        // error propagates up
>      // error propagates up via PTR_ERR
> 
> which brings us to the code:
> if (ret < 0)
>          goto unlock_err;
> ...
> unlock_err:
> ...
> if (dio_data->data_space_reserved) {
>          btrfs_free_reserved_data_space()
> }
> 
> so the execution continues...
> 
> btrfs_free_reserved_data_space
>    btrfs_qgroup_free_data
>      __btrfs_qgroup_release_data
>        qgroup_free_reserved_data
>          btrfs_qgroup_free_refroot
> 
> This will result in a underflow of the reservation once everything
> outstanding gets released.

That should still be safe.

Firstly at alloc_ordered_extent() we released the qgroup space already, 
thus there will be no EXTENT_QGROUP_RESERVED range in extent-io tree 
anymore.

Then at the final cleanup, qgroup_free_reserved_data_space() will 
release/free nothing, because the extent-io tree no longer has the 
corresponding range with EXTENT_QGROUP_RESERVED.

This is the core design of qgroup data reserved space, which allows us 
to call btrfs_release/free_data() twice without double accounting.

> 
> Furthermore, raw calls to free_refroot in cases where we have a reserved
> changeset make me worried, because they will run afoul of races with
> multiple threads touching the various bits. I don't see the bugs here,
> but the reservation lifetime is really tricky so I wouldn't be surprised
> if something like that was wrong too.

This free_refroot() is the common practice here. The extent-io tree 
based accounting can only cover the reserved space before ordered extent 
is allocated.

Then the reserved space is "transferred" to ordered extent, and 
eventually go to the new data extent, and finally freed at 
btrfs_qgroup_account_extents(), which also goes the free_refroot() way.

> 
> As of the last time I looked at this, I think cow_file_range handles
> this correctly as well. Looking really quickly now, it looks like maybe
> submit_one_async_extent() might not do a qgroup free, but I'm not sure
> where the corresponding reservation is coming from.
> 
> I think if you have indeed found a different codepath that makes a data
> reservation but doesn't release the qgroup part when allocating the
> ordered extent fails, then the fastest path to a fix is to do that at
> the same level as where it calls btrfs_check_data_free_space or (however
> it gets the reservation), as is currently done by the main
> ordered_extent allocation paths. With async_extent, we might need to
> plumb through the reserved extent changeset through the async chunk to
> do it perfectly...

I agree with you that, the extent io tree based double freeing 
prevention should only be the last safe net, not something we should 
abuse when possible.

But I can't find a better solution, mostly due to the fact that after 
the btrfs_qgroup_release_data() call, the qgroup reserved space is 
already released, and we have no way to undo that...

Maybe we can delayed the qgroup release/free calls until the ordered 
extent @entry is allocated?

Thanks,
Qu


> 
> Thanks,
> Boris
> 
>>
>> Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>   fs/btrfs/ordered-data.c | 12 +++++++++---
>>   1 file changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index ae49f87b27e8..e44d3dd17caf 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>>   	struct btrfs_ordered_extent *entry;
>>   	int ret;
>>   	u64 qgroup_rsv = 0;
>> +	const bool is_nocow = (flags &
>> +	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
>>   
>> -	if (flags &
>> -	    ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
>> +	if (is_nocow) {
>>   		/* For nocow write, we can release the qgroup rsv right now */
>>   		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
>>   		if (ret < 0)
>> @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
>>   			return ERR_PTR(ret);
>>   	}
>>   	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
>> -	if (!entry)
>> +	if (!entry) {
>> +		if (!is_nocow)
>> +			btrfs_qgroup_free_refroot(inode->root->fs_info,
>> +						  btrfs_root_id(inode->root),
>> +						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
>>   		return ERR_PTR(-ENOMEM);
>> +	}
>>   
>>   	entry->file_offset = file_offset;
>>   	entry->num_bytes = num_bytes;
>> -- 
>> 2.47.2
>>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 23:11     ` Qu Wenruo
@ 2025-05-07 23:39       ` Boris Burkov
  2025-05-08  0:08         ` Boris Burkov
  2025-05-08 13:43       ` Filipe Manana
  1 sibling, 1 reply; 16+ messages in thread
From: Boris Burkov @ 2025-05-07 23:39 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Thu, May 08, 2025 at 08:41:56AM +0930, Qu Wenruo wrote:
> 
> 
> 在 2025/5/8 08:03, Boris Burkov 写道:
> > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > If we fail to allocate an ordered extent for a COW write we end up leaking
> > > a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > > we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > > running the respective data delayed ref created by ordered extent
> > > completion or when finishing the ordered extent in case an error happened).
> > > 
> > > So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > > ordered extent for a COW write.
> > 
> > I haven't tried it myself yet, but I believe that this patch will double
> > free reservation from the qgroup when this case occurs.
> > 
> > Can you share the context where you saw this bug? Have you run fstests
> > with qgroups or squotas enabled? I think this should show pretty quickly
> > in generic/475 with qgroups on.
> > 
> > Consider, for example, the following execution of the dio case:
> > 
> > btrfs_dio_iomap_begin
> >    btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> >    btrfs_get_blocks_direct_write
> >      btrfs_create_dio_extent
> >        btrfs_alloc_ordered_extent
> >          alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> >        // error propagates up
> >      // error propagates up via PTR_ERR
> > 
> > which brings us to the code:
> > if (ret < 0)
> >          goto unlock_err;
> > ...
> > unlock_err:
> > ...
> > if (dio_data->data_space_reserved) {
> >          btrfs_free_reserved_data_space()
> > }
> > 
> > so the execution continues...
> > 
> > btrfs_free_reserved_data_space
> >    btrfs_qgroup_free_data
> >      __btrfs_qgroup_release_data
> >        qgroup_free_reserved_data
> >          btrfs_qgroup_free_refroot
> > 
> > This will result in a underflow of the reservation once everything
> > outstanding gets released.
> 
> That should still be safe.
> 
> Firstly at alloc_ordered_extent() we released the qgroup space already, thus
> there will be no EXTENT_QGROUP_RESERVED range in extent-io tree anymore.

Ah yes, good point. And that is before any chance of an error, so now I
I agree that this patch is correct, and that my analysis was wrong.

> 
> Then at the final cleanup, qgroup_free_reserved_data_space() will
> release/free nothing, because the extent-io tree no longer has the
> corresponding range with EXTENT_QGROUP_RESERVED.
> 
> This is the core design of qgroup data reserved space, which allows us to
> call btrfs_release/free_data() twice without double accounting.
> 
> > 
> > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > changeset make me worried, because they will run afoul of races with
> > multiple threads touching the various bits. I don't see the bugs here,
> > but the reservation lifetime is really tricky so I wouldn't be surprised
> > if something like that was wrong too.
> 
> This free_refroot() is the common practice here. The extent-io tree based
> accounting can only cover the reserved space before ordered extent is
> allocated.
> 
> Then the reserved space is "transferred" to ordered extent, and eventually
> go to the new data extent, and finally freed at
> btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
> 

That makes sense. I was also thinking of it in terms of "transferring"
when working on this. And yes, there are other post-OE-alloc error paths
(some that I wrote, lol) but I think they are a smell. However, a key
difference is that in those cases, the reservation either already
belongs to the delayed ref logic, or the OE still exists. So this is
still a weird case where we do not have the protection of an OE but
call a raw free_refroot.

I would guess that extent locking during OE creation probably saves us,
though, from glancing at the code. generic/475 with qgroups on will be
the ultimate arbiter here.

> > 
> > As of the last time I looked at this, I think cow_file_range handles
> > this correctly as well. Looking really quickly now, it looks like maybe
> > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > where the corresponding reservation is coming from.
> > 
> > I think if you have indeed found a different codepath that makes a data
> > reservation but doesn't release the qgroup part when allocating the
> > ordered extent fails, then the fastest path to a fix is to do that at
> > the same level as where it calls btrfs_check_data_free_space or (however
> > it gets the reservation), as is currently done by the main
> > ordered_extent allocation paths. With async_extent, we might need to
> > plumb through the reserved extent changeset through the async chunk to
> > do it perfectly...
> 
> I agree with you that, the extent io tree based double freeing prevention
> should only be the last safe net, not something we should abuse when
> possible.

Yes, this does still feel icky to me.

> 
> But I can't find a better solution, mostly due to the fact that after the
> btrfs_qgroup_release_data() call, the qgroup reserved space is already
> released, and we have no way to undo that...

I so wish we could have a nice RAII qgroup reservation object flow
through the whole system :)

> 
> Maybe we can delayed the qgroup release/free calls until the ordered extent
> @entry is allocated?
> 

I do think that is likely the cleaner fix. Though it would have to also be
after the igrab() check later in the series.

> Thanks,
> Qu
> 
> 
> > 
> > Thanks,
> > Boris
> > 
> > > 
> > > Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >   fs/btrfs/ordered-data.c | 12 +++++++++---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > index ae49f87b27e8..e44d3dd17caf 100644
> > > --- a/fs/btrfs/ordered-data.c
> > > +++ b/fs/btrfs/ordered-data.c
> > > @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >   	struct btrfs_ordered_extent *entry;
> > >   	int ret;
> > >   	u64 qgroup_rsv = 0;
> > > +	const bool is_nocow = (flags &
> > > +	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> > > -	if (flags &
> > > -	    ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > > +	if (is_nocow) {
> > >   		/* For nocow write, we can release the qgroup rsv right now */
> > >   		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> > >   		if (ret < 0)
> > > @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >   			return ERR_PTR(ret);
> > >   	}
> > >   	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > > -	if (!entry)
> > > +	if (!entry) {
> > > +		if (!is_nocow)
> > > +			btrfs_qgroup_free_refroot(inode->root->fs_info,
> > > +						  btrfs_root_id(inode->root),
> > > +						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> > >   		return ERR_PTR(-ENOMEM);
> > > +	}
> > >   	entry->file_offset = file_offset;
> > >   	entry->num_bytes = num_bytes;
> > > -- 
> > > 2.47.2
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 23:39       ` Boris Burkov
@ 2025-05-08  0:08         ` Boris Burkov
  0 siblings, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-05-08  0:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: fdmanana, linux-btrfs

On Wed, May 07, 2025 at 04:39:59PM -0700, Boris Burkov wrote:
> On Thu, May 08, 2025 at 08:41:56AM +0930, Qu Wenruo wrote:
> > 
> > 
> > 在 2025/5/8 08:03, Boris Burkov 写道:
> > > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > > 
> > > > If we fail to allocate an ordered extent for a COW write we end up leaking
> > > > a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > > > we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > > > running the respective data delayed ref created by ordered extent
> > > > completion or when finishing the ordered extent in case an error happened).
> > > > 
> > > > So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > > > ordered extent for a COW write.
> > > 
> > > I haven't tried it myself yet, but I believe that this patch will double
> > > free reservation from the qgroup when this case occurs.
> > > 
> > > Can you share the context where you saw this bug? Have you run fstests
> > > with qgroups or squotas enabled? I think this should show pretty quickly
> > > in generic/475 with qgroups on.
> > > 
> > > Consider, for example, the following execution of the dio case:
> > > 
> > > btrfs_dio_iomap_begin
> > >    btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> > >    btrfs_get_blocks_direct_write
> > >      btrfs_create_dio_extent
> > >        btrfs_alloc_ordered_extent
> > >          alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> > >        // error propagates up
> > >      // error propagates up via PTR_ERR
> > > 
> > > which brings us to the code:
> > > if (ret < 0)
> > >          goto unlock_err;
> > > ...
> > > unlock_err:
> > > ...
> > > if (dio_data->data_space_reserved) {
> > >          btrfs_free_reserved_data_space()
> > > }
> > > 
> > > so the execution continues...
> > > 
> > > btrfs_free_reserved_data_space
> > >    btrfs_qgroup_free_data
> > >      __btrfs_qgroup_release_data
> > >        qgroup_free_reserved_data
> > >          btrfs_qgroup_free_refroot
> > > 
> > > This will result in a underflow of the reservation once everything
> > > outstanding gets released.
> > 
> > That should still be safe.
> > 
> > Firstly at alloc_ordered_extent() we released the qgroup space already, thus
> > there will be no EXTENT_QGROUP_RESERVED range in extent-io tree anymore.
> 
> Ah yes, good point. And that is before any chance of an error, so now I
> I agree that this patch is correct, and that my analysis was wrong.
> 
> > 
> > Then at the final cleanup, qgroup_free_reserved_data_space() will
> > release/free nothing, because the extent-io tree no longer has the
> > corresponding range with EXTENT_QGROUP_RESERVED.
> > 
> > This is the core design of qgroup data reserved space, which allows us to
> > call btrfs_release/free_data() twice without double accounting.
> > 
> > > 
> > > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > > changeset make me worried, because they will run afoul of races with
> > > multiple threads touching the various bits. I don't see the bugs here,
> > > but the reservation lifetime is really tricky so I wouldn't be surprised
> > > if something like that was wrong too.
> > 
> > This free_refroot() is the common practice here. The extent-io tree based
> > accounting can only cover the reserved space before ordered extent is
> > allocated.
> > 
> > Then the reserved space is "transferred" to ordered extent, and eventually
> > go to the new data extent, and finally freed at
> > btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
> > 
> 
> That makes sense. I was also thinking of it in terms of "transferring"
> when working on this. And yes, there are other post-OE-alloc error paths
> (some that I wrote, lol) but I think they are a smell. However, a key
> difference is that in those cases, the reservation either already
> belongs to the delayed ref logic, or the OE still exists. So this is
> still a weird case where we do not have the protection of an OE but
> call a raw free_refroot.
> 
> I would guess that extent locking during OE creation probably saves us,
> though, from glancing at the code. generic/475 with qgroups on will be
> the ultimate arbiter here.
> 

Unfortunately, I actually don't think generic/475 will be much help here,
since it doesn't inject memory allocation errors. I don't know offhand
which tests, if any, do that. I grepped around for the names in
https://docs.kernel.org/fault-injection/fault-injection.html
and didn't find any yet.

So we can forget about my theoretical problems until I think of an
actual possible race or a good test :)

> > > 
> > > As of the last time I looked at this, I think cow_file_range handles
> > > this correctly as well. Looking really quickly now, it looks like maybe
> > > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > > where the corresponding reservation is coming from.
> > > 
> > > I think if you have indeed found a different codepath that makes a data
> > > reservation but doesn't release the qgroup part when allocating the
> > > ordered extent fails, then the fastest path to a fix is to do that at
> > > the same level as where it calls btrfs_check_data_free_space or (however
> > > it gets the reservation), as is currently done by the main
> > > ordered_extent allocation paths. With async_extent, we might need to
> > > plumb through the reserved extent changeset through the async chunk to
> > > do it perfectly...
> > 
> > I agree with you that, the extent io tree based double freeing prevention
> > should only be the last safe net, not something we should abuse when
> > possible.
> 
> Yes, this does still feel icky to me.
> 
> > 
> > But I can't find a better solution, mostly due to the fact that after the
> > btrfs_qgroup_release_data() call, the qgroup reserved space is already
> > released, and we have no way to undo that...
> 
> I so wish we could have a nice RAII qgroup reservation object flow
> through the whole system :)
> 
> > 
> > Maybe we can delayed the qgroup release/free calls until the ordered extent
> > @entry is allocated?
> > 
> 
> I do think that is likely the cleaner fix. Though it would have to also be
> after the igrab() check later in the series.
> 
> > Thanks,
> > Qu
> > 
> > 
> > > 
> > > Thanks,
> > > Boris
> > > 
> > > > 
> > > > Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >   fs/btrfs/ordered-data.c | 12 +++++++++---
> > > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > > index ae49f87b27e8..e44d3dd17caf 100644
> > > > --- a/fs/btrfs/ordered-data.c
> > > > +++ b/fs/btrfs/ordered-data.c
> > > > @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > > >   	struct btrfs_ordered_extent *entry;
> > > >   	int ret;
> > > >   	u64 qgroup_rsv = 0;
> > > > +	const bool is_nocow = (flags &
> > > > +	       ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> > > > -	if (flags &
> > > > -	    ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > > > +	if (is_nocow) {
> > > >   		/* For nocow write, we can release the qgroup rsv right now */
> > > >   		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> > > >   		if (ret < 0)
> > > > @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > > >   			return ERR_PTR(ret);
> > > >   	}
> > > >   	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > > > -	if (!entry)
> > > > +	if (!entry) {
> > > > +		if (!is_nocow)
> > > > +			btrfs_qgroup_free_refroot(inode->root->fs_info,
> > > > +						  btrfs_root_id(inode->root),
> > > > +						  qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> > > >   		return ERR_PTR(-ENOMEM);
> > > > +	}
> > > >   	entry->file_offset = file_offset;
> > > >   	entry->num_bytes = num_bytes;
> > > > -- 
> > > > 2.47.2
> > > > 
> > > 
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 22:33   ` Boris Burkov
  2025-05-07 23:11     ` Qu Wenruo
@ 2025-05-08 13:40     ` Filipe Manana
  2025-05-08 16:19       ` Boris Burkov
  1 sibling, 1 reply; 16+ messages in thread
From: Filipe Manana @ 2025-05-08 13:40 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs

On Wed, May 7, 2025 at 11:33 PM Boris Burkov <boris@bur.io> wrote:
>
> On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we fail to allocate an ordered extent for a COW write we end up leaking
> > a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > running the respective data delayed ref created by ordered extent
> > completion or when finishing the ordered extent in case an error happened).
> >
> > So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > ordered extent for a COW write.
>
> I haven't tried it myself yet, but I believe that this patch will double
> free reservation from the qgroup when this case occurs.

Nop, see below.

>
> Can you share the context where you saw this bug? Have you run fstests
> with qgroups or squotas enabled? I think this should show pretty quickly
> in generic/475 with qgroups on.

Yes, I have run fstests. I always do before sending a patch, no matter
how simple or trivial it is (or seems to be).

This isn't a scenario that can be triggered with fstests since there
are no test cases that inject memory allocation failures on ordered
extents or anything else.
generic/475 simulates IO failures with dm error, so I don't see why
you think that would be relevant when the problem here is on ordered
extent allocation failure and not IO errors.

>
> Consider, for example, the following execution of the dio case:
>
> btrfs_dio_iomap_begin
>   btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
>   btrfs_get_blocks_direct_write
>     btrfs_create_dio_extent
>       btrfs_alloc_ordered_extent
>         alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
>       // error propagates up
>     // error propagates up via PTR_ERR
>
> which brings us to the code:
> if (ret < 0)
>         goto unlock_err;
> ...
> unlock_err:
> ...
> if (dio_data->data_space_reserved) {
>         btrfs_free_reserved_data_space()
> }
>
> so the execution continues...
>
> btrfs_free_reserved_data_space
>   btrfs_qgroup_free_data
>     __btrfs_qgroup_release_data
>       qgroup_free_reserved_data
>         btrfs_qgroup_free_refroot
>
> This will result in a underflow of the reservation once everything
> outstanding gets released.

No, it won't.

For a COW write, before we failed to allocate the ordered extent, at
alloc_ordered_extent(), we called btrfs_qgroup_release_data().
That function will find all subranges in the inode's iotree marked
with EXTENT_QGROUP_RESERVED, clear that bit from them and sum their
lengths into @qgroup_rsv (local variable from alloc_ordered_extent()).

So calling qgroup_free_reserved_data() in an error path such as that
one will do nothing because it can't find any more ranges in the
inode's iotree marked with EXTENT_QGROUP_RESERVED.

So we leak reserved space... from the moment we called
btrfs_qgroup_release_data(), at alloc_ordered_extent(), we transferred
how we track the reserved space - which was intended to be in the
ordered extent and then when the ordered extent completes a delayed
data ref is created and when that delayed ref is ran we release the
space with btrfs_qgroup_free_refroot(). But since we failed to
allocate the ordered extent and the reserved space is no longer
tracked in the inode's iotree, we fail to release qgroup space.

Actually patch 3 in the patchset updates the comments at
alloc_ordered_extent() with those details to make it clear.

Hope it's more clear now what's going on and how qgroup tracks reserved space.

Thanks.



>
> Furthermore, raw calls to free_refroot in cases where we have a reserved
> changeset make me worried, because they will run afoul of races with
> multiple threads touching the various bits. I don't see the bugs here,
> but the reservation lifetime is really tricky so I wouldn't be surprised
> if something like that was wrong too.
>
> As of the last time I looked at this, I think cow_file_range handles
> this correctly as well. Looking really quickly now, it looks like maybe
> submit_one_async_extent() might not do a qgroup free, but I'm not sure
> where the corresponding reservation is coming from.
>
> I think if you have indeed found a different codepath that makes a data
> reservation but doesn't release the qgroup part when allocating the
> ordered extent fails, then the fastest path to a fix is to do that at
> the same level as where it calls btrfs_check_data_free_space or (however
> it gets the reservation), as is currently done by the main
> ordered_extent allocation paths. With async_extent, we might need to
> plumb through the reserved extent changeset through the async chunk to
> do it perfectly...
>
> Thanks,
> Boris
>
> >
> > Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ordered-data.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index ae49f87b27e8..e44d3dd17caf 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> >       struct btrfs_ordered_extent *entry;
> >       int ret;
> >       u64 qgroup_rsv = 0;
> > +     const bool is_nocow = (flags &
> > +            ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> >
> > -     if (flags &
> > -         ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > +     if (is_nocow) {
> >               /* For nocow write, we can release the qgroup rsv right now */
> >               ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> >               if (ret < 0)
> > @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> >                       return ERR_PTR(ret);
> >       }
> >       entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > -     if (!entry)
> > +     if (!entry) {
> > +             if (!is_nocow)
> > +                     btrfs_qgroup_free_refroot(inode->root->fs_info,
> > +                                               btrfs_root_id(inode->root),
> > +                                               qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> >               return ERR_PTR(-ENOMEM);
> > +     }
> >
> >       entry->file_offset = file_offset;
> >       entry->num_bytes = num_bytes;
> > --
> > 2.47.2
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-07 23:11     ` Qu Wenruo
  2025-05-07 23:39       ` Boris Burkov
@ 2025-05-08 13:43       ` Filipe Manana
  2025-05-08 16:07         ` Boris Burkov
  2025-05-08 21:57         ` Qu Wenruo
  1 sibling, 2 replies; 16+ messages in thread
From: Filipe Manana @ 2025-05-08 13:43 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Boris Burkov, linux-btrfs

On Thu, May 8, 2025 at 12:12 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> 在 2025/5/8 08:03, Boris Burkov 写道:
> > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> >> From: Filipe Manana <fdmanana@suse.com>
> >>
> >> If we fail to allocate an ordered extent for a COW write we end up leaking
> >> a qgroup data reservation since we called btrfs_qgroup_release_data() but
> >> we didn't call btrfs_qgroup_free_refroot() (which would happen when
> >> running the respective data delayed ref created by ordered extent
> >> completion or when finishing the ordered extent in case an error happened).
> >>
> >> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> >> ordered extent for a COW write.
> >
> > I haven't tried it myself yet, but I believe that this patch will double
> > free reservation from the qgroup when this case occurs.
> >
> > Can you share the context where you saw this bug? Have you run fstests
> > with qgroups or squotas enabled? I think this should show pretty quickly
> > in generic/475 with qgroups on.
> >
> > Consider, for example, the following execution of the dio case:
> >
> > btrfs_dio_iomap_begin
> >    btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> >    btrfs_get_blocks_direct_write
> >      btrfs_create_dio_extent
> >        btrfs_alloc_ordered_extent
> >          alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> >        // error propagates up
> >      // error propagates up via PTR_ERR
> >
> > which brings us to the code:
> > if (ret < 0)
> >          goto unlock_err;
> > ...
> > unlock_err:
> > ...
> > if (dio_data->data_space_reserved) {
> >          btrfs_free_reserved_data_space()
> > }
> >
> > so the execution continues...
> >
> > btrfs_free_reserved_data_space
> >    btrfs_qgroup_free_data
> >      __btrfs_qgroup_release_data
> >        qgroup_free_reserved_data
> >          btrfs_qgroup_free_refroot
> >
> > This will result in a underflow of the reservation once everything
> > outstanding gets released.
>
> That should still be safe.
>
> Firstly at alloc_ordered_extent() we released the qgroup space already,
> thus there will be no EXTENT_QGROUP_RESERVED range in extent-io tree
> anymore.
>
> Then at the final cleanup, qgroup_free_reserved_data_space() will
> release/free nothing, because the extent-io tree no longer has the
> corresponding range with EXTENT_QGROUP_RESERVED.
>
> This is the core design of qgroup data reserved space, which allows us
> to call btrfs_release/free_data() twice without double accounting.
>
> >
> > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > changeset make me worried, because they will run afoul of races with
> > multiple threads touching the various bits. I don't see the bugs here,
> > but the reservation lifetime is really tricky so I wouldn't be surprised
> > if something like that was wrong too.
>
> This free_refroot() is the common practice here. The extent-io tree
> based accounting can only cover the reserved space before ordered extent
> is allocated.
>
> Then the reserved space is "transferred" to ordered extent, and
> eventually go to the new data extent, and finally freed at
> btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
>
> >
> > As of the last time I looked at this, I think cow_file_range handles
> > this correctly as well. Looking really quickly now, it looks like maybe
> > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > where the corresponding reservation is coming from.
> >
> > I think if you have indeed found a different codepath that makes a data
> > reservation but doesn't release the qgroup part when allocating the
> > ordered extent fails, then the fastest path to a fix is to do that at
> > the same level as where it calls btrfs_check_data_free_space or (however
> > it gets the reservation), as is currently done by the main
> > ordered_extent allocation paths. With async_extent, we might need to
> > plumb through the reserved extent changeset through the async chunk to
> > do it perfectly...
>
> I agree with you that, the extent io tree based double freeing
> prevention should only be the last safe net, not something we should
> abuse when possible.
>
> But I can't find a better solution, mostly due to the fact that after
> the btrfs_qgroup_release_data() call, the qgroup reserved space is
> already released, and we have no way to undo that...
>
> Maybe we can delayed the qgroup release/free calls until the ordered
> extent @entry is allocated?

At some point I considered allocating first the ordered extent and
then doing the qgroup free/release calls, and that would fix the leak
too.
At the moment it seemed more clear to me the way I did, but if
everyone prefers that other way I'm fine with it and will change it.

Thanks.


>
> Thanks,
> Qu
>
>
> >
> > Thanks,
> > Boris
> >
> >>
> >> Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >> ---
> >>   fs/btrfs/ordered-data.c | 12 +++++++++---
> >>   1 file changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> >> index ae49f87b27e8..e44d3dd17caf 100644
> >> --- a/fs/btrfs/ordered-data.c
> >> +++ b/fs/btrfs/ordered-data.c
> >> @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> >>      struct btrfs_ordered_extent *entry;
> >>      int ret;
> >>      u64 qgroup_rsv = 0;
> >> +    const bool is_nocow = (flags &
> >> +           ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> >>
> >> -    if (flags &
> >> -        ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> >> +    if (is_nocow) {
> >>              /* For nocow write, we can release the qgroup rsv right now */
> >>              ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> >>              if (ret < 0)
> >> @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> >>                      return ERR_PTR(ret);
> >>      }
> >>      entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> >> -    if (!entry)
> >> +    if (!entry) {
> >> +            if (!is_nocow)
> >> +                    btrfs_qgroup_free_refroot(inode->root->fs_info,
> >> +                                              btrfs_root_id(inode->root),
> >> +                                              qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> >>              return ERR_PTR(-ENOMEM);
> >> +    }
> >>
> >>      entry->file_offset = file_offset;
> >>      entry->num_bytes = num_bytes;
> >> --
> >> 2.47.2
> >>
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-08 13:43       ` Filipe Manana
@ 2025-05-08 16:07         ` Boris Burkov
  2025-05-08 21:57         ` Qu Wenruo
  1 sibling, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-05-08 16:07 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Qu Wenruo, linux-btrfs

On Thu, May 08, 2025 at 02:43:36PM +0100, Filipe Manana wrote:
> On Thu, May 8, 2025 at 12:12 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> >
> >
> > 在 2025/5/8 08:03, Boris Burkov 写道:
> > > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > >> From: Filipe Manana <fdmanana@suse.com>
> > >>
> > >> If we fail to allocate an ordered extent for a COW write we end up leaking
> > >> a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > >> we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > >> running the respective data delayed ref created by ordered extent
> > >> completion or when finishing the ordered extent in case an error happened).
> > >>
> > >> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > >> ordered extent for a COW write.
> > >
> > > I haven't tried it myself yet, but I believe that this patch will double
> > > free reservation from the qgroup when this case occurs.
> > >
> > > Can you share the context where you saw this bug? Have you run fstests
> > > with qgroups or squotas enabled? I think this should show pretty quickly
> > > in generic/475 with qgroups on.
> > >
> > > Consider, for example, the following execution of the dio case:
> > >
> > > btrfs_dio_iomap_begin
> > >    btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> > >    btrfs_get_blocks_direct_write
> > >      btrfs_create_dio_extent
> > >        btrfs_alloc_ordered_extent
> > >          alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> > >        // error propagates up
> > >      // error propagates up via PTR_ERR
> > >
> > > which brings us to the code:
> > > if (ret < 0)
> > >          goto unlock_err;
> > > ...
> > > unlock_err:
> > > ...
> > > if (dio_data->data_space_reserved) {
> > >          btrfs_free_reserved_data_space()
> > > }
> > >
> > > so the execution continues...
> > >
> > > btrfs_free_reserved_data_space
> > >    btrfs_qgroup_free_data
> > >      __btrfs_qgroup_release_data
> > >        qgroup_free_reserved_data
> > >          btrfs_qgroup_free_refroot
> > >
> > > This will result in a underflow of the reservation once everything
> > > outstanding gets released.
> >
> > That should still be safe.
> >
> > Firstly at alloc_ordered_extent() we released the qgroup space already,
> > thus there will be no EXTENT_QGROUP_RESERVED range in extent-io tree
> > anymore.
> >
> > Then at the final cleanup, qgroup_free_reserved_data_space() will
> > release/free nothing, because the extent-io tree no longer has the
> > corresponding range with EXTENT_QGROUP_RESERVED.
> >
> > This is the core design of qgroup data reserved space, which allows us
> > to call btrfs_release/free_data() twice without double accounting.
> >
> > >
> > > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > > changeset make me worried, because they will run afoul of races with
> > > multiple threads touching the various bits. I don't see the bugs here,
> > > but the reservation lifetime is really tricky so I wouldn't be surprised
> > > if something like that was wrong too.
> >
> > This free_refroot() is the common practice here. The extent-io tree
> > based accounting can only cover the reserved space before ordered extent
> > is allocated.
> >
> > Then the reserved space is "transferred" to ordered extent, and
> > eventually go to the new data extent, and finally freed at
> > btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
> >
> > >
> > > As of the last time I looked at this, I think cow_file_range handles
> > > this correctly as well. Looking really quickly now, it looks like maybe
> > > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > > where the corresponding reservation is coming from.
> > >
> > > I think if you have indeed found a different codepath that makes a data
> > > reservation but doesn't release the qgroup part when allocating the
> > > ordered extent fails, then the fastest path to a fix is to do that at
> > > the same level as where it calls btrfs_check_data_free_space or (however
> > > it gets the reservation), as is currently done by the main
> > > ordered_extent allocation paths. With async_extent, we might need to
> > > plumb through the reserved extent changeset through the async chunk to
> > > do it perfectly...
> >
> > I agree with you that, the extent io tree based double freeing
> > prevention should only be the last safe net, not something we should
> > abuse when possible.
> >
> > But I can't find a better solution, mostly due to the fact that after
> > the btrfs_qgroup_release_data() call, the qgroup reserved space is
> > already released, and we have no way to undo that...
> >
> > Maybe we can delayed the qgroup release/free calls until the ordered
> > extent @entry is allocated?
> 
> At some point I considered allocating first the ordered extent and
> then doing the qgroup free/release calls, and that would fix the leak
> too.
> At the moment it seemed more clear to me the way I did, but if
> everyone prefers that other way I'm fine with it and will change it.

I personally prefer not adding more naked btrfs_qgroup_free_refroots and
think it's better to move the release call. Either way, please feel free
to add:

Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Thanks.
> 
> 
> >
> > Thanks,
> > Qu
> >
> >
> > >
> > > Thanks,
> > > Boris
> > >
> > >>
> > >> Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > >> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >> ---
> > >>   fs/btrfs/ordered-data.c | 12 +++++++++---
> > >>   1 file changed, 9 insertions(+), 3 deletions(-)
> > >>
> > >> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > >> index ae49f87b27e8..e44d3dd17caf 100644
> > >> --- a/fs/btrfs/ordered-data.c
> > >> +++ b/fs/btrfs/ordered-data.c
> > >> @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >>      struct btrfs_ordered_extent *entry;
> > >>      int ret;
> > >>      u64 qgroup_rsv = 0;
> > >> +    const bool is_nocow = (flags &
> > >> +           ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> > >>
> > >> -    if (flags &
> > >> -        ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > >> +    if (is_nocow) {
> > >>              /* For nocow write, we can release the qgroup rsv right now */
> > >>              ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> > >>              if (ret < 0)
> > >> @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >>                      return ERR_PTR(ret);
> > >>      }
> > >>      entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > >> -    if (!entry)
> > >> +    if (!entry) {
> > >> +            if (!is_nocow)
> > >> +                    btrfs_qgroup_free_refroot(inode->root->fs_info,
> > >> +                                              btrfs_root_id(inode->root),
> > >> +                                              qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> > >>              return ERR_PTR(-ENOMEM);
> > >> +    }
> > >>
> > >>      entry->file_offset = file_offset;
> > >>      entry->num_bytes = num_bytes;
> > >> --
> > >> 2.47.2
> > >>
> > >
> >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-08 13:40     ` Filipe Manana
@ 2025-05-08 16:19       ` Boris Burkov
  0 siblings, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-05-08 16:19 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Thu, May 08, 2025 at 02:40:22PM +0100, Filipe Manana wrote:
> On Wed, May 7, 2025 at 11:33 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > If we fail to allocate an ordered extent for a COW write we end up leaking
> > > a qgroup data reservation since we called btrfs_qgroup_release_data() but
> > > we didn't call btrfs_qgroup_free_refroot() (which would happen when
> > > running the respective data delayed ref created by ordered extent
> > > completion or when finishing the ordered extent in case an error happened).
> > >
> > > So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
> > > ordered extent for a COW write.
> >
> > I haven't tried it myself yet, but I believe that this patch will double
> > free reservation from the qgroup when this case occurs.
> 
> Nop, see below.
> 
> >
> > Can you share the context where you saw this bug? Have you run fstests
> > with qgroups or squotas enabled? I think this should show pretty quickly
> > in generic/475 with qgroups on.
> 
> Yes, I have run fstests. I always do before sending a patch, no matter
> how simple or trivial it is (or seems to be).

For the record, I was not suggesting that you hadn't run fstests, I just
wasn't sure if you ran them with qgroups on for every test as well.

> 
> This isn't a scenario that can be triggered with fstests since there
> are no test cases that inject memory allocation failures on ordered
> extents or anything else.
> generic/475 simulates IO failures with dm error, so I don't see why
> you think that would be relevant when the problem here is on ordered
> extent allocation failure and not IO errors.
> 

Yes, I noticed that later, while discussing with Qu.

> >
> > Consider, for example, the following execution of the dio case:
> >
> > btrfs_dio_iomap_begin
> >   btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
> >   btrfs_get_blocks_direct_write
> >     btrfs_create_dio_extent
> >       btrfs_alloc_ordered_extent
> >         alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
> >       // error propagates up
> >     // error propagates up via PTR_ERR
> >
> > which brings us to the code:
> > if (ret < 0)
> >         goto unlock_err;
> > ...
> > unlock_err:
> > ...
> > if (dio_data->data_space_reserved) {
> >         btrfs_free_reserved_data_space()
> > }
> >
> > so the execution continues...
> >
> > btrfs_free_reserved_data_space
> >   btrfs_qgroup_free_data
> >     __btrfs_qgroup_release_data
> >       qgroup_free_reserved_data
> >         btrfs_qgroup_free_refroot
> >
> > This will result in a underflow of the reservation once everything
> > outstanding gets released.
> 
> No, it won't.
> 
> For a COW write, before we failed to allocate the ordered extent, at
> alloc_ordered_extent(), we called btrfs_qgroup_release_data().
> That function will find all subranges in the inode's iotree marked
> with EXTENT_QGROUP_RESERVED, clear that bit from them and sum their
> lengths into @qgroup_rsv (local variable from alloc_ordered_extent()).
> 
> So calling qgroup_free_reserved_data() in an error path such as that
> one will do nothing because it can't find any more ranges in the
> inode's iotree marked with EXTENT_QGROUP_RESERVED.
> 
> So we leak reserved space... from the moment we called
> btrfs_qgroup_release_data(), at alloc_ordered_extent(), we transferred
> how we track the reserved space - which was intended to be in the
> ordered extent and then when the ordered extent completes a delayed
> data ref is created and when that delayed ref is ran we release the
> space with btrfs_qgroup_free_refroot(). But since we failed to
> allocate the ordered extent and the reserved space is no longer
> tracked in the inode's iotree, we fail to release qgroup space.
> 
> Actually patch 3 in the patchset updates the comments at
> alloc_ordered_extent() with those details to make it clear.
> 
> Hope it's more clear now what's going on and how qgroup tracks reserved space.
> 
> Thanks.
> 
> 
> 
> >
> > Furthermore, raw calls to free_refroot in cases where we have a reserved
> > changeset make me worried, because they will run afoul of races with
> > multiple threads touching the various bits. I don't see the bugs here,
> > but the reservation lifetime is really tricky so I wouldn't be surprised
> > if something like that was wrong too.
> >
> > As of the last time I looked at this, I think cow_file_range handles
> > this correctly as well. Looking really quickly now, it looks like maybe
> > submit_one_async_extent() might not do a qgroup free, but I'm not sure
> > where the corresponding reservation is coming from.
> >
> > I think if you have indeed found a different codepath that makes a data
> > reservation but doesn't release the qgroup part when allocating the
> > ordered extent fails, then the fastest path to a fix is to do that at
> > the same level as where it calls btrfs_check_data_free_space or (however
> > it gets the reservation), as is currently done by the main
> > ordered_extent allocation paths. With async_extent, we might need to
> > plumb through the reserved extent changeset through the async chunk to
> > do it perfectly...
> >
> > Thanks,
> > Boris
> >
> > >
> > > Fixes: 7dbeaad0af7d ("btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak")
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/ordered-data.c | 12 +++++++++---
> > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > > index ae49f87b27e8..e44d3dd17caf 100644
> > > --- a/fs/btrfs/ordered-data.c
> > > +++ b/fs/btrfs/ordered-data.c
> > > @@ -153,9 +153,10 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >       struct btrfs_ordered_extent *entry;
> > >       int ret;
> > >       u64 qgroup_rsv = 0;
> > > +     const bool is_nocow = (flags &
> > > +            ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC)));
> > >
> > > -     if (flags &
> > > -         ((1U << BTRFS_ORDERED_NOCOW) | (1U << BTRFS_ORDERED_PREALLOC))) {
> > > +     if (is_nocow) {
> > >               /* For nocow write, we can release the qgroup rsv right now */
> > >               ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes, &qgroup_rsv);
> > >               if (ret < 0)
> > > @@ -170,8 +171,13 @@ static struct btrfs_ordered_extent *alloc_ordered_extent(
> > >                       return ERR_PTR(ret);
> > >       }
> > >       entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
> > > -     if (!entry)
> > > +     if (!entry) {
> > > +             if (!is_nocow)
> > > +                     btrfs_qgroup_free_refroot(inode->root->fs_info,
> > > +                                               btrfs_root_id(inode->root),
> > > +                                               qgroup_rsv, BTRFS_QGROUP_RSV_DATA);
> > >               return ERR_PTR(-ENOMEM);
> > > +     }
> > >
> > >       entry->file_offset = file_offset;
> > >       entry->num_bytes = num_bytes;
> > > --
> > > 2.47.2
> > >

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation
  2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
                   ` (4 preceding siblings ...)
  2025-05-07 17:23 ` [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() fdmanana
@ 2025-05-08 16:20 ` Boris Burkov
  5 siblings, 0 replies; 16+ messages in thread
From: Boris Burkov @ 2025-05-08 16:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 07, 2025 at 06:23:12PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Some fixes and cleanups around ordered extent allocation. Details in the
> change logs.

Now that we discussed the first patch in detail,
Reviewed-by: Boris Burkov <boris@bur.io>

> 
> Filipe Manana (5):
>   btrfs: fix qgroup reservation leak on failure to allocate ordered extent
>   btrfs: check we grabbed inode reference when allocating an ordered extent
>   btrfs: fold error checks when allocating ordered extent and update comments
>   btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
>   btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()
> 
>  fs/btrfs/block-group.c  | 10 ++++----
>  fs/btrfs/block-group.h  |  2 +-
>  fs/btrfs/direct-io.c    |  3 +--
>  fs/btrfs/extent-tree.c  |  8 +++----
>  fs/btrfs/extent-tree.h  |  2 +-
>  fs/btrfs/inode.c        | 10 ++++----
>  fs/btrfs/ordered-data.c | 51 ++++++++++++++++++++++++++---------------
>  7 files changed, 50 insertions(+), 36 deletions(-)
> 
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent
  2025-05-08 13:43       ` Filipe Manana
  2025-05-08 16:07         ` Boris Burkov
@ 2025-05-08 21:57         ` Qu Wenruo
  1 sibling, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2025-05-08 21:57 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Boris Burkov, linux-btrfs



在 2025/5/8 23:13, Filipe Manana 写道:
> On Thu, May 8, 2025 at 12:12 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> 在 2025/5/8 08:03, Boris Burkov 写道:
>>> On Wed, May 07, 2025 at 06:23:13PM +0100, fdmanana@kernel.org wrote:
>>>> From: Filipe Manana <fdmanana@suse.com>
>>>>
>>>> If we fail to allocate an ordered extent for a COW write we end up leaking
>>>> a qgroup data reservation since we called btrfs_qgroup_release_data() but
>>>> we didn't call btrfs_qgroup_free_refroot() (which would happen when
>>>> running the respective data delayed ref created by ordered extent
>>>> completion or when finishing the ordered extent in case an error happened).
>>>>
>>>> So make sure we call btrfs_qgroup_free_refroot() if we fail to allocate an
>>>> ordered extent for a COW write.
>>>
>>> I haven't tried it myself yet, but I believe that this patch will double
>>> free reservation from the qgroup when this case occurs.
>>>
>>> Can you share the context where you saw this bug? Have you run fstests
>>> with qgroups or squotas enabled? I think this should show pretty quickly
>>> in generic/475 with qgroups on.
>>>
>>> Consider, for example, the following execution of the dio case:
>>>
>>> btrfs_dio_iomap_begin
>>>     btrfs_check_data_free_space // reserves the data into `reserved`, sets dio_data->data_space_reserved
>>>     btrfs_get_blocks_direct_write
>>>       btrfs_create_dio_extent
>>>         btrfs_alloc_ordered_extent
>>>           alloc_ordered_extent // fails and frees refroot, reserved is "wrong" now.
>>>         // error propagates up
>>>       // error propagates up via PTR_ERR
>>>
>>> which brings us to the code:
>>> if (ret < 0)
>>>           goto unlock_err;
>>> ...
>>> unlock_err:
>>> ...
>>> if (dio_data->data_space_reserved) {
>>>           btrfs_free_reserved_data_space()
>>> }
>>>
>>> so the execution continues...
>>>
>>> btrfs_free_reserved_data_space
>>>     btrfs_qgroup_free_data
>>>       __btrfs_qgroup_release_data
>>>         qgroup_free_reserved_data
>>>           btrfs_qgroup_free_refroot
>>>
>>> This will result in a underflow of the reservation once everything
>>> outstanding gets released.
>>
>> That should still be safe.
>>
>> Firstly at alloc_ordered_extent() we released the qgroup space already,
>> thus there will be no EXTENT_QGROUP_RESERVED range in extent-io tree
>> anymore.
>>
>> Then at the final cleanup, qgroup_free_reserved_data_space() will
>> release/free nothing, because the extent-io tree no longer has the
>> corresponding range with EXTENT_QGROUP_RESERVED.
>>
>> This is the core design of qgroup data reserved space, which allows us
>> to call btrfs_release/free_data() twice without double accounting.
>>
>>>
>>> Furthermore, raw calls to free_refroot in cases where we have a reserved
>>> changeset make me worried, because they will run afoul of races with
>>> multiple threads touching the various bits. I don't see the bugs here,
>>> but the reservation lifetime is really tricky so I wouldn't be surprised
>>> if something like that was wrong too.
>>
>> This free_refroot() is the common practice here. The extent-io tree
>> based accounting can only cover the reserved space before ordered extent
>> is allocated.
>>
>> Then the reserved space is "transferred" to ordered extent, and
>> eventually go to the new data extent, and finally freed at
>> btrfs_qgroup_account_extents(), which also goes the free_refroot() way.
>>
>>>
>>> As of the last time I looked at this, I think cow_file_range handles
>>> this correctly as well. Looking really quickly now, it looks like maybe
>>> submit_one_async_extent() might not do a qgroup free, but I'm not sure
>>> where the corresponding reservation is coming from.
>>>
>>> I think if you have indeed found a different codepath that makes a data
>>> reservation but doesn't release the qgroup part when allocating the
>>> ordered extent fails, then the fastest path to a fix is to do that at
>>> the same level as where it calls btrfs_check_data_free_space or (however
>>> it gets the reservation), as is currently done by the main
>>> ordered_extent allocation paths. With async_extent, we might need to
>>> plumb through the reserved extent changeset through the async chunk to
>>> do it perfectly...
>>
>> I agree with you that, the extent io tree based double freeing
>> prevention should only be the last safe net, not something we should
>> abuse when possible.
>>
>> But I can't find a better solution, mostly due to the fact that after
>> the btrfs_qgroup_release_data() call, the qgroup reserved space is
>> already released, and we have no way to undo that...
>>
>> Maybe we can delayed the qgroup release/free calls until the ordered
>> extent @entry is allocated?
> 
> At some point I considered allocating first the ordered extent and
> then doing the qgroup free/release calls, and that would fix the leak
> too.
> At the moment it seemed more clear to me the way I did, but if
> everyone prefers that other way I'm fine with it and will change it.

I'm fine either way for the fix.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> 
> Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-05-08 21:57 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-07 17:23 [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation fdmanana
2025-05-07 17:23 ` [PATCH 1/5] btrfs: fix qgroup reservation leak on failure to allocate ordered extent fdmanana
2025-05-07 22:33   ` Boris Burkov
2025-05-07 23:11     ` Qu Wenruo
2025-05-07 23:39       ` Boris Burkov
2025-05-08  0:08         ` Boris Burkov
2025-05-08 13:43       ` Filipe Manana
2025-05-08 16:07         ` Boris Burkov
2025-05-08 21:57         ` Qu Wenruo
2025-05-08 13:40     ` Filipe Manana
2025-05-08 16:19       ` Boris Burkov
2025-05-07 17:23 ` [PATCH 2/5] btrfs: check we grabbed inode reference when allocating an " fdmanana
2025-05-07 17:23 ` [PATCH 3/5] btrfs: fold error checks when allocating ordered extent and update comments fdmanana
2025-05-07 17:23 ` [PATCH 4/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes() fdmanana
2025-05-07 17:23 ` [PATCH 5/5] btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent() fdmanana
2025-05-08 16:20 ` [PATCH 0/5] btrfs: fixes and cleanups for ordered extent allocation Boris Burkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox