public inbox for linux-btrfs@vger.kernel.org
* [PATCH v2 0/5] btrfs: block_group refcounting fixes
@ 2025-03-10 20:07 Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We have observed a number of WARNINGs in the Meta fleet which are the
result of a block_group refcount underflowing. The refcount error can
happen at any point in the block group's lifetime, so it is hard to
conclude that we have reproduced/fixed all the bugs. That said, I
believe I have found a few here that will hopefully improve things.

The main thrust of this patch series is that we need to take the
fs_info->unused_bgs_lock spin lock when modifying the bg_list of a
block_group. There are a number of code paths that check that list_head
for emptiness and then add/delete entries and get/put references
accordingly, expecting the check and the update to be atomic. If any
other thread modifies the list in between without locking, that logic
breaks. This is most obviously evident with btrfs_mark_bg_unused.

I could imagine protecting bg_list's empty/not-empty state with a
dedicated, smaller-scope lock, but unused_bgs_lock is already the
locking strategy used to synchronize the reclaim/unused lists, so it
seems reasonable to just reuse it.

In addition, I attempted to simplify the refcounting logic in the
discard workfn; the last time I fixed a bug there, I made it far too
subtle. Hopefully this more explicit variant is easier to analyze in
the future.
--
Changelog
v2:
- fix mistaken placement of a btrfs_block_group put in the 2nd
  (locking) patch, when it ought to be in the 4th (ref-counting) patch.
- improve several commit messages with more details and using full
  function names instead of shorthand.
- add comments about over-paranoid locking.
- rename second patch to reflect that it is hardening rather than
  fixing any bugs.
- fix bad comment and variable names in btrfs_link_bg_list.


Boris Burkov (5):
  btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  btrfs: harden bg->bg_list against list_del races
  btrfs: make discard_workfn block_group ref explicit
  btrfs: explicitly ref count block_group on new_bgs list
  btrfs: codify pattern for adding block_group to bg_list

 fs/btrfs/block-group.c | 57 +++++++++++++++++++++++++-----------------
 fs/btrfs/discard.c     | 34 ++++++++++++-------------
 fs/btrfs/extent-tree.c |  8 ++++++
 fs/btrfs/transaction.c | 13 ++++++++++
 4 files changed, 71 insertions(+), 41 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-10 20:07 [PATCH v2 0/5] btrfs: block_group refcounting fixes Boris Burkov
@ 2025-03-10 20:07 ` Boris Burkov
  2025-03-11 19:02   ` David Sterba
  2025-03-10 20:07 ` [PATCH v2 2/5] btrfs: harden bg->bg_list against list_del races Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Block group creation is done in two phases, which results in a slightly
unintuitive property: a block group can have extents allocated from it
(or freed back to it) after btrfs_make_block_group adds it to the
space_info with btrfs_add_bg_to_space_info, but before creation fully
completes in btrfs_create_pending_block_groups. As a result, it is
possible for a block group to go unused and have 'btrfs_mark_bg_unused'
called on it concurrently with 'btrfs_create_pending_block_groups'. This
causes a number of issues, which were fixed with the block group flag
'BLOCK_GROUP_FLAG_NEW'.

However, this fix is not quite complete. Since it does not use the
unused_bgs_lock, it is possible for the following race to occur:

btrfs_create_pending_block_groups            btrfs_mark_bg_unused
                                           if list_empty // false
        list_del_init
        clear_bit
                                           else if (test_bit) // true
                                                list_move_tail

And we get into the exact same broken ref count and invalid new_bgs
state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
prevent.

The broken refcount aspect will result in a warning like:
[ 1272.943113] ------------[ cut here ]------------
[ 1272.943527] refcount_t: underflow; use-after-free.
[ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
[ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
[ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
[ 1272.946368] Tainted: [W]=WARN
[ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
[ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
[ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
[ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
[ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
[ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
[ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
[ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
[ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
[ 1272.954474] Call Trace:
[ 1272.954655]  <TASK>
[ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
[ 1272.955173]  ? __warn.cold+0x93/0xd7
[ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
[ 1272.955816]  ? report_bug+0xe7/0x120
[ 1272.956103]  ? handle_bug+0x53/0x90
[ 1272.956424]  ? exc_invalid_op+0x13/0x60
[ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
[ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
[ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
[ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
[ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
[ 1272.958729]  process_one_work+0x130/0x290
[ 1272.959026]  worker_thread+0x2ea/0x420
[ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
[ 1272.959644]  kthread+0xd7/0x1c0
[ 1272.959872]  ? __pfx_kthread+0x10/0x10
[ 1272.960172]  ret_from_fork+0x30/0x50
[ 1272.960474]  ? __pfx_kthread+0x10/0x10
[ 1272.960745]  ret_from_fork_asm+0x1a/0x30
[ 1272.961035]  </TASK>
[ 1272.961238] ---[ end trace 0000000000000000 ]---

We have seen warnings like this from the async discard workfn as well.
It is most likely to happen after a relocation finishes, which cancels
discard, tears down the block group, etc.

Fix this fully by taking the lock around the list_del_init + clear_bit
so that the two are done atomically.

Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 64f0268dcf02..2db1497b58d9 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		/* Already aborted the transaction if it failed. */
 next:
 		btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
+
+		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
+		spin_unlock(&fs_info->unused_bgs_lock);
 
 		/*
 		 * If the block group is still unused, add it to the list of
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 2/5] btrfs: harden bg->bg_list against list_del races
  2025-03-10 20:07 [PATCH v2 0/5] btrfs: block_group refcounting fixes Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
@ 2025-03-10 20:07 ` Boris Burkov
  2025-03-11 11:35   ` Filipe Manana
  2025-03-10 20:07 ` [PATCH v2 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

As far as I can tell, these calls of list_del_init on bg_list cannot
run concurrently with btrfs_mark_bg_unused or btrfs_mark_bg_to_reclaim,
as they are in transaction error paths and situations where the block
group is read-only.

However, if there is any chance at all of racing with mark_bg_unused,
or with a different future user of bg_list, it is better to be safe
than sorry.

Otherwise we risk the following interleaving (bg_list refcount in parentheses):

T1 (some random op)                       T2 (btrfs_mark_bg_unused)
                                        !list_empty(&bg->bg_list); (1)
list_del_init(&bg->bg_list); (1)
                                        list_move_tail (1)
btrfs_put_block_group (0)
                                        btrfs_delete_unused_bgs
                                             bg = list_first_entry
                                             list_del_init(&bg->bg_list);
                                             btrfs_put_block_group(bg); (-1)

Ultimately, this results in a broken refcount that hits zero one deref
too early, and the real final deref underflows the refcount, resulting
in a WARNING.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent-tree.c |  8 ++++++++
 fs/btrfs/transaction.c | 12 ++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5de1a1293c93..5ead2f4976e4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2868,7 +2868,15 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 						   block_group->length,
 						   &trimmed);
 
+		/*
+		 * Not strictly necessary to lock, as the block_group should be
+		 * read-only from btrfs_delete_unused_bgs.
+		 */
+		ASSERT(block_group->ro);
+		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
+		spin_unlock(&fs_info->unused_bgs_lock);
+
 		btrfs_unfreeze_block_group(block_group);
 		btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index db8fe291d010..470dfc3a1a5c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -160,7 +160,13 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 			cache = list_first_entry(&transaction->deleted_bgs,
 						 struct btrfs_block_group,
 						 bg_list);
+			/*
+			 * Not strictly necessary to lock, as no other task will be using a
+			 * block_group on the deleted_bgs list during a transaction abort.
+			 */
+			spin_lock(&transaction->fs_info->unused_bgs_lock);
 			list_del_init(&cache->bg_list);
+			spin_unlock(&transaction->fs_info->unused_bgs_lock);
 			btrfs_unfreeze_block_group(cache);
 			btrfs_put_block_group(cache);
 		}
@@ -2096,7 +2102,13 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
 
        list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
                btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
+		/*
+		* Not strictly necessary to lock, as no other task will be using a
+		* block_group on the new_bgs list during a transaction abort.
+		*/
+	       spin_lock(&fs_info->unused_bgs_lock);
                list_del_init(&block_group->bg_list);
+	       spin_unlock(&fs_info->unused_bgs_lock);
        }
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 3/5] btrfs: make discard_workfn block_group ref explicit
  2025-03-10 20:07 [PATCH v2 0/5] btrfs: block_group refcounting fixes Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 2/5] btrfs: harden bg->bg_list against list_del races Boris Burkov
@ 2025-03-10 20:07 ` Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
  4 siblings, 0 replies; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently, the async discard machinery owns a ref to the block_group
when the block_group is queued on a discard list. However, to handle
races with discard cancellation and the discard workfn, we have some
cute logic to detect that the block_group is *currently* running in the
workfn, to protect the workfn's usage amidst cancellation.

As far as I can tell, this doesn't have any overt bugs (though
finish_discard_pass and remove_from_discard_list racing can have a
surprising outcome for the caller of remove_from_discard_list, in that
the block group is added back to the list at the end).

But it is needlessly complicated to rely on locking and the nullity of
discard_ctl->block_group. Simplify this significantly by just taking a
refcount while we are in the workfn and unconditionally dropping it in
both the remove and workfn paths, regardless of whether they race.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/discard.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index e815d165cccc..d6eef4bd9e9d 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -167,13 +167,7 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
 	block_group->discard_eligible_time = 0;
 	queued = !list_empty(&block_group->discard_list);
 	list_del_init(&block_group->discard_list);
-	/*
-	 * If the block group is currently running in the discard workfn, we
-	 * don't want to deref it, since it's still being used by the workfn.
-	 * The workfn will notice this case and deref the block group when it is
-	 * finished.
-	 */
-	if (queued && !running)
+	if (queued)
 		btrfs_put_block_group(block_group);
 
 	spin_unlock(&discard_ctl->lock);
@@ -260,9 +254,10 @@ static struct btrfs_block_group *peek_discard_list(
 			block_group->discard_cursor = block_group->start;
 			block_group->discard_state = BTRFS_DISCARD_EXTENTS;
 		}
-		discard_ctl->block_group = block_group;
 	}
 	if (block_group) {
+		btrfs_get_block_group(block_group);
+		discard_ctl->block_group = block_group;
 		*discard_state = block_group->discard_state;
 		*discard_index = block_group->discard_index;
 	}
@@ -493,9 +488,20 @@ static void btrfs_discard_workfn(struct work_struct *work)
 
 	block_group = peek_discard_list(discard_ctl, &discard_state,
 					&discard_index, now);
-	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+	if (!block_group)
 		return;
+	if (!btrfs_run_discard_work(discard_ctl)) {
+		spin_lock(&discard_ctl->lock);
+		btrfs_put_block_group(block_group);
+		discard_ctl->block_group = NULL;
+		spin_unlock(&discard_ctl->lock);
+		return;
+	}
 	if (now < block_group->discard_eligible_time) {
+		spin_lock(&discard_ctl->lock);
+		btrfs_put_block_group(block_group);
+		discard_ctl->block_group = NULL;
+		spin_unlock(&discard_ctl->lock);
 		btrfs_discard_schedule_work(discard_ctl, false);
 		return;
 	}
@@ -547,15 +553,7 @@ static void btrfs_discard_workfn(struct work_struct *work)
 	spin_lock(&discard_ctl->lock);
 	discard_ctl->prev_discard = trimmed;
 	discard_ctl->prev_discard_time = now;
-	/*
-	 * If the block group was removed from the discard list while it was
-	 * running in this workfn, then we didn't deref it, since this function
-	 * still owned that reference. But we set the discard_ctl->block_group
-	 * back to NULL, so we can use that condition to know that now we need
-	 * to deref the block_group.
-	 */
-	if (discard_ctl->block_group == NULL)
-		btrfs_put_block_group(block_group);
+	btrfs_put_block_group(block_group);
 	discard_ctl->block_group = NULL;
 	__btrfs_discard_schedule_work(discard_ctl, now, false);
 	spin_unlock(&discard_ctl->lock);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-10 20:07 [PATCH v2 0/5] btrfs: block_group refcounting fixes Boris Burkov
                   ` (2 preceding siblings ...)
  2025-03-10 20:07 ` [PATCH v2 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
@ 2025-03-10 20:07 ` Boris Burkov
  2025-03-10 20:07 ` [PATCH v2 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
  4 siblings, 0 replies; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

All other users of the bg_list list_head inc the refcount when adding to
a list and dec it when deleting from the list. Just for the sake of
uniformity and to try to avoid refcounting bugs, do it for this list as
well.

This does not fix any known ref-counting bug, as the reference belongs
to a single task (trans_handle is not shared and this represents
trans_handle->new_bgs linkage) and will not lose its original refcount
while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against
ref-counting errors "moving" the block group to the unused list without
taking a ref.

With that said, I still believe it is simpler to just hold the extra ref
count for this list user as well.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 2 ++
 fs/btrfs/transaction.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 2db1497b58d9..e4071897c9a8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
+		btrfs_put_block_group(block_group);
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		/*
@@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	}
 #endif
 
+	btrfs_get_block_group(cache);
 	list_add_tail(&cache->bg_list, &trans->new_bgs);
 	btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 470dfc3a1a5c..f26a394a9ec5 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2108,6 +2108,7 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
 		*/
 	       spin_lock(&fs_info->unused_bgs_lock);
                list_del_init(&block_group->bg_list);
+	       btrfs_put_block_group(block_group);
 	       spin_unlock(&fs_info->unused_bgs_lock);
        }
 }
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v2 5/5] btrfs: codify pattern for adding block_group to bg_list
  2025-03-10 20:07 [PATCH v2 0/5] btrfs: block_group refcounting fixes Boris Burkov
                   ` (3 preceding siblings ...)
  2025-03-10 20:07 ` [PATCH v2 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
@ 2025-03-10 20:07 ` Boris Burkov
  2025-03-11 19:03   ` David Sterba
  4 siblings, 1 reply; 9+ messages in thread
From: Boris Burkov @ 2025-03-10 20:07 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Similar to mark_bg_unused and mark_bg_to_reclaim, we have a few places
that use bg_list with refcounting, mostly for retrying failed
reclaim/deletion of unused block groups.

These have custom logic for handling locking and refcounting the bg_list
properly, but they all actually want to do the same thing, so pull that
logic out into a helper. Unfortunately, mark_bg_unused does still need
the NEW flag to avoid prematurely marking block groups unused (even if
the refcount is fine, we don't want to interfere with bg creation), so
it cannot use the new helper.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 54 +++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e4071897c9a8..5b13e086c7a8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1455,6 +1455,31 @@ static bool clean_pinned_extents(struct btrfs_trans_handle *trans,
 	return ret == 0;
 }
 
+/*
+ * Link the block_group to a list via bg_list.
+ *
+ * Use this rather than list_add_tail directly to ensure proper respect
+ * to locking and refcounting.
+ *
+ * @bg: the block_group to link to the list.
+ * @list: the list to link it to.
+ * Returns: true if the bg was linked with a refcount bump and false otherwise.
+ */
+static bool btrfs_link_bg_list(struct btrfs_block_group *bg, struct list_head *list)
+{
+	struct btrfs_fs_info *fs_info = bg->fs_info;
+	bool added = false;
+
+	spin_lock(&fs_info->unused_bgs_lock);
+	if (list_empty(&bg->bg_list)) {
+		btrfs_get_block_group(bg);
+		list_add_tail(&bg->bg_list, list);
+		added = true;
+	}
+	spin_unlock(&fs_info->unused_bgs_lock);
+	return added;
+}
+
 /*
  * Process the unused_bgs list and remove any that don't have any allocated
  * space inside of them.
@@ -1570,8 +1595,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 			 * drop under the "next" label for the
 			 * fs_info->unused_bgs list.
 			 */
-			btrfs_get_block_group(block_group);
-			list_add_tail(&block_group->bg_list, &retry_list);
+			btrfs_link_bg_list(block_group, &retry_list);
 
 			trace_btrfs_skip_unused_block_group(block_group);
 			spin_unlock(&block_group->lock);
@@ -1968,20 +1992,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		spin_unlock(&space_info->lock);
 
 next:
-		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
-			/* Refcount held by the reclaim_bgs list after splice. */
-			spin_lock(&fs_info->unused_bgs_lock);
-			/*
-			 * This block group might be added to the unused list
-			 * during the above process. Move it back to the
-			 * reclaim list otherwise.
-			 */
-			if (list_empty(&bg->bg_list)) {
-				btrfs_get_block_group(bg);
-				list_add_tail(&bg->bg_list, &retry_list);
-			}
-			spin_unlock(&fs_info->unused_bgs_lock);
-		}
+		if (ret && !READ_ONCE(space_info->periodic_reclaim))
+			btrfs_link_bg_list(bg, &retry_list);
 		btrfs_put_block_group(bg);
 
 		mutex_unlock(&fs_info->reclaim_bgs_lock);
@@ -2021,13 +2033,8 @@ void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
 
-	spin_lock(&fs_info->unused_bgs_lock);
-	if (list_empty(&bg->bg_list)) {
-		btrfs_get_block_group(bg);
+	if (btrfs_link_bg_list(bg, &fs_info->reclaim_bgs))
 		trace_btrfs_add_reclaim_block_group(bg);
-		list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
-	}
-	spin_unlock(&fs_info->unused_bgs_lock);
 }
 
 static int read_bg_from_eb(struct btrfs_fs_info *fs_info, const struct btrfs_key *key,
@@ -2940,8 +2947,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	}
 #endif
 
-	btrfs_get_block_group(cache);
-	list_add_tail(&cache->bg_list, &trans->new_bgs);
+	btrfs_link_bg_list(cache, &trans->new_bgs);
 	btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
 
 	set_avail_alloc_bits(fs_info, type);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 2/5] btrfs: harden bg->bg_list against list_del races
  2025-03-10 20:07 ` [PATCH v2 2/5] btrfs: harden bg->bg_list against list_del races Boris Burkov
@ 2025-03-11 11:35   ` Filipe Manana
  0 siblings, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2025-03-11 11:35 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Mon, Mar 10, 2025 at 8:06 PM Boris Burkov <boris@bur.io> wrote:
>
> As far as I can tell, these calls of list_del_init on bg_list cannot
> run concurrently with btrfs_mark_bg_unused or btrfs_mark_bg_to_reclaim,
> as they are in transaction error paths and situations where the block
> group is read-only.
>
> However, if there is any chance at all of racing with mark_bg_unused,
> or a different future user of bg_list, better to be safe than sorry.
>
> Otherwise we risk the following interleaving (bg_list refcount in parens)
>
> T1 (some random op)                       T2 (btrfs_mark_bg_unused)
>                                         !list_empty(&bg->bg_list); (1)
> list_del_init(&bg->bg_list); (1)
>                                         list_move_tail (1)
> btrfs_put_block_group (0)
>                                         btrfs_delete_unused_bgs
>                                              bg = list_first_entry
>                                              list_del_init(&bg->bg_list);
>                                              btrfs_put_block_group(bg); (-1)
>
> Ultimately, this results in a broken ref count that hits zero one deref
> early and the real final deref underflows the refcount, resulting in a WARNING.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good now, thanks.

> ---
>  fs/btrfs/extent-tree.c |  8 ++++++++
>  fs/btrfs/transaction.c | 12 ++++++++++++
>  2 files changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5de1a1293c93..5ead2f4976e4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2868,7 +2868,15 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>                                                    block_group->length,
>                                                    &trimmed);
>
> +               /*
> +                * Not strictly necessary to lock, as the block_group should be
> +                * read-only from btrfs_delete_unused_bgs.
> +                */
> +               ASSERT(block_group->ro);
> +               spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
> +               spin_unlock(&fs_info->unused_bgs_lock);
> +
>                 btrfs_unfreeze_block_group(block_group);
>                 btrfs_put_block_group(block_group);
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index db8fe291d010..470dfc3a1a5c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -160,7 +160,13 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
>                         cache = list_first_entry(&transaction->deleted_bgs,
>                                                  struct btrfs_block_group,
>                                                  bg_list);
> +                       /*
> +                        * Not strictly necessary to lock, as no other task will be using a
> +                        * block_group on the deleted_bgs list during a transaction abort.
> +                        */
> +                       spin_lock(&transaction->fs_info->unused_bgs_lock);
>                         list_del_init(&cache->bg_list);
> +                       spin_unlock(&transaction->fs_info->unused_bgs_lock);
>                         btrfs_unfreeze_block_group(cache);
>                         btrfs_put_block_group(cache);
>                 }
> @@ -2096,7 +2102,13 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
>
>         list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
>                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> +               /*
> +               * Not strictly necessary to lock, as no other task will be using a
> +               * block_group on the new_bgs list during a transaction abort.
> +               */
> +              spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
> +              spin_unlock(&fs_info->unused_bgs_lock);
>         }
>  }
>
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-10 20:07 ` [PATCH v2 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
@ 2025-03-11 19:02   ` David Sterba
  0 siblings, 0 replies; 9+ messages in thread
From: David Sterba @ 2025-03-11 19:02 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Mon, Mar 10, 2025 at 01:07:01PM -0700, Boris Burkov wrote:
> Block group creation is done in two phases, which results in a slightly
> unintuitive property: a block group can have extents allocated from it
> (or freed back to it) after btrfs_make_block_group adds it to the
> space_info with btrfs_add_bg_to_space_info, but before creation fully completes

Please write function references with the (), this makes it clear when
there are other identifiers like struct names or variables.

> in btrfs_create_pending_block_groups. As a result, it is possible for a
> block group to go unused and have 'btrfs_mark_bg_unused' called on it
> concurrently with 'btrfs_create_pending_block_groups'. This causes a
> number of issues, which were fixed with the block group flag
> 'BLOCK_GROUP_FLAG_NEW'.
> 
> However, this fix is not quite complete. Since it does not use the
> unused_bgs_lock, it is possible for the following race to occur:
> 
> btrfs_create_pending_block_groups            btrfs_mark_bg_unused
>                                            if list_empty // false
>         list_del_init
>         clear_bit
>                                            else if (test_bit) // true
>                                                 list_move_tail
> 
> And we get into the exact same broken ref count and invalid new_bgs
> state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
> prevent.
> 
> The broken refcount aspect will result in a warning like:
> [ 1272.943113] ------------[ cut here ]------------
> [ 1272.943527] refcount_t: underflow; use-after-free.
> [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> [ 1272.946368] Tainted: [W]=WARN
> [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7

Please remove the Code: line unless it's really needed for the fix.

I've fixed it for-next.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2 5/5] btrfs: codify pattern for adding block_group to bg_list
  2025-03-10 20:07 ` [PATCH v2 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
@ 2025-03-11 19:03   ` David Sterba
  0 siblings, 0 replies; 9+ messages in thread
From: David Sterba @ 2025-03-11 19:03 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Mon, Mar 10, 2025 at 01:07:05PM -0700, Boris Burkov wrote:
> Similar to mark_bg_unused and mark_bg_to_reclaim, we have a few places
> that use bg_list with refcounting, mostly for retrying failures to
> reclaim/delete unused.
> 
> These have custom logic for handling locking and refcounting the bg_list
> properly, but they actually all want to do the same thing, so pull that
> logic out into a helper. Unfortunately, mark_bg_unused does still need
> the NEW flag to avoid prematurely marking stuff unused (even if refcount
> is fine, we don't want to mess with bg creation), so it cannot use the
> new helper.
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/block-group.c | 54 +++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index e4071897c9a8..5b13e086c7a8 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1455,6 +1455,31 @@ static bool clean_pinned_extents(struct btrfs_trans_handle *trans,
>  	return ret == 0;
>  }
>  
> +/*
> + * Link the block_group to a list via bg_list.
> + *
> + * Use this rather than list_add_tail directly to ensure proper respect
> + * to locking and refcounting.
> + *
> + * @bg: the block_group to link to the list.
> + * @list: the list to link it to.
> + * Returns: true if the bg was linked with a refcount bump and false otherwise.

A minor thing, in the function comment the parameter descriptions are
below the first line. Fixed in for-next.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-11 19:03 UTC | newest]
