[PATCH 0/5] btrfs: block_group refcounting fixes

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] btrfs: block_group refcounting fixes
@ 2025-03-07  0:29 Boris Burkov
  2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We have observed a number of WARNINGs in the Meta fleet which are the
result of a block_group refcount underflowing. The refcount error
can happen at any point in the block group's lifetime, so it is hard to
conclude that we have reproduced/fixed all the bugs, I believe I have
found a few here that will hopefully improve things.

The main thrust of this patch series is that we need to take the
fs_info->unused_bgs_lock spin lock when modifying the bg_list of a
block_group. There are a number of code paths where we atomically check
that list_head for emptiness and then add/del get/put appropriately.
If any other thread messes with it in between without locking, then that
logic gets messed up. This is most obviously evident with
mark_bg_unused.

I could imagine universally protecting bg_list's empty/not-empty nature
with a lock with smaller scope, but this is already the locking strategy
being used to synchronize reclaim/unused lists, so it seems reasonable
to just re-use it.

In addition, I attempted to simplify the refcounting logic in the
discard workfn, as the last time I fixed a bug in there, I made it far
too subtle. Hopefully this more explicit variant is easier to analyze in
the future.

Boris Burkov (5):
  btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  btrfs: fix bg->bg_list list_del refcount races
  btrfs: make discard_workfn block_group ref explicit
  btrfs: explicitly ref count block_group on new_bgs list
  btrfs: codify pattern for adding block_group to bg_list

 fs/btrfs/block-group.c | 57 +++++++++++++++++++++++++-----------------
 fs/btrfs/discard.c     | 34 ++++++++++++-------------
 fs/btrfs/extent-tree.c |  3 +++
 fs/btrfs/transaction.c |  5 ++++
 4 files changed, 58 insertions(+), 41 deletions(-)

-- 
2.48.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
@ 2025-03-07  0:29 ` Boris Burkov
  2025-03-07  6:51   ` Qu Wenruo
  2025-03-07 14:13   ` Filipe Manana
  2025-03-07  0:29 ` [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races Boris Burkov
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

To avoid a race where mark_bg_unused spuriously "moved" the block_group
from one bg_list attachment to another without taking a ref, we mark a
new block group with the bit BLOCK_GROUP_FLAG_NEW.

However, this fix is not quite complete. Since it does not use the
unused_bg_lock, it is possible for the following race to occur:

create_pending_block_groups                     mark_bg_unused
                                           if list_empty // false
        list_del_init
        clear_bit
                                           else if (test_bit) // true
                                                list_move_tail

And we get into the exact same broken ref count situation.
Those look something like:
[ 1272.943113] ------------[ cut here ]------------
[ 1272.943527] refcount_t: underflow; use-after-free.
[ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
[ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
[ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
[ 1272.946368] Tainted: [W]=WARN
[ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
[ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
[ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
[ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
[ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
[ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
[ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
[ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
[ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
[ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
[ 1272.954474] Call Trace:
[ 1272.954655]  <TASK>
[ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
[ 1272.955173]  ? __warn.cold+0x93/0xd7
[ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
[ 1272.955816]  ? report_bug+0xe7/0x120
[ 1272.956103]  ? handle_bug+0x53/0x90
[ 1272.956424]  ? exc_invalid_op+0x13/0x60
[ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
[ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
[ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
[ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
[ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
[ 1272.958729]  process_one_work+0x130/0x290
[ 1272.959026]  worker_thread+0x2ea/0x420
[ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
[ 1272.959644]  kthread+0xd7/0x1c0
[ 1272.959872]  ? __pfx_kthread+0x10/0x10
[ 1272.960172]  ret_from_fork+0x30/0x50
[ 1272.960474]  ? __pfx_kthread+0x10/0x10
[ 1272.960745]  ret_from_fork_asm+0x1a/0x30
[ 1272.961035]  </TASK>
[ 1272.961238] ---[ end trace 0000000000000000 ]---

Though we have seen them in the async discard workfn as well. It is
most likely to happen after a relocation finishes which cancels discard,
tears down the block group, etc.

Fix this fully by taking the lock around the list_del_init + clear_bit
so that the two are done atomically.

Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 64f0268dcf02..2db1497b58d9 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		/* Already aborted the transaction if it failed. */
 next:
 		btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
+
+		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
+		spin_unlock(&fs_info->unused_bgs_lock);
 
 		/*
 		 * If the block group is still unused, add it to the list of
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
@ 2025-03-07  6:51   ` Qu Wenruo
  2025-03-07 14:13   ` Filipe Manana
  1 sibling, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2025-03-07  6:51 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team



在 2025/3/7 10:59, Boris Burkov 写道:
> To avoid a race where mark_bg_unused spuriously "moved" the block_group
> from one bg_list attachment to another without taking a ref, we mark a
> new block group with the bit BLOCK_GROUP_FLAG_NEW.
> 
> However, this fix is not quite complete. Since it does not use the
> unused_bg_lock, it is possible for the following race to occur:
> 
> create_pending_block_groups                     mark_bg_unused
>                                             if list_empty // false
>          list_del_init
>          clear_bit
>                                             else if (test_bit) // true
>                                                  list_move_tail
> 
> And we get into the exact same broken ref count situation.
> Those look something like:
> [ 1272.943113] ------------[ cut here ]------------
> [ 1272.943527] refcount_t: underflow; use-after-free.
> [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> [ 1272.946368] Tainted: [W]=WARN
> [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
> [ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
> [ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
> [ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
> [ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
> [ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
> [ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
> [ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
> [ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
> [ 1272.954474] Call Trace:
> [ 1272.954655]  <TASK>
> [ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.955173]  ? __warn.cold+0x93/0xd7
> [ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.955816]  ? report_bug+0xe7/0x120
> [ 1272.956103]  ? handle_bug+0x53/0x90
> [ 1272.956424]  ? exc_invalid_op+0x13/0x60
> [ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
> [ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
> [ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
> [ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
> [ 1272.958729]  process_one_work+0x130/0x290
> [ 1272.959026]  worker_thread+0x2ea/0x420
> [ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
> [ 1272.959644]  kthread+0xd7/0x1c0
> [ 1272.959872]  ? __pfx_kthread+0x10/0x10
> [ 1272.960172]  ret_from_fork+0x30/0x50
> [ 1272.960474]  ? __pfx_kthread+0x10/0x10
> [ 1272.960745]  ret_from_fork_asm+0x1a/0x30
> [ 1272.961035]  </TASK>
> [ 1272.961238] ---[ end trace 0000000000000000 ]---
> 
> Though we have seen them in the async discard workfn as well. It is
> most likely to happen after a relocation finishes which cancels discard,
> tears down the block group, etc.
> 
> Fix this fully by taking the lock around the list_del_init + clear_bit
> so that the two are done atomically.
> 
> Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/block-group.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 64f0268dcf02..2db1497b58d9 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>   		/* Already aborted the transaction if it failed. */
>   next:
>   		btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> +
> +		spin_lock(&fs_info->unused_bgs_lock);
>   		list_del_init(&block_group->bg_list);
>   		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> +		spin_unlock(&fs_info->unused_bgs_lock);
>   
>   		/*
>   		 * If the block group is still unused, add it to the list of


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
  2025-03-07  6:51   ` Qu Wenruo
@ 2025-03-07 14:13   ` Filipe Manana
  2025-03-07 21:32     ` Boris Burkov
  1 sibling, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2025-03-07 14:13 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
>
> To avoid a race where mark_bg_unused spuriously "moved" the block_group
> from one bg_list attachment to another without taking a ref, we mark a
> new block group with the bit BLOCK_GROUP_FLAG_NEW.
>
> However, this fix is not quite complete. Since it does not use the
> unused_bg_lock, it is possible for the following race to occur:
>
> create_pending_block_groups                     mark_bg_unused

mark_bg_unused -> btrfs_mark_bg_unused

>                                            if list_empty // false
>         list_del_init
>         clear_bit
>                                            else if (test_bit) // true
>                                                 list_move_tail

This should mention how that sequence is possible, i.e. on a higher level.

For example the task that created the block group ended up not
allocating extents from it,
and other tasks allocated extents from it and deallocated so that the
block group became empty
and was added to the unused list before the task that created it
finished btrfs_create_pending_block_groups().

Or was it some other scenario?

Thanks.

>
> And we get into the exact same broken ref count situation.
> Those look something like:
> [ 1272.943113] ------------[ cut here ]------------
> [ 1272.943527] refcount_t: underflow; use-after-free.
> [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> [ 1272.946368] Tainted: [W]=WARN
> [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
> [ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
> [ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
> [ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
> [ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
> [ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
> [ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
> [ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
> [ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
> [ 1272.954474] Call Trace:
> [ 1272.954655]  <TASK>
> [ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.955173]  ? __warn.cold+0x93/0xd7
> [ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.955816]  ? report_bug+0xe7/0x120
> [ 1272.956103]  ? handle_bug+0x53/0x90
> [ 1272.956424]  ? exc_invalid_op+0x13/0x60
> [ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
> [ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
> [ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
> [ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
> [ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
> [ 1272.958729]  process_one_work+0x130/0x290
> [ 1272.959026]  worker_thread+0x2ea/0x420
> [ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
> [ 1272.959644]  kthread+0xd7/0x1c0
> [ 1272.959872]  ? __pfx_kthread+0x10/0x10
> [ 1272.960172]  ret_from_fork+0x30/0x50
> [ 1272.960474]  ? __pfx_kthread+0x10/0x10
> [ 1272.960745]  ret_from_fork_asm+0x1a/0x30
> [ 1272.961035]  </TASK>
> [ 1272.961238] ---[ end trace 0000000000000000 ]---
>
> Though we have seen them in the async discard workfn as well. It is
> most likely to happen after a relocation finishes which cancels discard,
> tears down the block group, etc.
>
> Fix this fully by taking the lock around the list_del_init + clear_bit
> so that the two are done atomically.
>
> Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/block-group.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 64f0268dcf02..2db1497b58d9 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>                 /* Already aborted the transaction if it failed. */
>  next:
>                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> +
> +               spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
>                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> +               spin_unlock(&fs_info->unused_bgs_lock);
>
>                 /*
>                  * If the block group is still unused, add it to the list of
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-07 14:13   ` Filipe Manana
@ 2025-03-07 21:32     ` Boris Burkov
  2025-03-10 12:41       ` Filipe Manana
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2025-03-07 21:32 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Fri, Mar 07, 2025 at 02:13:15PM +0000, Filipe Manana wrote:
> On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> >
> > To avoid a race where mark_bg_unused spuriously "moved" the block_group
> > from one bg_list attachment to another without taking a ref, we mark a
> > new block group with the bit BLOCK_GROUP_FLAG_NEW.
> >
> > However, this fix is not quite complete. Since it does not use the
> > unused_bg_lock, it is possible for the following race to occur:
> >
> > create_pending_block_groups                     mark_bg_unused
> 
> mark_bg_unused -> btrfs_mark_bg_unused
> 
> >                                            if list_empty // false
> >         list_del_init
> >         clear_bit
> >                                            else if (test_bit) // true
> >                                                 list_move_tail
> 
> This should mention how that sequence is possible, i.e. on a higher level.
> 
> For example the task that created the block group ended up not
> allocating extents from it,
> and other tasks allocated extents from it and deallocated so that the
> block group became empty
> and was added to the unused list before the task that created it
> finished btrfs_create_pending_block_groups().
> 
> Or was it some other scenario?

To be honest, since the detection of the error is so non-local to the
time of the error with these refcounting issues, I don't have any
proof that exactly this is happening. I just have a bunch of hosts with
wrong refcounts detected after a relocation and started squinting at
places it could have gone wrong.

Your original patch and the very existence of the BLOCK_GROUP_FLAG_NEW
flag suggest to me that the two running concurrently is expected.

Would you like me to attempt to produce this condition on the current
kernel? Or I can duplicate/link to some of the reasoning from your first
fix here so that this commit message tells a better self-contained story?

Thanks,
Boris

> 
> Thanks.
> 
> >
> > And we get into the exact same broken ref count situation.
> > Those look something like:
> > [ 1272.943113] ------------[ cut here ]------------
> > [ 1272.943527] refcount_t: underflow; use-after-free.
> > [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> > [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> > [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> > [ 1272.946368] Tainted: [W]=WARN
> > [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> > [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> > [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
> > [ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
> > [ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
> > [ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
> > [ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
> > [ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
> > [ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
> > [ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
> > [ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
> > [ 1272.954474] Call Trace:
> > [ 1272.954655]  <TASK>
> > [ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
> > [ 1272.955173]  ? __warn.cold+0x93/0xd7
> > [ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
> > [ 1272.955816]  ? report_bug+0xe7/0x120
> > [ 1272.956103]  ? handle_bug+0x53/0x90
> > [ 1272.956424]  ? exc_invalid_op+0x13/0x60
> > [ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
> > [ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
> > [ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
> > [ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
> > [ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
> > [ 1272.958729]  process_one_work+0x130/0x290
> > [ 1272.959026]  worker_thread+0x2ea/0x420
> > [ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
> > [ 1272.959644]  kthread+0xd7/0x1c0
> > [ 1272.959872]  ? __pfx_kthread+0x10/0x10
> > [ 1272.960172]  ret_from_fork+0x30/0x50
> > [ 1272.960474]  ? __pfx_kthread+0x10/0x10
> > [ 1272.960745]  ret_from_fork_asm+0x1a/0x30
> > [ 1272.961035]  </TASK>
> > [ 1272.961238] ---[ end trace 0000000000000000 ]---
> >
> > Though we have seen them in the async discard workfn as well. It is
> > most likely to happen after a relocation finishes which cancels discard,
> > tears down the block group, etc.
> >
> > Fix this fully by taking the lock around the list_del_init + clear_bit
> > so that the two are done atomically.
> >
> > Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/block-group.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 64f0268dcf02..2db1497b58d9 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> >                 /* Already aborted the transaction if it failed. */
> >  next:
> >                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> > +
> > +               spin_lock(&fs_info->unused_bgs_lock);
> >                 list_del_init(&block_group->bg_list);
> >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > +               spin_unlock(&fs_info->unused_bgs_lock);
> >
> >                 /*
> >                  * If the block group is still unused, add it to the list of
> > --
> > 2.48.1
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-07 21:32     ` Boris Burkov
@ 2025-03-10 12:41       ` Filipe Manana
  2025-03-10 19:28         ` Boris Burkov
  0 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2025-03-10 12:41 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 9:31 PM Boris Burkov <boris@bur.io> wrote:
>
> On Fri, Mar 07, 2025 at 02:13:15PM +0000, Filipe Manana wrote:
> > On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > To avoid a race where mark_bg_unused spuriously "moved" the block_group
> > > from one bg_list attachment to another without taking a ref, we mark a
> > > new block group with the bit BLOCK_GROUP_FLAG_NEW.
> > >
> > > However, this fix is not quite complete. Since it does not use the
> > > unused_bg_lock, it is possible for the following race to occur:
> > >
> > > create_pending_block_groups                     mark_bg_unused
> >
> > mark_bg_unused -> btrfs_mark_bg_unused
> >
> > >                                            if list_empty // false
> > >         list_del_init
> > >         clear_bit
> > >                                            else if (test_bit) // true
> > >                                                 list_move_tail
> >
> > This should mention how that sequence is possible, i.e. on a higher level.
> >
> > For example the task that created the block group ended up not
> > allocating extents from it,
> > and other tasks allocated extents from it and deallocated so that the
> > block group became empty
> > and was added to the unused list before the task that created it
> > finished btrfs_create_pending_block_groups().
> >
> > Or was it some other scenario?
>
> To be honest, since the detection of the error is so non-local to the
> time of the error with these refcounting issues, I don't have any
> proof that exactly this is happening. I just have a bunch of hosts with
> wrong refcounts detected after a relocation and started squinting at
> places it could have gone wrong.
>
> Your original patch and the very existence of the BLOCK_GROUP_FLAG_NEW
> flag suggest to me that the two running concurrently is expected.

Well, even before that flag, there were problems already, but not so
easy to analyze because list_del_init(), list_empty() and
list_add_tail() are not atomic operations.

>
> Would you like me to attempt to produce this condition on the current
> kernel? Or I can duplicate/link to some of the reasoning from your first
> fix here so that this commit message tells a better self-contained story?

You don't need to go trigger it and trace exactly how it happens, but
mentioning a hypothetical scenario is enough.

Even before finishing the creation of a new block group, other tasks
can allocate and deallocate extents from it, and on deallocation when
it becomes empty we may add the new block group to the unused list.
This is possible because during the lifetime of the transaction, we
often flush (trigger writeback) dirty extent buffers because we
reached the BTRFS_DIRTY_METADATA_THRESH limit or we're under memory
pressure, so btree modifications on the extent buffers will be forced
to COW, freeing the extent buffers previously allocated in the new
block group and allocating new ones in some other block group, so that
the new one becomes empty and is added to the unused list.

With that:
Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> Thanks,
> Boris
>
> >
> > Thanks.
> >
> > >
> > > And we get into the exact same broken ref count situation.
> > > Those look something like:
> > > [ 1272.943113] ------------[ cut here ]------------
> > > [ 1272.943527] refcount_t: underflow; use-after-free.
> > > [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> > > [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> > > [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> > > [ 1272.946368] Tainted: [W]=WARN
> > > [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > > [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> > > [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> > > [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
> > > [ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
> > > [ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
> > > [ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
> > > [ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
> > > [ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
> > > [ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
> > > [ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
> > > [ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
> > > [ 1272.954474] Call Trace:
> > > [ 1272.954655]  <TASK>
> > > [ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
> > > [ 1272.955173]  ? __warn.cold+0x93/0xd7
> > > [ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
> > > [ 1272.955816]  ? report_bug+0xe7/0x120
> > > [ 1272.956103]  ? handle_bug+0x53/0x90
> > > [ 1272.956424]  ? exc_invalid_op+0x13/0x60
> > > [ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
> > > [ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
> > > [ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
> > > [ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
> > > [ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
> > > [ 1272.958729]  process_one_work+0x130/0x290
> > > [ 1272.959026]  worker_thread+0x2ea/0x420
> > > [ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
> > > [ 1272.959644]  kthread+0xd7/0x1c0
> > > [ 1272.959872]  ? __pfx_kthread+0x10/0x10
> > > [ 1272.960172]  ret_from_fork+0x30/0x50
> > > [ 1272.960474]  ? __pfx_kthread+0x10/0x10
> > > [ 1272.960745]  ret_from_fork_asm+0x1a/0x30
> > > [ 1272.961035]  </TASK>
> > > [ 1272.961238] ---[ end trace 0000000000000000 ]---
> > >
> > > Though we have seen them in the async discard workfn as well. It is
> > > most likely to happen after a relocation finishes which cancels discard,
> > > tears down the block group, etc.
> > >
> > > Fix this fully by taking the lock around the list_del_init + clear_bit
> > > so that the two are done atomically.
> > >
> > > Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/block-group.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > index 64f0268dcf02..2db1497b58d9 100644
> > > --- a/fs/btrfs/block-group.c
> > > +++ b/fs/btrfs/block-group.c
> > > @@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> > >                 /* Already aborted the transaction if it failed. */
> > >  next:
> > >                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> > > +
> > > +               spin_lock(&fs_info->unused_bgs_lock);
> > >                 list_del_init(&block_group->bg_list);
> > >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > > +               spin_unlock(&fs_info->unused_bgs_lock);
> > >
> > >                 /*
> > >                  * If the block group is still unused, add it to the list of
> > > --
> > > 2.48.1
> > >
> > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups
  2025-03-10 12:41       ` Filipe Manana
@ 2025-03-10 19:28         ` Boris Burkov
  0 siblings, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-10 19:28 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Mon, Mar 10, 2025 at 12:41:20PM +0000, Filipe Manana wrote:
> On Fri, Mar 7, 2025 at 9:31 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Fri, Mar 07, 2025 at 02:13:15PM +0000, Filipe Manana wrote:
> > > On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> > > >
> > > > To avoid a race where mark_bg_unused spuriously "moved" the block_group
> > > > from one bg_list attachment to another without taking a ref, we mark a
> > > > new block group with the bit BLOCK_GROUP_FLAG_NEW.
> > > >
> > > > However, this fix is not quite complete. Since it does not use the
> > > > unused_bg_lock, it is possible for the following race to occur:
> > > >
> > > > create_pending_block_groups                     mark_bg_unused
> > >
> > > mark_bg_unused -> btrfs_mark_bg_unused
> > >
> > > >                                            if list_empty // false
> > > >         list_del_init
> > > >         clear_bit
> > > >                                            else if (test_bit) // true
> > > >                                                 list_move_tail
> > >
> > > This should mention how that sequence is possible, i.e. on a higher level.
> > >
> > > For example the task that created the block group ended up not
> > > allocating extents from it,
> > > and other tasks allocated extents from it and deallocated so that the
> > > block group became empty
> > > and was added to the unused list before the task that created it
> > > finished btrfs_create_pending_block_groups().
> > >
> > > Or was it some other scenario?
> >
> > To be honest, since the detection of the error is so non-local to the
> > time of the error with these refcounting issues, I don't have any
> > proof that exactly this is happening. I just have a bunch of hosts with
> > wrong refcounts detected after a relocation and started squinting at
> > places it could have gone wrong.
> >
> > Your original patch and the very existence of the BLOCK_GROUP_FLAG_NEW
> > flag suggest to me that the two running concurrently is expected.
> 
> Well, even before that flag, there were problems already, but not so
> easy to analyze because list_del_init(), list_empty() and
> list_add_tail() are not atomic operations.
> 

It's funny because at first I was worried this was going to be some kind
of "we can't guarantee the order of the bit set vs the list empty"
re-ordering optimization kind of thing before I realized it just needed
a lock. Grateful it was the simpler case :)

> >
> > Would you like me to attempt to produce this condition on the current
> > kernel? Or I can duplicate/link to some of the reasoning from your first
> > fix here so that this commit message tells a better self-contained story?
> 
> You don't need to go trigger it and trace exactly how it happens, but
> mentioning a hypothetical scenario is enough.
> 
> Even before finishing the creation of a new block group, other tasks
> can allocate and deallocate extents from it, and on deallocation when
> it becomes empty we may add the new block group to the unused list.
> This is possible because during the lifetime of the transaction, we
> often flush (trigger writeback) dirty extent buffers because we
> reached the BTRFS_DIRTY_METADATA_THRESH limit or we're under memory
> pressure, so btree modifications on the extent buffers will be forced
> to COW, freeing the extent buffers previously allocated in the new
> block group and allocating new ones in some other block group, so that
> the new one becomes empty and is added to the unused list.

By the way, do you have a good argument why we don't need to check the
flag in btrfs_mark_bg_to_reclaim? It seems similar enough at first
blush, especially if you imagine a very low reclaim threshold, like 1%.

I'll look into your original patch more closely and think harder about
it, but also curious what you think.

> 
> With that:
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> Thanks.
> 
> >
> > Thanks,
> > Boris
> >
> > >
> > > Thanks.
> > >
> > > >
> > > > And we get into the exact same broken ref count situation.
> > > > Those look something like:
> > > > [ 1272.943113] ------------[ cut here ]------------
> > > > [ 1272.943527] refcount_t: underflow; use-after-free.
> > > > [ 1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
> > > > [ 1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
> > > > [ 1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
> > > > [ 1272.946368] Tainted: [W]=WARN
> > > > [ 1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > > > [ 1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
> > > > [ 1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
> > > > [ 1272.948180] Code: 01 01 e8 e9 c7 a9 ff 0f 0b c3 cc cc cc cc 80 3d 3f 4a de 01 00 75 85 48 c7 c7 00 9b 9f 8f c6 05 2f 4a de 01 01 e8 c6 c7 a9 ff <0f> 0b c3 cc cc cc cc 80 3d 1d 4a de 01 00 0f 85 5e ff ff ff 48 c7
> > > > [ 1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
> > > > [ 1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
> > > > [ 1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
> > > > [ 1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
> > > > [ 1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
> > > > [ 1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
> > > > [ 1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
> > > > [ 1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [ 1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
> > > > [ 1272.954474] Call Trace:
> > > > [ 1272.954655]  <TASK>
> > > > [ 1272.954812]  ? refcount_warn_saturate+0xba/0x110
> > > > [ 1272.955173]  ? __warn.cold+0x93/0xd7
> > > > [ 1272.955487]  ? refcount_warn_saturate+0xba/0x110
> > > > [ 1272.955816]  ? report_bug+0xe7/0x120
> > > > [ 1272.956103]  ? handle_bug+0x53/0x90
> > > > [ 1272.956424]  ? exc_invalid_op+0x13/0x60
> > > > [ 1272.956700]  ? asm_exc_invalid_op+0x16/0x20
> > > > [ 1272.957011]  ? refcount_warn_saturate+0xba/0x110
> > > > [ 1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
> > > > [ 1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
> > > > [ 1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
> > > > [ 1272.958729]  process_one_work+0x130/0x290
> > > > [ 1272.959026]  worker_thread+0x2ea/0x420
> > > > [ 1272.959335]  ? __pfx_worker_thread+0x10/0x10
> > > > [ 1272.959644]  kthread+0xd7/0x1c0
> > > > [ 1272.959872]  ? __pfx_kthread+0x10/0x10
> > > > [ 1272.960172]  ret_from_fork+0x30/0x50
> > > > [ 1272.960474]  ? __pfx_kthread+0x10/0x10
> > > > [ 1272.960745]  ret_from_fork_asm+0x1a/0x30
> > > > [ 1272.961035]  </TASK>
> > > > [ 1272.961238] ---[ end trace 0000000000000000 ]---
> > > >
> > > > Though we have seen them in the async discard workfn as well. It is
> > > > most likely to happen after a relocation finishes which cancels discard,
> > > > tears down the block group, etc.
> > > >
> > > > Fix this fully by taking the lock around the list_del_init + clear_bit
> > > > so that the two are done atomically.
> > > >
> > > > Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >  fs/btrfs/block-group.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > > index 64f0268dcf02..2db1497b58d9 100644
> > > > --- a/fs/btrfs/block-group.c
> > > > +++ b/fs/btrfs/block-group.c
> > > > @@ -2797,8 +2797,11 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> > > >                 /* Already aborted the transaction if it failed. */
> > > >  next:
> > > >                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> > > > +
> > > > +               spin_lock(&fs_info->unused_bgs_lock);
> > > >                 list_del_init(&block_group->bg_list);
> > > >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > > > +               spin_unlock(&fs_info->unused_bgs_lock);
> > > >
> > > >                 /*
> > > >                  * If the block group is still unused, add it to the list of
> > > > --
> > > > 2.48.1
> > > >
> > > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races
  2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
  2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
@ 2025-03-07  0:29 ` Boris Burkov
  2025-03-07  6:52   ` Qu Wenruo
  2025-03-07 14:24   ` Filipe Manana
  2025-03-07  0:29 ` [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If there is any chance at all of racing with mark_bg_unused, better safe
than sorry.

Otherwise we risk the following interleaving (bg_list refcount in parens)

T1 (some random op)                         T2 (mark_bg_unused)
                                        !list_empty(&bg->bg_list); (1)
list_del_init(&bg->bg_list); (1)
                                        list_move_tail (1)
btrfs_put_block_group (0)
                                        btrfs_delete_unused_bgs
                                             bg = list_first_entry
                                             list_del_init(&bg->bg_list);
                                             btrfs_put_block_group(bg); (-1)

Ultimately, this results in a broken ref count that hits zero one deref
early and the real final deref underflows the refcount, resulting in a WARNING.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/extent-tree.c | 3 +++
 fs/btrfs/transaction.c | 5 +++++
 2 files changed, 8 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5de1a1293c93..80560065cc1b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2868,7 +2868,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
 						   block_group->length,
 						   &trimmed);
 
+		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
+		spin_unlock(&fs_info->unused_bgs_lock);
+
 		btrfs_unfreeze_block_group(block_group);
 		btrfs_put_block_group(block_group);
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index db8fe291d010..c98a8efd1bea 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -160,7 +160,9 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
 			cache = list_first_entry(&transaction->deleted_bgs,
 						 struct btrfs_block_group,
 						 bg_list);
+			spin_lock(&transaction->fs_info->unused_bgs_lock);
 			list_del_init(&cache->bg_list);
+			spin_unlock(&transaction->fs_info->unused_bgs_lock);
 			btrfs_unfreeze_block_group(cache);
 			btrfs_put_block_group(cache);
 		}
@@ -2096,7 +2098,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
 
        list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
                btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
+	       spin_lock(&fs_info->unused_bgs_lock);
                list_del_init(&block_group->bg_list);
+	       btrfs_put_block_group(block_group);
+	       spin_unlock(&fs_info->unused_bgs_lock);
        }
 }
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races
  2025-03-07  0:29 ` [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races Boris Burkov
@ 2025-03-07  6:52   ` Qu Wenruo
  2025-03-07 14:24   ` Filipe Manana
  1 sibling, 0 replies; 21+ messages in thread
From: Qu Wenruo @ 2025-03-07  6:52 UTC (permalink / raw)
  To: Boris Burkov, linux-btrfs, kernel-team



在 2025/3/7 10:59, Boris Burkov 写道:
> If there is any chance at all of racing with mark_bg_unused, better safe
> than sorry.
> 
> Otherwise we risk the following interleaving (bg_list refcount in parens)
> 
> T1 (some random op)                         T2 (mark_bg_unused)
>                                          !list_empty(&bg->bg_list); (1)
> list_del_init(&bg->bg_list); (1)
>                                          list_move_tail (1)
> btrfs_put_block_group (0)
>                                          btrfs_delete_unused_bgs
>                                               bg = list_first_entry
>                                               list_del_init(&bg->bg_list);
>                                               btrfs_put_block_group(bg); (-1)
> 
> Ultimately, this results in a broken ref count that hits zero one deref
> early and the real final deref underflows the refcount, resulting in a WARNING.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/extent-tree.c | 3 +++
>   fs/btrfs/transaction.c | 5 +++++
>   2 files changed, 8 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5de1a1293c93..80560065cc1b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2868,7 +2868,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>   						   block_group->length,
>   						   &trimmed);
>   
> +		spin_lock(&fs_info->unused_bgs_lock);
>   		list_del_init(&block_group->bg_list);
> +		spin_unlock(&fs_info->unused_bgs_lock);
> +
>   		btrfs_unfreeze_block_group(block_group);
>   		btrfs_put_block_group(block_group);
>   
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index db8fe291d010..c98a8efd1bea 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -160,7 +160,9 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
>   			cache = list_first_entry(&transaction->deleted_bgs,
>   						 struct btrfs_block_group,
>   						 bg_list);
> +			spin_lock(&transaction->fs_info->unused_bgs_lock);
>   			list_del_init(&cache->bg_list);
> +			spin_unlock(&transaction->fs_info->unused_bgs_lock);
>   			btrfs_unfreeze_block_group(cache);
>   			btrfs_put_block_group(cache);
>   		}
> @@ -2096,7 +2098,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
>   
>          list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
>                  btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> +	       spin_lock(&fs_info->unused_bgs_lock);
>                  list_del_init(&block_group->bg_list);
> +	       btrfs_put_block_group(block_group);
> +	       spin_unlock(&fs_info->unused_bgs_lock);
>          }
>   }
>   


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races
  2025-03-07  0:29 ` [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races Boris Burkov
  2025-03-07  6:52   ` Qu Wenruo
@ 2025-03-07 14:24   ` Filipe Manana
  2025-03-07 21:37     ` Boris Burkov
  1 sibling, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2025-03-07 14:24 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
>
> If there is any chance at all of racing with mark_bg_unused, better safe
> than sorry.
>
> Otherwise we risk the following interleaving (bg_list refcount in parens)
>
> T1 (some random op)                         T2 (mark_bg_unused)

mark_bg_unused -> btrfs_mark_bg_unused

Please use complete names everywhere.

>                                         !list_empty(&bg->bg_list); (1)
> list_del_init(&bg->bg_list); (1)
>                                         list_move_tail (1)
> btrfs_put_block_group (0)
>                                         btrfs_delete_unused_bgs
>                                              bg = list_first_entry
>                                              list_del_init(&bg->bg_list);
>                                              btrfs_put_block_group(bg); (-1)
>
> Ultimately, this results in a broken ref count that hits zero one deref
> early and the real final deref underflows the refcount, resulting in a WARNING.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/extent-tree.c | 3 +++
>  fs/btrfs/transaction.c | 5 +++++
>  2 files changed, 8 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5de1a1293c93..80560065cc1b 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2868,7 +2868,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
>                                                    block_group->length,
>                                                    &trimmed);
>
> +               spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
> +               spin_unlock(&fs_info->unused_bgs_lock);

This shouldn't need the lock, because block groups added to the
transaction->deleted_bgs list were made RO only before at
btrfs_delete_unused_bgs().

> +
>                 btrfs_unfreeze_block_group(block_group);
>                 btrfs_put_block_group(block_group);
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index db8fe291d010..c98a8efd1bea 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -160,7 +160,9 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
>                         cache = list_first_entry(&transaction->deleted_bgs,
>                                                  struct btrfs_block_group,
>                                                  bg_list);
> +                       spin_lock(&transaction->fs_info->unused_bgs_lock);
>                         list_del_init(&cache->bg_list);
> +                       spin_unlock(&transaction->fs_info->unused_bgs_lock);

In the transaction abort path no else should be messing up with the list too.

>                         btrfs_unfreeze_block_group(cache);
>                         btrfs_put_block_group(cache);
>                 }
> @@ -2096,7 +2098,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
>
>         list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
>                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> +              spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
> +              btrfs_put_block_group(block_group);
> +              spin_unlock(&fs_info->unused_bgs_lock);

What's this new btrfs_put_block_group() for? I don't see a matching
ref count increase in the patch.
Or is this fixing a separate bug? Where's the matching ref count
increase in the codebase?

As before, we're in the transaction abort path, no one should be
messing with the list too.

I don't mind adding the locking just to be safe, but leaving a comment
to mention it shouldn't be needed and why there's concurrency expected
in these cases would be nice.

Thanks.

>         }
>  }

>
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races
  2025-03-07 14:24   ` Filipe Manana
@ 2025-03-07 21:37     ` Boris Burkov
  2025-03-10 12:47       ` Filipe Manana
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2025-03-07 21:37 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Fri, Mar 07, 2025 at 02:24:34PM +0000, Filipe Manana wrote:
> On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> >
> > If there is any chance at all of racing with mark_bg_unused, better safe
> > than sorry.
> >
> > Otherwise we risk the following interleaving (bg_list refcount in parens)
> >
> > T1 (some random op)                         T2 (mark_bg_unused)
> 
> mark_bg_unused -> btrfs_mark_bg_unused
> 
> Please use complete names everywhere.
> 
> >                                         !list_empty(&bg->bg_list); (1)
> > list_del_init(&bg->bg_list); (1)
> >                                         list_move_tail (1)
> > btrfs_put_block_group (0)
> >                                         btrfs_delete_unused_bgs
> >                                              bg = list_first_entry
> >                                              list_del_init(&bg->bg_list);
> >                                              btrfs_put_block_group(bg); (-1)
> >
> > Ultimately, this results in a broken ref count that hits zero one deref
> > early and the real final deref underflows the refcount, resulting in a WARNING.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/extent-tree.c | 3 +++
> >  fs/btrfs/transaction.c | 5 +++++
> >  2 files changed, 8 insertions(+)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 5de1a1293c93..80560065cc1b 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2868,7 +2868,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> >                                                    block_group->length,
> >                                                    &trimmed);
> >
> > +               spin_lock(&fs_info->unused_bgs_lock);
> >                 list_del_init(&block_group->bg_list);
> > +               spin_unlock(&fs_info->unused_bgs_lock);
> 
> This shouldn't need the lock, because block groups added to the
> transaction->deleted_bgs list were made RO only before at
> btrfs_delete_unused_bgs().
> 

My thinking is that it is a lot easier to reason about this if we also
lock it when modifying the bg_list. That will insulate us against any
possible bugs with different uses of bg_list attaching to various lists.

When hunting for "broken refcount maybe" this time around, I had to dig
through all of these and carefully analyze them.

I agree with you that these are probably not strictly necessary in any
way, which is partly why I made them a separate patch from the one I
think is a bug. Perhaps I should retitle the patch to not use the terms
"fix" and "race" and instead call it something like:
"harden uses of bg_list against possible races" or something?

> > +
> >                 btrfs_unfreeze_block_group(block_group);
> >                 btrfs_put_block_group(block_group);
> >
> > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > index db8fe291d010..c98a8efd1bea 100644
> > --- a/fs/btrfs/transaction.c
> > +++ b/fs/btrfs/transaction.c
> > @@ -160,7 +160,9 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
> >                         cache = list_first_entry(&transaction->deleted_bgs,
> >                                                  struct btrfs_block_group,
> >                                                  bg_list);
> > +                       spin_lock(&transaction->fs_info->unused_bgs_lock);
> >                         list_del_init(&cache->bg_list);
> > +                       spin_unlock(&transaction->fs_info->unused_bgs_lock);
> 
> In the transaction abort path no else should be messing up with the list too.
> 
> >                         btrfs_unfreeze_block_group(cache);
> >                         btrfs_put_block_group(cache);
> >                 }
> > @@ -2096,7 +2098,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
> >
> >         list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
> >                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> > +              spin_lock(&fs_info->unused_bgs_lock);
> >                 list_del_init(&block_group->bg_list);
> > +              btrfs_put_block_group(block_group);
> > +              spin_unlock(&fs_info->unused_bgs_lock);
> 
> What's this new btrfs_put_block_group() for? I don't see a matching
> ref count increase in the patch.
> Or is this fixing a separate bug? Where's the matching ref count
> increase in the codebase?

Sloppy to include it here, sorry. I can pull it out separately if you
like.

The intention of this change is to simply be disciplined about
maintaining the invariant that "bg is linked to a list via bg_list iff
bg refcount is incremented". That way we can confidently assert that the
list should be empty upon deletion, and catch more bugs, for example.

It certainly matters the least on transaction abort, when the state is
messed up anyway.

> 
> As before, we're in the transaction abort path, no one should be
> messing with the list too.
> 
> I don't mind adding the locking just to be safe, but leaving a comment
> to mention it shouldn't be needed and why there's concurrency expected
> in these cases would be nice.

I can definitely add a comment to the ones we expect don't care. (As
well as the re-titling I suggested above)

> 
> Thanks.
> 
> >         }
> >  }
> 
> >
> > --
> > 2.48.1
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races
  2025-03-07 21:37     ` Boris Burkov
@ 2025-03-10 12:47       ` Filipe Manana
  0 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2025-03-10 12:47 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 9:36 PM Boris Burkov <boris@bur.io> wrote:
>
> On Fri, Mar 07, 2025 at 02:24:34PM +0000, Filipe Manana wrote:
> > On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > If there is any chance at all of racing with mark_bg_unused, better safe
> > > than sorry.
> > >
> > > Otherwise we risk the following interleaving (bg_list refcount in parens)
> > >
> > > T1 (some random op)                         T2 (mark_bg_unused)
> >
> > mark_bg_unused -> btrfs_mark_bg_unused
> >
> > Please use complete names everywhere.
> >
> > >                                         !list_empty(&bg->bg_list); (1)
> > > list_del_init(&bg->bg_list); (1)
> > >                                         list_move_tail (1)
> > > btrfs_put_block_group (0)
> > >                                         btrfs_delete_unused_bgs
> > >                                              bg = list_first_entry
> > >                                              list_del_init(&bg->bg_list);
> > >                                              btrfs_put_block_group(bg); (-1)
> > >
> > > Ultimately, this results in a broken ref count that hits zero one deref
> > > early and the real final deref underflows the refcount, resulting in a WARNING.
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/extent-tree.c | 3 +++
> > >  fs/btrfs/transaction.c | 5 +++++
> > >  2 files changed, 8 insertions(+)
> > >
> > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > > index 5de1a1293c93..80560065cc1b 100644
> > > --- a/fs/btrfs/extent-tree.c
> > > +++ b/fs/btrfs/extent-tree.c
> > > @@ -2868,7 +2868,10 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans)
> > >                                                    block_group->length,
> > >                                                    &trimmed);
> > >
> > > +               spin_lock(&fs_info->unused_bgs_lock);
> > >                 list_del_init(&block_group->bg_list);
> > > +               spin_unlock(&fs_info->unused_bgs_lock);
> >
> > This shouldn't need the lock, because block groups added to the
> > transaction->deleted_bgs list were made RO only before at
> > btrfs_delete_unused_bgs().
> >
>
> My thinking is that it is a lot easier to reason about this if we also
> lock it when modifying the bg_list. That will insulate us against any
> possible bugs with different uses of bg_list attaching to various lists.

Sure, I get it, while it may not be a problem today, with code changes
elsewhere, it may become a problem.

>
> When hunting for "broken refcount maybe" this time around, I had to dig
> through all of these and carefully analyze them.
>
> I agree with you that these are probably not strictly necessary in any
> way, which is partly why I made them a separate patch from the one I
> think is a bug. Perhaps I should retitle the patch to not use the terms
> "fix" and "race" and instead call it something like:
> "harden uses of bg_list against possible races" or something?

Yes, the "harden ..." is a better choice IMO.

>
> > > +
> > >                 btrfs_unfreeze_block_group(block_group);
> > >                 btrfs_put_block_group(block_group);
> > >
> > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> > > index db8fe291d010..c98a8efd1bea 100644
> > > --- a/fs/btrfs/transaction.c
> > > +++ b/fs/btrfs/transaction.c
> > > @@ -160,7 +160,9 @@ void btrfs_put_transaction(struct btrfs_transaction *transaction)
> > >                         cache = list_first_entry(&transaction->deleted_bgs,
> > >                                                  struct btrfs_block_group,
> > >                                                  bg_list);
> > > +                       spin_lock(&transaction->fs_info->unused_bgs_lock);
> > >                         list_del_init(&cache->bg_list);
> > > +                       spin_unlock(&transaction->fs_info->unused_bgs_lock);
> >
> > In the transaction abort path no else should be messing up with the list too.
> >
> > >                         btrfs_unfreeze_block_group(cache);
> > >                         btrfs_put_block_group(cache);
> > >                 }
> > > @@ -2096,7 +2098,10 @@ static void btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans)
> > >
> > >         list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) {
> > >                 btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
> > > +              spin_lock(&fs_info->unused_bgs_lock);
> > >                 list_del_init(&block_group->bg_list);
> > > +              btrfs_put_block_group(block_group);
> > > +              spin_unlock(&fs_info->unused_bgs_lock);
> >
> > What's this new btrfs_put_block_group() for? I don't see a matching
> > ref count increase in the patch.
> > Or is this fixing a separate bug? Where's the matching ref count
> > increase in the codebase?
>
> Sloppy to include it here, sorry. I can pull it out separately if you
> like.

Yeah, seeing the reply on patch 4/5, this put should've been part of
patch 4/5 and not this one.
So just move it to that patch, it doesn't make sense here and actually
introduces a bug if the other patch is not applied.

Thanks.

>
> The intention of this change is to simply be disciplined about
> maintaining the invariant that "bg is linked to a list via bg_list iff
> bg refcount is incremented". That way we can confidently assert that the
> list should be empty upon deletion, and catch more bugs, for example.
>
> It certainly matters the least on transaction abort, when the state is
> messed up anyway.
>
> >
> > As before, we're in the transaction abort path, no one should be
> > messing with the list too.
> >
> > I don't mind adding the locking just to be safe, but leaving a comment
> > to mention it shouldn't be needed and why there's concurrency expected
> > in these cases would be nice.
>
> I can definitely add a comment to the ones we expect don't care. (As
> well as the re-titling I suggested above)
>
> >
> > Thanks.
> >
> > >         }
> > >  }
> >
> > >
> > > --
> > > 2.48.1
> > >
> > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit
  2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
  2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
  2025-03-07  0:29 ` [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races Boris Burkov
@ 2025-03-07  0:29 ` Boris Burkov
  2025-03-07 14:33   ` Filipe Manana
  2025-03-07  0:29 ` [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
  2025-03-07  0:29 ` [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
  4 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently, the async discard machinery owns a ref to the block_group
when the block_group is queued on a discard list. However, to handle
races with discard cancellation and the discard workfn, we have some
cute logic to detect that the block_group is *currently* running in the
workfn, to protect the workfn's usage amidst cancellation.

As far as I can tell, this doesn't have any overt bugs (though
finish_discard_pass and remove_from_discard_list racing can have a
surprising outcome for the caller of remove_from_discard_list in that it
is again added at the end)

But it is needlessly complicated to rely on locking and the nullity of
discard_ctl->block_group. Simplify this significantly by just taking a
refcount while we are in the workfn and uncondtionally drop it in both
the remove and workfn paths, regardless of if they race.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/discard.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index e815d165cccc..d6eef4bd9e9d 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -167,13 +167,7 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
 	block_group->discard_eligible_time = 0;
 	queued = !list_empty(&block_group->discard_list);
 	list_del_init(&block_group->discard_list);
-	/*
-	 * If the block group is currently running in the discard workfn, we
-	 * don't want to deref it, since it's still being used by the workfn.
-	 * The workfn will notice this case and deref the block group when it is
-	 * finished.
-	 */
-	if (queued && !running)
+	if (queued)
 		btrfs_put_block_group(block_group);
 
 	spin_unlock(&discard_ctl->lock);
@@ -260,9 +254,10 @@ static struct btrfs_block_group *peek_discard_list(
 			block_group->discard_cursor = block_group->start;
 			block_group->discard_state = BTRFS_DISCARD_EXTENTS;
 		}
-		discard_ctl->block_group = block_group;
 	}
 	if (block_group) {
+		btrfs_get_block_group(block_group);
+		discard_ctl->block_group = block_group;
 		*discard_state = block_group->discard_state;
 		*discard_index = block_group->discard_index;
 	}
@@ -493,9 +488,20 @@ static void btrfs_discard_workfn(struct work_struct *work)
 
 	block_group = peek_discard_list(discard_ctl, &discard_state,
 					&discard_index, now);
-	if (!block_group || !btrfs_run_discard_work(discard_ctl))
+	if (!block_group)
 		return;
+	if (!btrfs_run_discard_work(discard_ctl)) {
+		spin_lock(&discard_ctl->lock);
+		btrfs_put_block_group(block_group);
+		discard_ctl->block_group = NULL;
+		spin_unlock(&discard_ctl->lock);
+		return;
+	}
 	if (now < block_group->discard_eligible_time) {
+		spin_lock(&discard_ctl->lock);
+		btrfs_put_block_group(block_group);
+		discard_ctl->block_group = NULL;
+		spin_unlock(&discard_ctl->lock);
 		btrfs_discard_schedule_work(discard_ctl, false);
 		return;
 	}
@@ -547,15 +553,7 @@ static void btrfs_discard_workfn(struct work_struct *work)
 	spin_lock(&discard_ctl->lock);
 	discard_ctl->prev_discard = trimmed;
 	discard_ctl->prev_discard_time = now;
-	/*
-	 * If the block group was removed from the discard list while it was
-	 * running in this workfn, then we didn't deref it, since this function
-	 * still owned that reference. But we set the discard_ctl->block_group
-	 * back to NULL, so we can use that condition to know that now we need
-	 * to deref the block_group.
-	 */
-	if (discard_ctl->block_group == NULL)
-		btrfs_put_block_group(block_group);
+	btrfs_put_block_group(block_group);
 	discard_ctl->block_group = NULL;
 	__btrfs_discard_schedule_work(discard_ctl, now, false);
 	spin_unlock(&discard_ctl->lock);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit
  2025-03-07  0:29 ` [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
@ 2025-03-07 14:33   ` Filipe Manana
  0 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2025-03-07 14:33 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
>
> Currently, the async discard machinery owns a ref to the block_group
> when the block_group is queued on a discard list. However, to handle
> races with discard cancellation and the discard workfn, we have some
> cute logic to detect that the block_group is *currently* running in the
> workfn, to protect the workfn's usage amidst cancellation.
>
> As far as I can tell, this doesn't have any overt bugs (though
> finish_discard_pass and remove_from_discard_list racing can have a
> surprising outcome for the caller of remove_from_discard_list in that it
> is again added at the end)
>
> But it is needlessly complicated to rely on locking and the nullity of
> discard_ctl->block_group. Simplify this significantly by just taking a
> refcount while we are in the workfn and uncondtionally drop it in both
> the remove and workfn paths, regardless of if they race.
>
> Signed-off-by: Boris Burkov <boris@bur.io>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/discard.c | 34 ++++++++++++++++------------------
>  1 file changed, 16 insertions(+), 18 deletions(-)
>
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index e815d165cccc..d6eef4bd9e9d 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -167,13 +167,7 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl,
>         block_group->discard_eligible_time = 0;
>         queued = !list_empty(&block_group->discard_list);
>         list_del_init(&block_group->discard_list);
> -       /*
> -        * If the block group is currently running in the discard workfn, we
> -        * don't want to deref it, since it's still being used by the workfn.
> -        * The workfn will notice this case and deref the block group when it is
> -        * finished.
> -        */
> -       if (queued && !running)
> +       if (queued)
>                 btrfs_put_block_group(block_group);
>
>         spin_unlock(&discard_ctl->lock);
> @@ -260,9 +254,10 @@ static struct btrfs_block_group *peek_discard_list(
>                         block_group->discard_cursor = block_group->start;
>                         block_group->discard_state = BTRFS_DISCARD_EXTENTS;
>                 }
> -               discard_ctl->block_group = block_group;
>         }
>         if (block_group) {
> +               btrfs_get_block_group(block_group);
> +               discard_ctl->block_group = block_group;
>                 *discard_state = block_group->discard_state;
>                 *discard_index = block_group->discard_index;
>         }
> @@ -493,9 +488,20 @@ static void btrfs_discard_workfn(struct work_struct *work)
>
>         block_group = peek_discard_list(discard_ctl, &discard_state,
>                                         &discard_index, now);
> -       if (!block_group || !btrfs_run_discard_work(discard_ctl))
> +       if (!block_group)
>                 return;
> +       if (!btrfs_run_discard_work(discard_ctl)) {
> +               spin_lock(&discard_ctl->lock);
> +               btrfs_put_block_group(block_group);
> +               discard_ctl->block_group = NULL;
> +               spin_unlock(&discard_ctl->lock);
> +               return;
> +       }
>         if (now < block_group->discard_eligible_time) {
> +               spin_lock(&discard_ctl->lock);
> +               btrfs_put_block_group(block_group);
> +               discard_ctl->block_group = NULL;
> +               spin_unlock(&discard_ctl->lock);
>                 btrfs_discard_schedule_work(discard_ctl, false);
>                 return;
>         }
> @@ -547,15 +553,7 @@ static void btrfs_discard_workfn(struct work_struct *work)
>         spin_lock(&discard_ctl->lock);
>         discard_ctl->prev_discard = trimmed;
>         discard_ctl->prev_discard_time = now;
> -       /*
> -        * If the block group was removed from the discard list while it was
> -        * running in this workfn, then we didn't deref it, since this function
> -        * still owned that reference. But we set the discard_ctl->block_group
> -        * back to NULL, so we can use that condition to know that now we need
> -        * to deref the block_group.
> -        */
> -       if (discard_ctl->block_group == NULL)
> -               btrfs_put_block_group(block_group);
> +       btrfs_put_block_group(block_group);
>         discard_ctl->block_group = NULL;
>         __btrfs_discard_schedule_work(discard_ctl, now, false);
>         spin_unlock(&discard_ctl->lock);
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
                   ` (2 preceding siblings ...)
  2025-03-07  0:29 ` [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
@ 2025-03-07  0:29 ` Boris Burkov
  2025-03-07 14:37   ` Filipe Manana
  2025-03-07  0:29 ` [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
  4 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

All other users of the bg_list list_head inc the refcount when adding to
a list and dec it when deleting from the list. Just for the sake of
uniformity and to try to avoid refcounting bugs, do it for this list as
well.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 2db1497b58d9..e4071897c9a8 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		spin_lock(&fs_info->unused_bgs_lock);
 		list_del_init(&block_group->bg_list);
 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
+		btrfs_put_block_group(block_group);
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		/*
@@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	}
 #endif
 
+	btrfs_get_block_group(cache);
 	list_add_tail(&cache->bg_list, &trans->new_bgs);
 	btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-07  0:29 ` [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
@ 2025-03-07 14:37   ` Filipe Manana
  2025-03-07 21:40     ` Boris Burkov
  0 siblings, 1 reply; 21+ messages in thread
From: Filipe Manana @ 2025-03-07 14:37 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
>
> All other users of the bg_list list_head inc the refcount when adding to
> a list and dec it when deleting from the list. Just for the sake of
> uniformity and to try to avoid refcounting bugs, do it for this list as
> well.

Please add a note that the reason why it's not ref counted is because
the list of new block groups belongs to a transaction handle, which is
local and therefore no other tasks can access it.

>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/block-group.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 2db1497b58d9..e4071897c9a8 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
>                 spin_lock(&fs_info->unused_bgs_lock);
>                 list_del_init(&block_group->bg_list);
>                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> +               btrfs_put_block_group(block_group);
>                 spin_unlock(&fs_info->unused_bgs_lock);
>
>                 /*
> @@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
>         }
>  #endif
>
> +       btrfs_get_block_group(cache);
>         list_add_tail(&cache->bg_list, &trans->new_bgs);
>         btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);

There's a missing btrfs_put_block_group() call at
btrfs_cleanup_pending_block_groups().

Thanks.


>
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-07 14:37   ` Filipe Manana
@ 2025-03-07 21:40     ` Boris Burkov
  2025-03-07 22:32       ` Boris Burkov
  2025-03-10 12:52       ` Filipe Manana
  0 siblings, 2 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-07 21:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Fri, Mar 07, 2025 at 02:37:28PM +0000, Filipe Manana wrote:
> On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> >
> > All other users of the bg_list list_head inc the refcount when adding to
> > a list and dec it when deleting from the list. Just for the sake of
> > uniformity and to try to avoid refcounting bugs, do it for this list as
> > well.
> 
> Please add a note that the reason why it's not ref counted is because
> the list of new block groups belongs to a transaction handle, which is
> local and therefore no other tasks can access it.
> 

Just to make sure, I understand you correctly: you'd like me to add this
as a historical note to the commit message? Happy to do so if that's what
you mean.

Otherwise, I'm confused about your intent. If you are saying that it's
better to not refcount it, then we can drop this patch, it's not a fix,
just another "try to establish invariants" kinda thing.

> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/block-group.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 2db1497b58d9..e4071897c9a8 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> >                 spin_lock(&fs_info->unused_bgs_lock);
> >                 list_del_init(&block_group->bg_list);
> >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > +               btrfs_put_block_group(block_group);
> >                 spin_unlock(&fs_info->unused_bgs_lock);
> >
> >                 /*
> > @@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
> >         }
> >  #endif
> >
> > +       btrfs_get_block_group(cache);
> >         list_add_tail(&cache->bg_list, &trans->new_bgs);
> >         btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
> 
> There's a missing btrfs_put_block_group() call at
> btrfs_cleanup_pending_block_groups().

Good catch, thanks.

> 
> Thanks.
> 
> 
> >
> > --
> > 2.48.1
> >
> >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-07 21:40     ` Boris Burkov
@ 2025-03-07 22:32       ` Boris Burkov
  2025-03-10 12:52       ` Filipe Manana
  1 sibling, 0 replies; 21+ messages in thread
From: Boris Burkov @ 2025-03-07 22:32 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs, kernel-team

On Fri, Mar 07, 2025 at 01:40:40PM -0800, Boris Burkov wrote:
> On Fri, Mar 07, 2025 at 02:37:28PM +0000, Filipe Manana wrote:
> > On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > All other users of the bg_list list_head inc the refcount when adding to
> > > a list and dec it when deleting from the list. Just for the sake of
> > > uniformity and to try to avoid refcounting bugs, do it for this list as
> > > well.
> > 
> > Please add a note that the reason why it's not ref counted is because
> > the list of new block groups belongs to a transaction handle, which is
> > local and therefore no other tasks can access it.
> > 
> 
> Just to make sure, I understand you correctly: you'd like me to add this
> as a historical note to the commit message? Happy to do so if that's what
> you mean.
> 
> Otherwise, I'm confused about your intent. If you are saying that it's
> better to not refcount it, then we can drop this patch, it's not a fix,
> just another "try to establish invariants" kinda thing.
> 
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/block-group.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > index 2db1497b58d9..e4071897c9a8 100644
> > > --- a/fs/btrfs/block-group.c
> > > +++ b/fs/btrfs/block-group.c
> > > @@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> > >                 spin_lock(&fs_info->unused_bgs_lock);
> > >                 list_del_init(&block_group->bg_list);
> > >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > > +               btrfs_put_block_group(block_group);
> > >                 spin_unlock(&fs_info->unused_bgs_lock);
> > >
> > >                 /*
> > > @@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
> > >         }
> > >  #endif
> > >
> > > +       btrfs_get_block_group(cache);
> > >         list_add_tail(&cache->bg_list, &trans->new_bgs);
> > >         btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
> > 
> > There's a missing btrfs_put_block_group() call at
> > btrfs_cleanup_pending_block_groups().
> 
> Good catch, thanks.
> 

Actually, just for the record, the btrfs_put_block_group *is* there,
it just was the mystery btrfs_put_block_group from the other patch
that you also noticed. So that put should be in this patch! I think
that will make both of them make more sense.

Good eye.

> > 
> > Thanks.
> > 
> > 
> > >
> > > --
> > > 2.48.1
> > >
> > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list
  2025-03-07 21:40     ` Boris Burkov
  2025-03-07 22:32       ` Boris Burkov
@ 2025-03-10 12:52       ` Filipe Manana
  1 sibling, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2025-03-10 12:52 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 9:39 PM Boris Burkov <boris@bur.io> wrote:
>
> On Fri, Mar 07, 2025 at 02:37:28PM +0000, Filipe Manana wrote:
> > On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
> > >
> > > All other users of the bg_list list_head inc the refcount when adding to
> > > a list and dec it when deleting from the list. Just for the sake of
> > > uniformity and to try to avoid refcounting bugs, do it for this list as
> > > well.
> >
> > Please add a note that the reason why it's not ref counted is because
> > the list of new block groups belongs to a transaction handle, which is
> > local and therefore no other tasks can access it.
> >
>
> Just to make sure, I understand you correctly: you'd like me to add this
> as a historical note to the commit message? Happy to do so if that's what
> you mean.

Yes, that's what I meant.

>
> Otherwise, I'm confused about your intent. If you are saying that it's
> better to not refcount it, then we can drop this patch, it's not a fix,
> just another "try to establish invariants" kinda thing.

Was just saying it's not needed due to being a local list and we can
detect if the block group is in that list with the
BLOCK_GROUP_FLAG_NEW flag.

>
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  fs/btrfs/block-group.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > index 2db1497b58d9..e4071897c9a8 100644
> > > --- a/fs/btrfs/block-group.c
> > > +++ b/fs/btrfs/block-group.c
> > > @@ -2801,6 +2801,7 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
> > >                 spin_lock(&fs_info->unused_bgs_lock);
> > >                 list_del_init(&block_group->bg_list);
> > >                 clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
> > > +               btrfs_put_block_group(block_group);
> > >                 spin_unlock(&fs_info->unused_bgs_lock);
> > >
> > >                 /*
> > > @@ -2939,6 +2940,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
> > >         }
> > >  #endif
> > >
> > > +       btrfs_get_block_group(cache);
> > >         list_add_tail(&cache->bg_list, &trans->new_bgs);
> > >         btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
> >
> > There's a missing btrfs_put_block_group() call at
> > btrfs_cleanup_pending_block_groups().
>
> Good catch, thanks.

Yep, as you noticed later, it was added in patch 2/5 instead of this one (4/5).
So it just needs to be moved from that patch to this one.

Thanks.

>
> >
> > Thanks.
> >
> >
> > >
> > > --
> > > 2.48.1
> > >
> > >

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list
  2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
                   ` (3 preceding siblings ...)
  2025-03-07  0:29 ` [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
@ 2025-03-07  0:29 ` Boris Burkov
  2025-03-07 14:45   ` Filipe Manana
  4 siblings, 1 reply; 21+ messages in thread
From: Boris Burkov @ 2025-03-07  0:29 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Similar to mark_bg_unused and mark_bg_to_reclaim, we have a few places
that use bg_list with refcounting, mostly for retrying failures to
reclaim/delete unused.

These have custom logic for handling locking and refcounting the bg_list
properly, but they actually all want to do the same thing, so pull that
logic out into a helper. Unfortunately, mark_bg_unused does still need
the NEW flag to avoid prematurely marking stuff unused (even if refcount
is fine, we don't want to mess with bg creation), so it cannot use the
new helper.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/block-group.c | 54 +++++++++++++++++++++++-------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e4071897c9a8..a570d89ff0c3 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1455,6 +1455,31 @@ static bool clean_pinned_extents(struct btrfs_trans_handle *trans,
 	return ret == 0;
 }
 
+/*
+ * link the block_group to l via bg_list
+ *
+ * Use this rather than list_add_tail directly to ensure proper respect
+ * to locking and refcounting.
+ *
+ * @bg: the block_group to link to the list.
+ * @l: the list to link it to.
+ * Returns: true if the bg was linked with a refcount bump and false otherwise.
+ */
+static bool btrfs_link_bg_list(struct btrfs_block_group *bg, struct list_head *l)
+{
+	struct btrfs_fs_info *fs_info = bg->fs_info;
+	bool added = false;
+
+	spin_lock(&fs_info->unused_bgs_lock);
+	if (list_empty(&bg->bg_list)) {
+		btrfs_get_block_group(bg);
+		list_add_tail(&bg->bg_list, l);
+		added = true;
+	}
+	spin_unlock(&fs_info->unused_bgs_lock);
+	return added;
+}
+
 /*
  * Process the unused_bgs list and remove any that don't have any allocated
  * space inside of them.
@@ -1570,8 +1595,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 			 * drop under the "next" label for the
 			 * fs_info->unused_bgs list.
 			 */
-			btrfs_get_block_group(block_group);
-			list_add_tail(&block_group->bg_list, &retry_list);
+			btrfs_link_bg_list(block_group, &retry_list);
 
 			trace_btrfs_skip_unused_block_group(block_group);
 			spin_unlock(&block_group->lock);
@@ -1968,20 +1992,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 		spin_unlock(&space_info->lock);
 
 next:
-		if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
-			/* Refcount held by the reclaim_bgs list after splice. */
-			spin_lock(&fs_info->unused_bgs_lock);
-			/*
-			 * This block group might be added to the unused list
-			 * during the above process. Move it back to the
-			 * reclaim list otherwise.
-			 */
-			if (list_empty(&bg->bg_list)) {
-				btrfs_get_block_group(bg);
-				list_add_tail(&bg->bg_list, &retry_list);
-			}
-			spin_unlock(&fs_info->unused_bgs_lock);
-		}
+		if (ret && !READ_ONCE(space_info->periodic_reclaim))
+			btrfs_link_bg_list(bg, &retry_list);
 		btrfs_put_block_group(bg);
 
 		mutex_unlock(&fs_info->reclaim_bgs_lock);
@@ -2021,13 +2033,8 @@ void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
 {
 	struct btrfs_fs_info *fs_info = bg->fs_info;
 
-	spin_lock(&fs_info->unused_bgs_lock);
-	if (list_empty(&bg->bg_list)) {
-		btrfs_get_block_group(bg);
+	if (btrfs_link_bg_list(bg, &fs_info->reclaim_bgs))
 		trace_btrfs_add_reclaim_block_group(bg);
-		list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
-	}
-	spin_unlock(&fs_info->unused_bgs_lock);
 }
 
 static int read_bg_from_eb(struct btrfs_fs_info *fs_info, const struct btrfs_key *key,
@@ -2940,8 +2947,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	}
 #endif
 
-	btrfs_get_block_group(cache);
-	list_add_tail(&cache->bg_list, &trans->new_bgs);
+	btrfs_link_bg_list(cache, &trans->new_bgs);
 	btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
 
 	set_avail_alloc_bits(fs_info, type);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list
  2025-03-07  0:29 ` [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
@ 2025-03-07 14:45   ` Filipe Manana
  0 siblings, 0 replies; 21+ messages in thread
From: Filipe Manana @ 2025-03-07 14:45 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs, kernel-team

On Fri, Mar 7, 2025 at 12:31 AM Boris Burkov <boris@bur.io> wrote:
>
> Similar to mark_bg_unused and mark_bg_to_reclaim, we have a few places
> that use bg_list with refcounting, mostly for retrying failures to
> reclaim/delete unused.
>
> These have custom logic for handling locking and refcounting the bg_list
> properly, but they actually all want to do the same thing, so pull that
> logic out into a helper. Unfortunately, mark_bg_unused does still need
> the NEW flag to avoid prematurely marking stuff unused (even if refcount
> is fine, we don't want to mess with bg creation), so it cannot use the
> new helper.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/block-group.c | 54 +++++++++++++++++++++++-------------------
>  1 file changed, 30 insertions(+), 24 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index e4071897c9a8..a570d89ff0c3 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1455,6 +1455,31 @@ static bool clean_pinned_extents(struct btrfs_trans_handle *trans,
>         return ret == 0;
>  }
>
> +/*
> + * link the block_group to l via bg_list

Please make the style consistent with the rest, first word capitalized
and ending the sentence with punctuation.

> + *
> + * Use this rather than list_add_tail directly to ensure proper respect
> + * to locking and refcounting.
> + *
> + * @bg: the block_group to link to the list.
> + * @l: the list to link it to.
> + * Returns: true if the bg was linked with a refcount bump and false otherwise.
> + */
> +static bool btrfs_link_bg_list(struct btrfs_block_group *bg, struct list_head *l)

Don't use single letter variable names, it's discouraged, except for
loop index variables like 'i' or 'j' for inner loops.
Use something else like 'list' for example.

Otherwise it looks fine.

With that:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +{
> +       struct btrfs_fs_info *fs_info = bg->fs_info;
> +       bool added = false;
> +
> +       spin_lock(&fs_info->unused_bgs_lock);
> +       if (list_empty(&bg->bg_list)) {
> +               btrfs_get_block_group(bg);
> +               list_add_tail(&bg->bg_list, l);
> +               added = true;
> +       }
> +       spin_unlock(&fs_info->unused_bgs_lock);
> +       return added;
> +}
> +
>  /*
>   * Process the unused_bgs list and remove any that don't have any allocated
>   * space inside of them.
> @@ -1570,8 +1595,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                          * drop under the "next" label for the
>                          * fs_info->unused_bgs list.
>                          */
> -                       btrfs_get_block_group(block_group);
> -                       list_add_tail(&block_group->bg_list, &retry_list);
> +                       btrfs_link_bg_list(block_group, &retry_list);
>
>                         trace_btrfs_skip_unused_block_group(block_group);
>                         spin_unlock(&block_group->lock);
> @@ -1968,20 +1992,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>                 spin_unlock(&space_info->lock);
>
>  next:
> -               if (ret && !READ_ONCE(space_info->periodic_reclaim)) {
> -                       /* Refcount held by the reclaim_bgs list after splice. */
> -                       spin_lock(&fs_info->unused_bgs_lock);
> -                       /*
> -                        * This block group might be added to the unused list
> -                        * during the above process. Move it back to the
> -                        * reclaim list otherwise.
> -                        */
> -                       if (list_empty(&bg->bg_list)) {
> -                               btrfs_get_block_group(bg);
> -                               list_add_tail(&bg->bg_list, &retry_list);
> -                       }
> -                       spin_unlock(&fs_info->unused_bgs_lock);
> -               }
> +               if (ret && !READ_ONCE(space_info->periodic_reclaim))
> +                       btrfs_link_bg_list(bg, &retry_list);
>                 btrfs_put_block_group(bg);
>
>                 mutex_unlock(&fs_info->reclaim_bgs_lock);
> @@ -2021,13 +2033,8 @@ void btrfs_mark_bg_to_reclaim(struct btrfs_block_group *bg)
>  {
>         struct btrfs_fs_info *fs_info = bg->fs_info;
>
> -       spin_lock(&fs_info->unused_bgs_lock);
> -       if (list_empty(&bg->bg_list)) {
> -               btrfs_get_block_group(bg);
> +       if (btrfs_link_bg_list(bg, &fs_info->reclaim_bgs))
>                 trace_btrfs_add_reclaim_block_group(bg);
> -               list_add_tail(&bg->bg_list, &fs_info->reclaim_bgs);
> -       }
> -       spin_unlock(&fs_info->unused_bgs_lock);
>  }
>
>  static int read_bg_from_eb(struct btrfs_fs_info *fs_info, const struct btrfs_key *key,
> @@ -2940,8 +2947,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
>         }
>  #endif
>
> -       btrfs_get_block_group(cache);
> -       list_add_tail(&cache->bg_list, &trans->new_bgs);
> +       btrfs_link_bg_list(cache, &trans->new_bgs);
>         btrfs_inc_delayed_refs_rsv_bg_inserts(fs_info);
>
>         set_avail_alloc_bits(fs_info, type);
> --
> 2.48.1
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-03-10 19:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-07  0:29 [PATCH 0/5] btrfs: block_group refcounting fixes Boris Burkov
2025-03-07  0:29 ` [PATCH 1/5] btrfs: fix bg refcount race in btrfs_create_pending_block_groups Boris Burkov
2025-03-07  6:51   ` Qu Wenruo
2025-03-07 14:13   ` Filipe Manana
2025-03-07 21:32     ` Boris Burkov
2025-03-10 12:41       ` Filipe Manana
2025-03-10 19:28         ` Boris Burkov
2025-03-07  0:29 ` [PATCH 2/5] btrfs: fix bg->bg_list list_del refcount races Boris Burkov
2025-03-07  6:52   ` Qu Wenruo
2025-03-07 14:24   ` Filipe Manana
2025-03-07 21:37     ` Boris Burkov
2025-03-10 12:47       ` Filipe Manana
2025-03-07  0:29 ` [PATCH 3/5] btrfs: make discard_workfn block_group ref explicit Boris Burkov
2025-03-07 14:33   ` Filipe Manana
2025-03-07  0:29 ` [PATCH 4/5] btrfs: explicitly ref count block_group on new_bgs list Boris Burkov
2025-03-07 14:37   ` Filipe Manana
2025-03-07 21:40     ` Boris Burkov
2025-03-07 22:32       ` Boris Burkov
2025-03-10 12:52       ` Filipe Manana
2025-03-07  0:29 ` [PATCH 5/5] btrfs: codify pattern for adding block_group to bg_list Boris Burkov
2025-03-07 14:45   ` Filipe Manana

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox