[PATCH 0/2] btrfs: fix an async discard infinite loop when disabling it

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] btrfs: fix an async discard infinite loop when disabling it
@ 2025-05-05 15:49 fdmanana
  2025-05-05 15:49 ` [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard fdmanana
  2025-05-05 15:49 ` [PATCH 2/2] btrfs: use verbose assert at peek_discard_list() fdmanana
  0 siblings, 2 replies; 7+ messages in thread
From: fdmanana @ 2025-05-05 15:49 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Details in the change logs.

Filipe Manana (2):
  btrfs: fix discard worker infinite loop after disabling discard
  btrfs: use verbose assert at peek_discard_list()

 fs/btrfs/discard.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

-- 
2.47.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard
  2025-05-05 15:49 [PATCH 0/2] btrfs: fix an async discard infinite loop when disabling it fdmanana
@ 2025-05-05 15:49 ` fdmanana
  2025-05-05 17:20   ` Boris Burkov
  2025-05-06 16:55   ` Daniel Vacek
  2025-05-05 15:49 ` [PATCH 2/2] btrfs: use verbose assert at peek_discard_list() fdmanana
  1 sibling, 2 replies; 7+ messages in thread
From: fdmanana @ 2025-05-05 15:49 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If the discard worker is running and there's currently only one block
group, that block group is a data block group, it's in the unused block
groups discard list and is being used (it got an extent allocated from it
after becoming unused), the worker can end up in an infinite loop if a
transaction abort happens or the async discard is disabled (during remount
or unmount for example).

This happens like this:

1) Task A, the discard worker, is at peek_discard_list() and
   find_next_block_group() returns block group X;

2) Block group X is in the unused block groups discard list (its discard
   index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
   it become an unused block group and was added to that list, but then
   later it got an extent allocated from it, so its ->used counter is not
   zero anymore;

3) The current transaction is aborted by task B and we end up at
   __btrfs_handle_fs_error() in the transaction abort path, where we call
   btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
   fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
   (setting SB_RDONLY in the super block's s_flags field);

4) Task A calls __add_to_discard_list() with the goal of moving the block
   group from the unused block groups discard list into another discard
   list, but at __add_to_discard_list() we end up doing nothing because
   btrfs_run_discard_work() returns false, since the super block has
   SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
   anymore in fs_info->flags. So block group X remains in the unused block
   groups discard list;

5) Task A then does a goto into the 'again' label, calls
   find_next_block_group() again we gets block group X again. Then it
   repeats the previous steps over and over since there are not other
   block groups in the discard lists and block group X is never moved
   out of the unused block groups discard list since
   btrfs_run_discard_work() keeps return false and there
   __add_to_discard_list() doesn't move block group X out of that discard
   list.

When this happens we can get a soft lockup report like this:

   [   71.957163][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
   [   71.957169][    C0] Modules linked in: xfs af_packet rfkill (...)
   [   71.957236][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W          6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
   [   71.957241][    C0] Tainted: [W]=WARN
   [   71.957242][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
   [   71.957244][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
   [   71.957307][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
   [   71.957355][    C0] Code: c1 01 48 83 (...)
   [   71.957357][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
   [   71.957359][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
   [   71.957360][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
   [   71.957361][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
   [   71.957362][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
   [   71.957363][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
   [   71.957368][    C0] FS:  0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
   [   71.957369][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [   71.957370][    C0] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
   [   71.957373][    C0] PKRU: 55555554
   [   71.957374][    C0] Call Trace:
   [   71.957377][    C0]  <TASK>
   [   71.957381][    C0]  process_one_work+0x17e/0x330
   [   71.957386][    C0]  worker_thread+0x2ce/0x3f0
   [   71.957389][    C0]  ? __pfx_worker_thread+0x10/0x10
   [   71.957391][    C0]  kthread+0xef/0x220
   [   71.957394][    C0]  ? __pfx_kthread+0x10/0x10
   [   71.957397][    C0]  ret_from_fork+0x34/0x50
   [   71.957401][    C0]  ? __pfx_kthread+0x10/0x10
   [   71.957403][    C0]  ret_from_fork_asm+0x1a/0x30
   [   71.957410][    C0]  </TASK>
   [   71.957412][    C0] Kernel panic - not syncing: softlockup: hung tasks
   [   71.987152][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W    L     6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
   [   71.989063][    C0] Tainted: [W]=WARN, [L]=SOFTLOCKUP
   [   71.989934][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
   [   71.991832][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
   [   71.992925][    C0] Call Trace:
   [   71.993601][    C0]  <IRQ>
   [   71.994202][    C0]  dump_stack_lvl+0x5a/0x80
   [   71.994923][    C0]  panic+0x10b/0x2da
   [   71.995582][    C0]  watchdog_timer_fn.cold+0x9a/0xa1
   [   71.996366][    C0]  ? __pfx_watchdog_timer_fn+0x10/0x10
   [   71.997148][    C0]  __hrtimer_run_queues+0x132/0x2a0
   [   71.997913][    C0]  hrtimer_interrupt+0xff/0x230
   [   71.998630][    C0]  __sysvec_apic_timer_interrupt+0x55/0x100
   [   71.999435][    C0]  sysvec_apic_timer_interrupt+0x6c/0x90
   [   72.000208][    C0]  </IRQ>
   [   72.000732][    C0]  <TASK>
   [   72.001250][    C0]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
   [   72.002056][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
   [   72.002950][    C0] Code: c1 01 48 83 (...)
   [   72.005191][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
   [   72.006011][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
   [   72.006978][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
   [   72.007939][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
   [   72.008895][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
   [   72.009861][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
   [   72.010825][    C0]  ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
   [   72.011978][    C0]  process_one_work+0x17e/0x330
   [   72.012663][    C0]  worker_thread+0x2ce/0x3f0
   [   72.013325][    C0]  ? __pfx_worker_thread+0x10/0x10
   [   72.014044][    C0]  kthread+0xef/0x220
   [   72.014647][    C0]  ? __pfx_kthread+0x10/0x10
   [   72.015302][    C0]  ret_from_fork+0x34/0x50
   [   72.015939][    C0]  ? __pfx_kthread+0x10/0x10
   [   72.016592][    C0]  ret_from_fork_asm+0x1a/0x30
   [   72.017281][    C0]  </TASK>
   [   72.017931][    C0] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
   [   72.019478][    C0] Rebooting in 90 seconds..

So fix this by making sure we move a block group out of the unused block
groups discard list when calling __add_to_discard_list().

Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/discard.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index d6eef4bd9e9d..de23c4b3515e 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -94,8 +94,6 @@ static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 				  struct btrfs_block_group *block_group)
 {
 	lockdep_assert_held(&discard_ctl->lock);
-	if (!btrfs_run_discard_work(discard_ctl))
-		return;
 
 	if (list_empty(&block_group->discard_list) ||
 	    block_group->discard_index == BTRFS_DISCARD_INDEX_UNUSED) {
@@ -118,6 +116,9 @@ static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
 	if (!btrfs_is_block_group_data_only(block_group))
 		return;
 
+	if (!btrfs_run_discard_work(discard_ctl))
+		return;
+
 	spin_lock(&discard_ctl->lock);
 	__add_to_discard_list(discard_ctl, block_group);
 	spin_unlock(&discard_ctl->lock);
@@ -244,6 +245,18 @@ static struct btrfs_block_group *peek_discard_list(
 		    block_group->used != 0) {
 			if (btrfs_is_block_group_data_only(block_group)) {
 				__add_to_discard_list(discard_ctl, block_group);
+				/*
+				 * The block group must have been moved to other
+				 * discard list even if discard was disabled in
+				 * the meantime or a transaction abort happened,
+				 * otherwise we can end up in an infinite loop,
+				 * always jumping into the 'again' label and
+				 * keep getting this block group over and over
+				 * in case there are no other block groups in
+				 * the discard lists.
+				 */
+				ASSERT(block_group->discard_index !=
+				       BTRFS_DISCARD_INDEX_UNUSED);
 			} else {
 				list_del_init(&block_group->discard_list);
 				btrfs_put_block_group(block_group);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard
  2025-05-05 15:49 ` [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard fdmanana
@ 2025-05-05 17:20   ` Boris Burkov
  2025-05-06 16:55   ` Daniel Vacek
  1 sibling, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2025-05-05 17:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, May 05, 2025 at 04:49:55PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> If the discard worker is running and there's currently only one block
> group, that block group is a data block group, it's in the unused block
> groups discard list and is being used (it got an extent allocated from it
> after becoming unused), the worker can end up in an infinite loop if a
> transaction abort happens or the async discard is disabled (during remount
> or unmount for example).
> 
> This happens like this:
> 
> 1) Task A, the discard worker, is at peek_discard_list() and
>    find_next_block_group() returns block group X;
> 
> 2) Block group X is in the unused block groups discard list (its discard
>    index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
>    it become an unused block group and was added to that list, but then
>    later it got an extent allocated from it, so its ->used counter is not
>    zero anymore;
> 
> 3) The current transaction is aborted by task B and we end up at
>    __btrfs_handle_fs_error() in the transaction abort path, where we call
>    btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
>    fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
>    (setting SB_RDONLY in the super block's s_flags field);
> 
> 4) Task A calls __add_to_discard_list() with the goal of moving the block
>    group from the unused block groups discard list into another discard
>    list, but at __add_to_discard_list() we end up doing nothing because
>    btrfs_run_discard_work() returns false, since the super block has
>    SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
>    anymore in fs_info->flags. So block group X remains in the unused block
>    groups discard list;
> 
> 5) Task A then does a goto into the 'again' label, calls
>    find_next_block_group() again we gets block group X again. Then it
>    repeats the previous steps over and over since there are not other
>    block groups in the discard lists and block group X is never moved
>    out of the unused block groups discard list since
>    btrfs_run_discard_work() keeps return false and there
>    __add_to_discard_list() doesn't move block group X out of that discard
>    list.
> 
> When this happens we can get a soft lockup report like this:
> 
>    [   71.957163][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
>    [   71.957169][    C0] Modules linked in: xfs af_packet rfkill (...)
>    [   71.957236][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W          6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
>    [   71.957241][    C0] Tainted: [W]=WARN
>    [   71.957242][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>    [   71.957244][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
>    [   71.957307][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
>    [   71.957355][    C0] Code: c1 01 48 83 (...)
>    [   71.957357][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
>    [   71.957359][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
>    [   71.957360][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
>    [   71.957361][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
>    [   71.957362][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
>    [   71.957363][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
>    [   71.957368][    C0] FS:  0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
>    [   71.957369][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [   71.957370][    C0] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
>    [   71.957373][    C0] PKRU: 55555554
>    [   71.957374][    C0] Call Trace:
>    [   71.957377][    C0]  <TASK>
>    [   71.957381][    C0]  process_one_work+0x17e/0x330
>    [   71.957386][    C0]  worker_thread+0x2ce/0x3f0
>    [   71.957389][    C0]  ? __pfx_worker_thread+0x10/0x10
>    [   71.957391][    C0]  kthread+0xef/0x220
>    [   71.957394][    C0]  ? __pfx_kthread+0x10/0x10
>    [   71.957397][    C0]  ret_from_fork+0x34/0x50
>    [   71.957401][    C0]  ? __pfx_kthread+0x10/0x10
>    [   71.957403][    C0]  ret_from_fork_asm+0x1a/0x30
>    [   71.957410][    C0]  </TASK>
>    [   71.957412][    C0] Kernel panic - not syncing: softlockup: hung tasks
>    [   71.987152][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W    L     6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
>    [   71.989063][    C0] Tainted: [W]=WARN, [L]=SOFTLOCKUP
>    [   71.989934][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>    [   71.991832][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
>    [   71.992925][    C0] Call Trace:
>    [   71.993601][    C0]  <IRQ>
>    [   71.994202][    C0]  dump_stack_lvl+0x5a/0x80
>    [   71.994923][    C0]  panic+0x10b/0x2da
>    [   71.995582][    C0]  watchdog_timer_fn.cold+0x9a/0xa1
>    [   71.996366][    C0]  ? __pfx_watchdog_timer_fn+0x10/0x10
>    [   71.997148][    C0]  __hrtimer_run_queues+0x132/0x2a0
>    [   71.997913][    C0]  hrtimer_interrupt+0xff/0x230
>    [   71.998630][    C0]  __sysvec_apic_timer_interrupt+0x55/0x100
>    [   71.999435][    C0]  sysvec_apic_timer_interrupt+0x6c/0x90
>    [   72.000208][    C0]  </IRQ>
>    [   72.000732][    C0]  <TASK>
>    [   72.001250][    C0]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>    [   72.002056][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
>    [   72.002950][    C0] Code: c1 01 48 83 (...)
>    [   72.005191][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
>    [   72.006011][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
>    [   72.006978][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
>    [   72.007939][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
>    [   72.008895][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
>    [   72.009861][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
>    [   72.010825][    C0]  ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
>    [   72.011978][    C0]  process_one_work+0x17e/0x330
>    [   72.012663][    C0]  worker_thread+0x2ce/0x3f0
>    [   72.013325][    C0]  ? __pfx_worker_thread+0x10/0x10
>    [   72.014044][    C0]  kthread+0xef/0x220
>    [   72.014647][    C0]  ? __pfx_kthread+0x10/0x10
>    [   72.015302][    C0]  ret_from_fork+0x34/0x50
>    [   72.015939][    C0]  ? __pfx_kthread+0x10/0x10
>    [   72.016592][    C0]  ret_from_fork_asm+0x1a/0x30
>    [   72.017281][    C0]  </TASK>
>    [   72.017931][    C0] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>    [   72.019478][    C0] Rebooting in 90 seconds..
> 
> So fix this by making sure we move a block group out of the unused block
> groups discard list when calling __add_to_discard_list().
> 
> Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard")

Fix LGTM, thanks.

I think it wouldn't hurt to also add another layer of defense of bailing
out of the again: loop in peek_discard_list() if btrfs_run_discard_work()
returns false. OTOH, that might wallpaper over some other bug like this
one in a way that bites us.

Reviewed-by: Boris Burkov <boris@bur.io>

> Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/discard.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index d6eef4bd9e9d..de23c4b3515e 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -94,8 +94,6 @@ static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  				  struct btrfs_block_group *block_group)
>  {
>  	lockdep_assert_held(&discard_ctl->lock);
> -	if (!btrfs_run_discard_work(discard_ctl))
> -		return;
>  
>  	if (list_empty(&block_group->discard_list) ||
>  	    block_group->discard_index == BTRFS_DISCARD_INDEX_UNUSED) {
> @@ -118,6 +116,9 @@ static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>  	if (!btrfs_is_block_group_data_only(block_group))
>  		return;
>  
> +	if (!btrfs_run_discard_work(discard_ctl))
> +		return;
> +
>  	spin_lock(&discard_ctl->lock);
>  	__add_to_discard_list(discard_ctl, block_group);
>  	spin_unlock(&discard_ctl->lock);
> @@ -244,6 +245,18 @@ static struct btrfs_block_group *peek_discard_list(
>  		    block_group->used != 0) {
>  			if (btrfs_is_block_group_data_only(block_group)) {
>  				__add_to_discard_list(discard_ctl, block_group);
> +				/*
> +				 * The block group must have been moved to other
> +				 * discard list even if discard was disabled in
> +				 * the meantime or a transaction abort happened,
> +				 * otherwise we can end up in an infinite loop,
> +				 * always jumping into the 'again' label and
> +				 * keep getting this block group over and over
> +				 * in case there are no other block groups in
> +				 * the discard lists.
> +				 */
> +				ASSERT(block_group->discard_index !=
> +				       BTRFS_DISCARD_INDEX_UNUSED);
>  			} else {
>  				list_del_init(&block_group->discard_list);
>  				btrfs_put_block_group(block_group);
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard
  2025-05-05 15:49 ` [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard fdmanana
  2025-05-05 17:20   ` Boris Burkov
@ 2025-05-06 16:55   ` Daniel Vacek
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Vacek @ 2025-05-06 16:55 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, 5 May 2025 at 17:50, <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> If the discard worker is running and there's currently only one block
> group, that block group is a data block group, it's in the unused block
> groups discard list and is being used (it got an extent allocated from it
> after becoming unused), the worker can end up in an infinite loop if a
> transaction abort happens or the async discard is disabled (during remount
> or unmount for example).
>
> This happens like this:
>
> 1) Task A, the discard worker, is at peek_discard_list() and
>    find_next_block_group() returns block group X;
>
> 2) Block group X is in the unused block groups discard list (its discard
>    index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past
>    it become an unused block group and was added to that list, but then
>    later it got an extent allocated from it, so its ->used counter is not
>    zero anymore;
>
> 3) The current transaction is aborted by task B and we end up at
>    __btrfs_handle_fs_error() in the transaction abort path, where we call
>    btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from
>    fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode
>    (setting SB_RDONLY in the super block's s_flags field);
>
> 4) Task A calls __add_to_discard_list() with the goal of moving the block
>    group from the unused block groups discard list into another discard
>    list, but at __add_to_discard_list() we end up doing nothing because
>    btrfs_run_discard_work() returns false, since the super block has
>    SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set
>    anymore in fs_info->flags. So block group X remains in the unused block
>    groups discard list;
>
> 5) Task A then does a goto into the 'again' label, calls
>    find_next_block_group() again we gets block group X again. Then it
>    repeats the previous steps over and over since there are not other
>    block groups in the discard lists and block group X is never moved
>    out of the unused block groups discard list since
>    btrfs_run_discard_work() keeps return false and there
>    __add_to_discard_list() doesn't move block group X out of that discard
>    list.
>
> When this happens we can get a soft lockup report like this:
>
>    [   71.957163][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97]
>    [   71.957169][    C0] Modules linked in: xfs af_packet rfkill (...)
>    [   71.957236][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W          6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
>    [   71.957241][    C0] Tainted: [W]=WARN
>    [   71.957242][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>    [   71.957244][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
>    [   71.957307][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
>    [   71.957355][    C0] Code: c1 01 48 83 (...)
>    [   71.957357][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
>    [   71.957359][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
>    [   71.957360][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
>    [   71.957361][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
>    [   71.957362][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
>    [   71.957363][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
>    [   71.957368][    C0] FS:  0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000
>    [   71.957369][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [   71.957370][    C0] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0
>    [   71.957373][    C0] PKRU: 55555554
>    [   71.957374][    C0] Call Trace:
>    [   71.957377][    C0]  <TASK>
>    [   71.957381][    C0]  process_one_work+0x17e/0x330
>    [   71.957386][    C0]  worker_thread+0x2ce/0x3f0
>    [   71.957389][    C0]  ? __pfx_worker_thread+0x10/0x10
>    [   71.957391][    C0]  kthread+0xef/0x220
>    [   71.957394][    C0]  ? __pfx_kthread+0x10/0x10
>    [   71.957397][    C0]  ret_from_fork+0x34/0x50
>    [   71.957401][    C0]  ? __pfx_kthread+0x10/0x10
>    [   71.957403][    C0]  ret_from_fork_asm+0x1a/0x30
>    [   71.957410][    C0]  </TASK>
>    [   71.957412][    C0] Kernel panic - not syncing: softlockup: hung tasks
>    [   71.987152][    C0] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G        W    L     6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa
>    [   71.989063][    C0] Tainted: [W]=WARN, [L]=SOFTLOCKUP
>    [   71.989934][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
>    [   71.991832][    C0] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
>    [   71.992925][    C0] Call Trace:
>    [   71.993601][    C0]  <IRQ>
>    [   71.994202][    C0]  dump_stack_lvl+0x5a/0x80
>    [   71.994923][    C0]  panic+0x10b/0x2da
>    [   71.995582][    C0]  watchdog_timer_fn.cold+0x9a/0xa1
>    [   71.996366][    C0]  ? __pfx_watchdog_timer_fn+0x10/0x10
>    [   71.997148][    C0]  __hrtimer_run_queues+0x132/0x2a0
>    [   71.997913][    C0]  hrtimer_interrupt+0xff/0x230
>    [   71.998630][    C0]  __sysvec_apic_timer_interrupt+0x55/0x100
>    [   71.999435][    C0]  sysvec_apic_timer_interrupt+0x6c/0x90
>    [   72.000208][    C0]  </IRQ>
>    [   72.000732][    C0]  <TASK>
>    [   72.001250][    C0]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
>    [   72.002056][    C0] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs]
>    [   72.002950][    C0] Code: c1 01 48 83 (...)
>    [   72.005191][    C0] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246
>    [   72.006011][    C0] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000
>    [   72.006978][    C0] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad
>    [   72.007939][    C0] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080
>    [   72.008895][    C0] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860
>    [   72.009861][    C0] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000
>    [   72.010825][    C0]  ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f]
>    [   72.011978][    C0]  process_one_work+0x17e/0x330
>    [   72.012663][    C0]  worker_thread+0x2ce/0x3f0
>    [   72.013325][    C0]  ? __pfx_worker_thread+0x10/0x10
>    [   72.014044][    C0]  kthread+0xef/0x220
>    [   72.014647][    C0]  ? __pfx_kthread+0x10/0x10
>    [   72.015302][    C0]  ret_from_fork+0x34/0x50
>    [   72.015939][    C0]  ? __pfx_kthread+0x10/0x10
>    [   72.016592][    C0]  ret_from_fork_asm+0x1a/0x30
>    [   72.017281][    C0]  </TASK>
>    [   72.017931][    C0] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>    [   72.019478][    C0] Rebooting in 90 seconds..
>
> So fix this by making sure we move a block group out of the unused block
> groups discard list when calling __add_to_discard_list().
>
> Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard")
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Looks good to me.

Reviewed-by: Daniel Vacek <neelx@suse.com>

> ---
>  fs/btrfs/discard.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index d6eef4bd9e9d..de23c4b3515e 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -94,8 +94,6 @@ static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>                                   struct btrfs_block_group *block_group)
>  {
>         lockdep_assert_held(&discard_ctl->lock);
> -       if (!btrfs_run_discard_work(discard_ctl))
> -               return;
>
>         if (list_empty(&block_group->discard_list) ||
>             block_group->discard_index == BTRFS_DISCARD_INDEX_UNUSED) {
> @@ -118,6 +116,9 @@ static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl,
>         if (!btrfs_is_block_group_data_only(block_group))
>                 return;
>
> +       if (!btrfs_run_discard_work(discard_ctl))
> +               return;
> +
>         spin_lock(&discard_ctl->lock);
>         __add_to_discard_list(discard_ctl, block_group);
>         spin_unlock(&discard_ctl->lock);
> @@ -244,6 +245,18 @@ static struct btrfs_block_group *peek_discard_list(
>                     block_group->used != 0) {
>                         if (btrfs_is_block_group_data_only(block_group)) {
>                                 __add_to_discard_list(discard_ctl, block_group);
> +                               /*
> +                                * The block group must have been moved to other
> +                                * discard list even if discard was disabled in
> +                                * the meantime or a transaction abort happened,
> +                                * otherwise we can end up in an infinite loop,
> +                                * always jumping into the 'again' label and
> +                                * keep getting this block group over and over
> +                                * in case there are no other block groups in
> +                                * the discard lists.
> +                                */
> +                               ASSERT(block_group->discard_index !=
> +                                      BTRFS_DISCARD_INDEX_UNUSED);
>                         } else {
>                                 list_del_init(&block_group->discard_list);
>                                 btrfs_put_block_group(block_group);
> --
> 2.47.2
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] btrfs: use verbose assert at peek_discard_list()
  2025-05-05 15:49 [PATCH 0/2] btrfs: fix an async discard infinite loop when disabling it fdmanana
  2025-05-05 15:49 ` [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard fdmanana
@ 2025-05-05 15:49 ` fdmanana
  2025-05-05 17:20   ` Boris Burkov
  2025-05-06 16:58   ` Daniel Vacek
  1 sibling, 2 replies; 7+ messages in thread
From: fdmanana @ 2025-05-05 15:49 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We now have a verbose variant of ASSERT() so that we can print the value
of the block group's discard_index. So use it for better problem analysis
in case the assertion is triggered.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/discard.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
index de23c4b3515e..89fe85778115 100644
--- a/fs/btrfs/discard.c
+++ b/fs/btrfs/discard.c
@@ -256,7 +256,9 @@ static struct btrfs_block_group *peek_discard_list(
 				 * the discard lists.
 				 */
 				ASSERT(block_group->discard_index !=
-				       BTRFS_DISCARD_INDEX_UNUSED);
+				       BTRFS_DISCARD_INDEX_UNUSED,
+				       "discard_index=%d",
+				       block_group->discard_index);
 			} else {
 				list_del_init(&block_group->discard_list);
 				btrfs_put_block_group(block_group);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] btrfs: use verbose assert at peek_discard_list()
  2025-05-05 15:49 ` [PATCH 2/2] btrfs: use verbose assert at peek_discard_list() fdmanana
@ 2025-05-05 17:20   ` Boris Burkov
  2025-05-06 16:58   ` Daniel Vacek
  1 sibling, 0 replies; 7+ messages in thread
From: Boris Burkov @ 2025-05-05 17:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, May 05, 2025 at 04:49:56PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We now have a verbose variant of ASSERT() so that we can print the value
> of the block group's discard_index. So use it for better problem analysis
> in case the assertion is triggered.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/discard.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index de23c4b3515e..89fe85778115 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -256,7 +256,9 @@ static struct btrfs_block_group *peek_discard_list(
>  				 * the discard lists.
>  				 */
>  				ASSERT(block_group->discard_index !=
> -				       BTRFS_DISCARD_INDEX_UNUSED);
> +				       BTRFS_DISCARD_INDEX_UNUSED,
> +				       "discard_index=%d",
> +				       block_group->discard_index);
>  			} else {
>  				list_del_init(&block_group->discard_list);
>  				btrfs_put_block_group(block_group);
> -- 
> 2.47.2
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] btrfs: use verbose assert at peek_discard_list()
  2025-05-05 15:49 ` [PATCH 2/2] btrfs: use verbose assert at peek_discard_list() fdmanana
  2025-05-05 17:20   ` Boris Burkov
@ 2025-05-06 16:58   ` Daniel Vacek
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Vacek @ 2025-05-06 16:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Mon, 5 May 2025 at 17:50, <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> We now have a verbose variant of ASSERT() so that we can print the value
> of the block group's discard_index. So use it for better problem analysis
> in case the assertion is triggered.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Daniel Vacek <neelx@suse.com>

> ---
>  fs/btrfs/discard.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> index de23c4b3515e..89fe85778115 100644
> --- a/fs/btrfs/discard.c
> +++ b/fs/btrfs/discard.c
> @@ -256,7 +256,9 @@ static struct btrfs_block_group *peek_discard_list(
>                                  * the discard lists.
>                                  */
>                                 ASSERT(block_group->discard_index !=
> -                                      BTRFS_DISCARD_INDEX_UNUSED);
> +                                      BTRFS_DISCARD_INDEX_UNUSED,
> +                                      "discard_index=%d",
> +                                      block_group->discard_index);
>                         } else {
>                                 list_del_init(&block_group->discard_list);
>                                 btrfs_put_block_group(block_group);
> --
> 2.47.2
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-05-06 16:58 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-05 15:49 [PATCH 0/2] btrfs: fix an async discard infinite loop when disabling it fdmanana
2025-05-05 15:49 ` [PATCH 1/2] btrfs: fix discard worker infinite loop after disabling discard fdmanana
2025-05-05 17:20   ` Boris Burkov
2025-05-06 16:55   ` Daniel Vacek
2025-05-05 15:49 ` [PATCH 2/2] btrfs: use verbose assert at peek_discard_list() fdmanana
2025-05-05 17:20   ` Boris Burkov
2025-05-06 16:58   ` Daniel Vacek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox