[PATCH] block: Fix blk_sync_queue() to properly stop timeout timer

Linux block layer
 help / color / mirror / Atom feed

* [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
@ 2025-05-29 21:49 Mohamed Khalfella
  2025-05-29 22:13 ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Mohamed Khalfella @ 2025-05-29 21:49 UTC (permalink / raw)
  To: James Smart, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg
  Cc: linux-nvme, linux-kernel, linux-block, Mohamed Khalfella,
	Yuanyuan Zhong, Michael Liang, Randy Jennings

[ 5084.255110] INFO: task kworker/42:1H:914 blocked for more than 917 seconds.
[ 5084.255563]       Not tainted 5.14.0-503.22.1mk.el9.x86_64 #6
[ 5084.255966] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5084.256421] task:kworker/42:1H   state:D stack:0     pid:914   tgid:914   ppid:2      flags:0x00004000
[ 5084.256794] Workqueue: kblockd blk_mq_timeout_work
[ 5084.257200] Call Trace:
[ 5084.257557]  <TASK>
[ 5084.257909]  __schedule+0x229/0x550
[ 5084.258322]  schedule+0x2e/0xd0
[ 5084.258665]  schedule_timeout+0x11f/0x160
[ 5084.259003]  __wait_for_common+0x90/0x1d0
[ 5084.259414]  ? __pfx_schedule_timeout+0x10/0x10
[ 5084.259740]  __flush_work.isra.0+0x160/0x230
[ 5084.260072]  ? __pfx_wq_barrier_func+0x10/0x10
[ 5084.260390]  __cancel_work_sync+0x104/0x1a0
[ 5084.260701]  ? __timer_delete_sync+0x2c/0x40
[ 5084.261008]  nvme_sync_io_queues+0x53/0xa0 [nvme_core]
[ 5084.261399]  __nvme_fc_abort_outstanding_ios+0x1b8/0x250 [nvme_fc]
[ 5084.261700]  nvme_fc_error_recovery+0x2d/0x50 [nvme_fc]
[ 5084.261997]  nvme_fc_timeout.cold+0x12/0x24 [nvme_fc]
[ 5084.262353]  blk_mq_handle_expired+0x7e/0x160
[ 5084.262637]  bt_iter+0x8b/0xa0
[ 5084.262912]  blk_mq_queue_tag_busy_iter+0x2b8/0x590
[ 5084.263224]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[ 5084.263490]  ? __pfx_blk_mq_handle_expired+0x10/0x10
[ 5084.263748]  ? __call_rcu_common.constprop.0+0x210/0x2b0
[ 5084.264002]  blk_mq_timeout_work+0x162/0x1b0
[ 5084.264307]  process_one_work+0x194/0x380
[ 5084.264550]  worker_thread+0x2fe/0x410
[ 5084.264788]  ? __pfx_worker_thread+0x10/0x10
[ 5084.265019]  kthread+0xdd/0x100
[ 5084.265306]  ? __pfx_kthread+0x10/0x10
[ 5084.265527]  ret_from_fork+0x29/0x50
[ 5084.265741]  </TASK>

nvme-fc initiator hit hung_task with stacktrace above while handling
request timeout call. The work thread is waiting for itself to finish
which is never going to happen. From the stacktrace the nvme controller
was in NVME_CTRL_CONNECTING state when nvme_fc_timeout() was called.
We do not expect to get IO timeout call in NVME_CTRL_CONNECTING state
because blk_sync_queue() must have been called on this queue before
switching from NVME_CTRL_RESETTING to NVME_CTRL_CONNECTING.

It turned out that blk_sync_queue() did not stop q->timeout_work from
running as expected. nvme_fc_timeout() returned BLK_EH_RESET_TIMER
causing q->timeout to be rearmed after it was canceled earlier.
q->timeout queued q->timeout_work after the controller switched to
NVME_CTRL_CONNECTING state causing deadlock above.

Add QUEUE_FLAG_NOTIMEOUT queue flag to tell q->timeout not to queue
q->timeout_work while queue is being synced. Update blk_sync_queue() to
cancel q->timeout_work first and then cancel q->timeout.

Fixes: 287922eb0b18 ("block: defer timeouts to a workqueue")
Fixes: 4e9b6f20828a ("block: Fix a race between blk_cleanup_queue() and timeout handling")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Reviewed-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
---
 block/blk-core.c       | 10 ++++++++--
 block/blk-mq-debugfs.c |  1 +
 include/linux/blkdev.h |  2 ++
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f2..8b70c0202f07 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -219,8 +219,11 @@ EXPORT_SYMBOL_GPL(blk_status_to_str);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-	timer_delete_sync(&q->timeout);
+	blk_queue_flag_set(QUEUE_FLAG_NOTIMEOUT, q);
+	synchronize_rcu();
 	cancel_work_sync(&q->timeout_work);
+	timer_delete_sync(&q->timeout);
+	blk_queue_flag_clear(QUEUE_FLAG_NOTIMEOUT, q);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
@@ -383,7 +386,10 @@ static void blk_rq_timed_out_timer(struct timer_list *t)
 {
 	struct request_queue *q = from_timer(q, t, timeout);
 
-	kblockd_schedule_work(&q->timeout_work);
+	rcu_read_lock();
+	if (!blk_queue_notimeout(q))
+		kblockd_schedule_work(&q->timeout_work);
+	rcu_read_unlock();
 }
 
 static void blk_timeout_work(struct work_struct *work)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 29b3540dd180..a98ff6fbf75d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -81,6 +81,7 @@ static int queue_pm_only_show(void *data, struct seq_file *m)
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(DYING),
+	QUEUE_FLAG_NAME(NOTIMEOUT),
 	QUEUE_FLAG_NAME(NOMERGES),
 	QUEUE_FLAG_NAME(SAME_COMP),
 	QUEUE_FLAG_NAME(FAIL_IO),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 332b56f323d9..c0e6a18f5325 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -633,6 +633,7 @@ struct request_queue {
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 enum {
 	QUEUE_FLAG_DYING,		/* queue being torn down */
+	QUEUE_FLAG_NOTIMEOUT,		/* do not schedule timeout work */
 	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
 	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
 	QUEUE_FLAG_FAIL_IO,		/* fake timeout */
@@ -657,6 +658,7 @@ void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 
 #define blk_queue_dying(q)	test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags)
 #define blk_queue_init_done(q)	test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags)
+#define blk_queue_notimeout(q)	test_bit(QUEUE_FLAG_NOTIMEOUT, &(q)->queue_flags)
 #define blk_queue_nomerges(q)	test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags)
 #define blk_queue_noxmerges(q)	\
 	test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
  2025-05-29 21:49 [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer Mohamed Khalfella
@ 2025-05-29 22:13 ` Keith Busch
  2025-05-29 22:33   ` Mohamed Khalfella
  0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2025-05-29 22:13 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: James Smart, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, linux-kernel, linux-block, Yuanyuan Zhong,
	Michael Liang, Randy Jennings

On Thu, May 29, 2025 at 03:49:28PM -0600, Mohamed Khalfella wrote:
> nvme-fc initiator hit hung_task with stacktrace above while handling
> request timeout call. The work thread is waiting for itself to finish
> which is never going to happen. From the stacktrace the nvme controller
> was in NVME_CTRL_CONNECTING state when nvme_fc_timeout() was called.
> We do not expect to get IO timeout call in NVME_CTRL_CONNECTING state
> because blk_sync_queue() must have been called on this queue before
> switching from NVME_CTRL_RESETTING to NVME_CTRL_CONNECTING.
> 
> It turned out that blk_sync_queue() did not stop q->timeout_work from
> running as expected. nvme_fc_timeout() returned BLK_EH_RESET_TIMER
> causing q->timeout to be rearmed after it was canceled earlier.
> q->timeout queued q->timeout_work after the controller switched to
> NVME_CTRL_CONNECTING state causing deadlock above.
> 
> Add QUEUE_FLAG_NOTIMEOUT queue flag to tell q->timeout not to queue
> q->timeout_work while queue is being synced. Update blk_sync_queue() to
> cancel q->timeout_work first and then cancel q->timeout.

I feel like this is a nvme-fc problem that doesn't need the block layer
to handle. Just don't sync the queues within the timeout workqueue
context.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
  2025-05-29 22:13 ` Keith Busch
@ 2025-05-29 22:33   ` Mohamed Khalfella
  2025-06-20 18:34     ` Mohamed Khalfella
  0 siblings, 1 reply; 5+ messages in thread
From: Mohamed Khalfella @ 2025-05-29 22:33 UTC (permalink / raw)
  To: Keith Busch
  Cc: James Smart, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, linux-kernel, linux-block, Yuanyuan Zhong,
	Michael Liang, Randy Jennings

On 2025-05-29 16:13:23 -0600, Keith Busch wrote:
> On Thu, May 29, 2025 at 03:49:28PM -0600, Mohamed Khalfella wrote:
> > nvme-fc initiator hit hung_task with stacktrace above while handling
> > request timeout call. The work thread is waiting for itself to finish
> > which is never going to happen. From the stacktrace the nvme controller
> > was in NVME_CTRL_CONNECTING state when nvme_fc_timeout() was called.
> > We do not expect to get IO timeout call in NVME_CTRL_CONNECTING state
> > because blk_sync_queue() must have been called on this queue before
> > switching from NVME_CTRL_RESETTING to NVME_CTRL_CONNECTING.
> > 
> > It turned out that blk_sync_queue() did not stop q->timeout_work from
> > running as expected. nvme_fc_timeout() returned BLK_EH_RESET_TIMER
> > causing q->timeout to be rearmed after it was canceled earlier.
> > q->timeout queued q->timeout_work after the controller switched to
> > NVME_CTRL_CONNECTING state causing deadlock above.
> > 
> > Add QUEUE_FLAG_NOTIMEOUT queue flag to tell q->timeout not to queue
> > q->timeout_work while queue is being synced. Update blk_sync_queue() to
> > cancel q->timeout_work first and then cancel q->timeout.
> 
> I feel like this is a nvme-fc problem that doesn't need the block layer
> to handle. Just don't sync the queues within the timeout workqueue
> context.

Agreed on nvme-fc should not sync queues within timeout work, and I am
testing a patch to fix nvme-fc. At the same time blk_sync_queue() should
provide a guarantee that q->timeout_work will not run after the function
returns, no?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
  2025-05-29 22:33   ` Mohamed Khalfella
@ 2025-06-20 18:34     ` Mohamed Khalfella
  2026-06-09 23:11       ` Casey Chen
  0 siblings, 1 reply; 5+ messages in thread
From: Mohamed Khalfella @ 2025-06-20 18:34 UTC (permalink / raw)
  To: Keith Busch, James Smart, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg
  Cc: linux-nvme, linux-kernel, linux-block, Yuanyuan Zhong,
	Michael Liang, Randy Jennings

On 2025-05-29 15:33:47 -0700, Mohamed Khalfella wrote:
> On 2025-05-29 16:13:23 -0600, Keith Busch wrote:
> > On Thu, May 29, 2025 at 03:49:28PM -0600, Mohamed Khalfella wrote:
> > > nvme-fc initiator hit hung_task with stacktrace above while handling
> > > request timeout call. The work thread is waiting for itself to finish
> > > which is never going to happen. From the stacktrace the nvme controller
> > > was in NVME_CTRL_CONNECTING state when nvme_fc_timeout() was called.
> > > We do not expect to get IO timeout call in NVME_CTRL_CONNECTING state
> > > because blk_sync_queue() must have been called on this queue before
> > > switching from NVME_CTRL_RESETTING to NVME_CTRL_CONNECTING.
> > > 
> > > It turned out that blk_sync_queue() did not stop q->timeout_work from
> > > running as expected. nvme_fc_timeout() returned BLK_EH_RESET_TIMER
> > > causing q->timeout to be rearmed after it was canceled earlier.
> > > q->timeout queued q->timeout_work after the controller switched to
> > > NVME_CTRL_CONNECTING state causing deadlock above.
> > > 
> > > Add QUEUE_FLAG_NOTIMEOUT queue flag to tell q->timeout not to queue
> > > q->timeout_work while queue is being synced. Update blk_sync_queue() to
> > > cancel q->timeout_work first and then cancel q->timeout.
> > 
> > I feel like this is a nvme-fc problem that doesn't need the block layer
> > to handle. Just don't sync the queues within the timeout workqueue
> > context.
> 
> Agreed on nvme-fc should not sync queues within timeout work, and I am
> testing a patch to fix nvme-fc. At the same time blk_sync_queue() should
> provide a guarantee that q->timeout_work will not run after the function
> returns, no?

Following up on this patch. I think the issue with blk_sync_queue()
needs to be addressed. If adding a queue flag is not the preferred way
to do it, please let me know what do you suggest and I will make the
code changes.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
  2025-06-20 18:34     ` Mohamed Khalfella
@ 2026-06-09 23:11       ` Casey Chen
  0 siblings, 0 replies; 5+ messages in thread
From: Casey Chen @ 2026-06-09 23:11 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Keith Busch, James Smart, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, linux-nvme, linux-kernel, linux-block,
	Yuanyuan Zhong, Michael Liang, Randy Jennings

On 2025-06-20 11:34:19 -0700, Mohamed Khalfella wrote:
> Following up on this patch. I think the issue with blk_sync_queue()
> needs to be addressed. If adding a queue flag is not the preferred way
> to do it, please let me know what do you suggest and I will make the
> code changes.

I'd like to add a second data point in support of this patch. We hit
the same blk_sync_queue() race in nvme-rdma during torture testing of
an nvme-multipath/dm-mpath stacked configuration, with a call pattern
distinct from the nvme-fc case Mohamed reported. The symptom on our
side is request double-completion leading to dm-rq clone double-free,
not the self-deadlock Mohamed observed; but the root cause is the
same q->timeout rearm window inside blk_sync_queue().

Importantly, nvme-rdma's blk_sync_queue() call happens from a
separate workqueue (nvme_rdma_error_recovery_work), NOT from inside
blk_mq_timeout_work. So Keith's earlier suggestion that the consumer
"just not sync the queues within the timeout workqueue context" --
which addresses nvme-fc's self-deadlock -- does not apply to us:
we already are outside that context, and the race still manifests
because blk_sync_queue() itself doesn't fence q->timeout_work
against post-return rearm. The fix has to be in blk_sync_queue().

Setup:
  - Linux 6.6.77, ~64 CPUs, ~50 nvme-rdma controllers, nvme-mpath
    paths each carrying a dm-mpath clone in flight.
  - "Stall an SPM" torture: target-side faults inducing per-controller
    nvme_rdma_error_recovery_work in parallel across many ctrls.

Call pattern:

  blk_mq_timeout_work       (per-queue, kblockd)
    ... nvme_rdma_timeout(rq) -- ctrl state != NVME_CTRL_LIVE
        path A -> nvme_rdma_complete_timed_out(rq)
            blk_mq_complete_request(rq)        <- forced completion A

  nvme_rdma_error_recovery_work (per-ctrl, separate workqueue)
    nvme_rdma_teardown_io_queues
      nvme_quiesce_io_queues
      nvme_sync_io_queues       <- calls blk_sync_queue per ns->queue
      nvme_rdma_stop_io_queues
      nvme_rdma_cancel_tagset
        blk_mq_tagset_busy_iter(nvme_cancel_request)
          nvme_cancel_request(rq)
            blk_mq_complete_request(rq)        <- forced completion B

We instrumented the kernel to capture per-request event histories
(stack + cpu + timestamp) at each forced-completion entry point, at
blk_mq_complete_request() entry on the orig, and at
multipath_release_clone(). In an unpatched 6.6 we captured the
following sequence on the same nvme-rdma clone within ~5 ms, which
walks step-by-step from the two racing producers through the
double-completion of the orig to the double-free of the clone.
Records below are verbatim from the kernel log:

  rq=ffff889589967d00 tag=10 qid=16 ctrl=nvme33 op=DRV_IN

(1) Timeout (T) and cancel (C) reach forced-completion entry within
    ~1 us, on different CPUs (note 'age=4751us' vs 'age=4750us'):

     by-source[2] source=T age=4751us cpu=8 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(10):
       [0] nvme_rdma_timeout+0x58/0x110 [nvme_rdma]
       [1] blk_mq_handle_expired+0x5d/0x90
       [2] bt_iter+0x7e/0x90
       [3] blk_mq_queue_tag_busy_iter+0x2b3/0x590
       [4] blk_mq_timeout_work+0x15f/0x1b0
       [5] process_one_work+0x133/0x380
       [6] worker_thread+0x30d/0x420
       [7] kthread+0xe5/0x110
       [8] ret_from_fork+0x2d/0x50
       [9] ret_from_fork_asm+0x11/0x20

     by-source[1] source=C age=4750us cpu=14 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(10):
       [0] nvme_cancel_request+0x52/0xb0 [nvme_core]
       [1] bt_tags_iter+0x4b/0xa0
       [2] blk_mq_tagset_busy_iter+0x186/0x2c0
       [3] nvme_rdma_teardown_io_queues.part.0+0x73/0xf0 [nvme_rdma]
       [4] nvme_rdma_error_recovery_work+0x41/0xe0 [nvme_rdma]
       [5] process_one_work+0x133/0x380
       [6] worker_thread+0x30d/0x420
       [7] kthread+0xe5/0x110
       [8] ret_from_fork+0x2d/0x50
       [9] ret_from_fork_asm+0x11/0x20

  (The dump CPU is C8; cpu=N inside each record is the producing
  CPU. T came from cpu=8 with stack rooted at blk_mq_timeout_work;
  C came from cpu=14 with stack rooted at
  nvme_rdma_error_recovery_work. Two independent producers, no
  re-entry, captured ~1us apart on the same rq.)

(2) Each chain proceeds through dm-rq's clone-end-io to call
    blk_mq_complete_request() on the dm-mpath orig that owns this
    clone via tio->clone. Our hook at blk_mq_complete_request()
    captured *two* invocations on the same orig, one from each
    upstream forced-completion chain (both age=4743us):

     by-source[6] source=B age=4743us cpu=14 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(14):
       [0] nvme_rdma_dbg_blk_complete_hook+0xcd/0x130 [nvme_rdma]
       [1] blk_mq_complete_request+0x17/0x70
       [2] end_clone_request+0x1f/0x30
       [3] __blk_mq_end_request+0x60/0x150
       [4] nvme_cancel_request+0x5a/0xb0 [nvme_core]
       [5] bt_tags_iter+0x4b/0xa0
       [6] blk_mq_tagset_busy_iter+0x186/0x2c0
       [7] nvme_rdma_teardown_io_queues.part.0+0x73/0xf0 [nvme_rdma]
       [8] nvme_rdma_error_recovery_work+0x41/0xe0 [nvme_rdma]
       [9] process_one_work+0x133/0x380
       [10] worker_thread+0x30d/0x420
       [11] kthread+0xe5/0x110
       [12] ret_from_fork+0x2d/0x50
       [13] ret_from_fork_asm+0x11/0x20

     ring[4] source=B age=4743us cpu=8 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(14):
       [0] nvme_rdma_dbg_blk_complete_hook+0xcd/0x130 [nvme_rdma]
       [1] blk_mq_complete_request+0x17/0x70
       [2] end_clone_request+0x1f/0x30
       [3] __blk_mq_end_request+0x60/0x150
       [4] nvme_rdma_timeout+0xff/0x110 [nvme_rdma]
       [5] blk_mq_handle_expired+0x5d/0x90
       [6] bt_iter+0x7e/0x90
       [7] blk_mq_queue_tag_busy_iter+0x2b3/0x590
       [8] blk_mq_timeout_work+0x15f/0x1b0
  (frames [9..13] truncated in our log dump by interleaved console
   output during the dump; the lower frames are the standard
   process_one_work/worker_thread/kthread tail.)

  Two B records on the same orig at the same age=4743us, with two
  *different* upstream chains (frame [4] is nvme_cancel_request in
  the first, nvme_rdma_timeout in the second), is the direct
  observation that the orig has been double-completed.

(3) Each B schedules a per-cpu BLOCK_SOFTIRQ for the orig on its
    mq_ctx->cpu. The first softirq drain runs dm_softirq_done ->
    dm_done -> dm_requeue_original_request -> release_clone_rq,
    which releases this clone via multipath_release_clone ->
    blk_mq_free_request. req_ref_put_and_test sees refcount=1,
    dec to 0, free runs; "R" source captures this (age=4736us,
    7us after the two B's):

     by-source[5] source=R age=4736us cpu=8 op=DRV_IN nr_bytes=0 sts=0 ctx=ffff889589967d00 stack(10):
       [0] nvme_rdma_dbg_release_clone_hook+0x3f/0x120 [nvme_rdma]
       [1] multipath_release_clone+0x1f/0x60
       [2] dm_requeue_original_request+0x61/0xd0
       [3] blk_complete_reqs+0x3a/0x50
       [4] handle_softirqs+0x1ee/0x4a0
       [5] run_ksoftirqd+0x3c/0x70
       [6] smpboot_thread_fn+0xdf/0x1a0
       [7] kthread+0xe5/0x110
       [8] ret_from_fork+0x2d/0x50
       [9] ret_from_fork_asm+0x11/0x20

(4) The second softirq drain on the same orig (queued by the other
    chain's blk_mq_complete_request) runs the same path. tio->clone
    still points to the just-freed clone (dm-rq does not clear it),
    so dm_done -> release_clone_rq -> blk_mq_free_request is called
    again. This is the double-free: refcount is already 0 entering
    req_ref_put_and_test, which trips the WARN_ON_ONCE:

  [ 7122.320733] ------------[ cut here ]------------
  [ 7122.320742] WARNING: CPU: 3 PID: 77382 at block/blk.h:508 \
                 blk_mq_free_request+0x64/0x90
  [ 7122.320811] CPU: 3 PID: 77382 Comm: foed
  [ 7122.320821] RIP: 0010:blk_mq_free_request+0x64/0x90
  [ 7122.320942] Call Trace:
  [ 7122.320952]  ? __warn+0xa3/0x1d0
  [ 7122.320956]  ? blk_mq_free_request+0x64/0x90
  [ 7122.320962]  ? report_bug+0x200/0x230
  [ 7122.320966]  ? handle_bug+0x57/0x90
  [ 7122.320984]  dm_requeue_original_request+0x61/0xd0
  [ 7122.320989]  blk_complete_reqs+0x3a/0x50
  [ 7122.320993]  handle_softirqs+0x1ee/0x4a0
  [ 7122.321003]  __irq_exit_rcu+0x79/0xa0
  [ 7122.321007]  irq_exit_rcu+0xa/0x20
  [ 7122.321011]  common_interrupt+0x43/0xa0
  [ 7122.321014]  asm_common_interrupt+0x22/0x40
  [ 7122.321050] ---[ end trace 0000000000000000 ]---

The premise for the race is that nvme_rdma_teardown_io_queues calls
nvme_quiesce_io_queues -> nvme_sync_io_queues BEFORE
nvme_rdma_cancel_tagset, so after sync returns we expect no
blk_mq_timeout_work can fire concurrent with cancel_tagset. With the
unpatched blk_sync_queue(), that expectation breaks:
nvme_rdma_timeout() returning BLK_EH_RESET_TIMER causes a per-request
blk_add_timer() to re-arm q->timeout from inside the timeout work
that blk_sync_queue is in the middle of draining. After sync's
del_timer_sync (which ran first in the old ordering) and
cancel_work_sync return, q->timeout is once again armed. It fires
shortly afterwards, queueing a new blk_mq_timeout_work that races
nvme_rdma_cancel_tagset on the same rqs.

The race also corrupts wire ordering, since the clone is released
by dm-rq while its SEND/RECV WRs are still outstanding (the forced
completions bypassed end_request's refcount drain gate). A late
recv CQE then lands on the freed slot, dereferences NULL mq_hctx
inside blk_mq_complete_request_remote, and panics the kernel:

  [ 3079.146243] BUG: kernel NULL pointer dereference, address: 0000000000000136
  [ 3079.150802] #PF: supervisor read access in kernel mode
  [ 3079.150811] PGD 0 P4D 0
  [ 3079.150813] Oops: 0000 [#1] SMP
  [ 3079.150819] CPU: 1 PID: 10069 Comm: foed
  [ 3079.150834] RIP: 0010:blk_mq_complete_request_remote+0x15/0x130
  [ 3079.155506] CR2: 0000000000000136
  [ 3079.155516] Call Trace:
  [ 3079.166933]  ? blk_mq_complete_request_remote+0x15/0x130
  [ 3079.166953]  ? nvme_should_fail+0x3e/0x90 [nvme_core]
  [ 3079.166958]  nvme_rdma_end_request+0x55/0x80 [nvme_rdma]
  [ 3079.166962]  nvme_rdma_recv_done+0x1a4/0x5e0 [nvme_rdma]
  [ 3079.185939]  __ib_process_cq+0x79/0x150 [ib_core]
  [ 3079.193425]  ib_poll_handler+0x2c/0xb0 [ib_core]

(CR2=0x136 = offsetof(struct blk_mq_hw_ctx, nr_ctx); the deref is
rq->mq_hctx->nr_ctx with rq->mq_hctx already nulled by
__blk_mq_free_request from the racy double release above.)

We backported your patch to 6.6 (QUEUE_FLAG_NOTIMEOUT is still in
the #define-with-slot-number style there; allocated slot 2). With
the patch applied, both upstream signals went silent across
multiple-hour torture runs:

  - WARN_ON_ONCE at block/blk.h:508 (double-free): 0 occurrences.
  - NULL deref panic at blk_mq_complete_request_remote+0x15 (stale
    CQE on freed slot): 0 occurrences.

In response to Keith's earlier comment about this being an nvme-fc
problem:

  - The deadlock you described (sync called from within timeout work)
    is specific to nvme-fc's call pattern, agreed.

  - The race in blk_sync_queue() is not. nvme-rdma does not sync from
    within timeout work; sync is called from nvme_rdma_error_recovery_work
    (a separate workqueue) under the assumption that after sync returns
    no further timeout work for that queue can run. That assumption is
    what blk_sync_queue's name and existing callers imply, and what
    Mohamed's patch makes true.

  - Any driver returning BLK_EH_RESET_TIMER from .timeout and having a
    teardown path that calls blk_sync_queue is subject to the same
    race. nvme-fc and nvme-rdma both qualify; we expect nvme-tcp does
    too (haven't reproduced there yet).

A driver-local fix for each consumer doesn't address the contract:
blk_sync_queue() should provide what its name says it provides.

Tested-by: Casey Chen <cachen@purestorage.com>

Casey

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-09 23:12 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-29 21:49 [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer Mohamed Khalfella
2025-05-29 22:13 ` Keith Busch
2025-05-29 22:33   ` Mohamed Khalfella
2025-06-20 18:34     ` Mohamed Khalfella
2026-06-09 23:11       ` Casey Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox