From: Casey Chen <cachen@purestorage.com>
To: Mohamed Khalfella <mkhalfella@purestorage.com>
Cc: Keith Busch <kbusch@kernel.org>,
James Smart <james.smart@broadcom.com>,
Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>,
Sagi Grimberg <sagi@grimberg.me>,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
linux-block@vger.kernel.org,
Yuanyuan Zhong <yzhong@purestorage.com>,
Michael Liang <mliang@purestorage.com>,
Randy Jennings <randyj@purestorage.com>
Subject: Re: [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer
Date: Tue, 9 Jun 2026 17:11:44 -0600 [thread overview]
Message-ID: <20260609231145.3712029-1-cachen@purestorage.com> (raw)
In-Reply-To: <20250620183419.GC4836-mkhalfella@purestorage.com>
On 2025-06-20 11:34:19 -0700, Mohamed Khalfella wrote:
> Following up on this patch. I think the issue with blk_sync_queue()
> needs to be addressed. If adding a queue flag is not the preferred way
> to do it, please let me know what do you suggest and I will make the
> code changes.
I'd like to add a second data point in support of this patch. We hit
the same blk_sync_queue() race in nvme-rdma during torture testing of
an nvme-multipath/dm-mpath stacked configuration, with a call pattern
distinct from the nvme-fc case Mohamed reported. The symptom on our
side is request double-completion leading to dm-rq clone double-free,
not the self-deadlock Mohamed observed; but the root cause is the
same q->timeout rearm window inside blk_sync_queue().
Importantly, nvme-rdma's blk_sync_queue() call happens from a
separate workqueue (nvme_rdma_error_recovery_work), NOT from inside
blk_mq_timeout_work. So Keith's earlier suggestion that the consumer
"just not sync the queues within the timeout workqueue context" --
which addresses nvme-fc's self-deadlock -- does not apply to us:
we already are outside that context, and the race still manifests
because blk_sync_queue() itself doesn't fence q->timeout_work
against post-return rearm. The fix has to be in blk_sync_queue().
Setup:
- Linux 6.6.77, ~64 CPUs, ~50 nvme-rdma controllers, nvme-mpath
paths each carrying a dm-mpath clone in flight.
- "Stall an SPM" torture: target-side faults inducing per-controller
nvme_rdma_error_recovery_work in parallel across many ctrls.
Call pattern:
blk_mq_timeout_work (per-queue, kblockd)
... nvme_rdma_timeout(rq) -- ctrl state != NVME_CTRL_LIVE
path A -> nvme_rdma_complete_timed_out(rq)
blk_mq_complete_request(rq) <- forced completion A
nvme_rdma_error_recovery_work (per-ctrl, separate workqueue)
nvme_rdma_teardown_io_queues
nvme_quiesce_io_queues
nvme_sync_io_queues <- calls blk_sync_queue per ns->queue
nvme_rdma_stop_io_queues
nvme_rdma_cancel_tagset
blk_mq_tagset_busy_iter(nvme_cancel_request)
nvme_cancel_request(rq)
blk_mq_complete_request(rq) <- forced completion B
We instrumented the kernel to capture per-request event histories
(stack + cpu + timestamp) at each forced-completion entry point, at
blk_mq_complete_request() entry on the orig, and at
multipath_release_clone(). In an unpatched 6.6 we captured the
following sequence on the same nvme-rdma clone within ~5 ms, which
walks step-by-step from the two racing producers through the
double-completion of the orig to the double-free of the clone.
Records below are verbatim from the kernel log:
rq=ffff889589967d00 tag=10 qid=16 ctrl=nvme33 op=DRV_IN
(1) Timeout (T) and cancel (C) reach forced-completion entry within
~1 us, on different CPUs (note 'age=4751us' vs 'age=4750us'):
by-source[2] source=T age=4751us cpu=8 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(10):
[0] nvme_rdma_timeout+0x58/0x110 [nvme_rdma]
[1] blk_mq_handle_expired+0x5d/0x90
[2] bt_iter+0x7e/0x90
[3] blk_mq_queue_tag_busy_iter+0x2b3/0x590
[4] blk_mq_timeout_work+0x15f/0x1b0
[5] process_one_work+0x133/0x380
[6] worker_thread+0x30d/0x420
[7] kthread+0xe5/0x110
[8] ret_from_fork+0x2d/0x50
[9] ret_from_fork_asm+0x11/0x20
by-source[1] source=C age=4750us cpu=14 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(10):
[0] nvme_cancel_request+0x52/0xb0 [nvme_core]
[1] bt_tags_iter+0x4b/0xa0
[2] blk_mq_tagset_busy_iter+0x186/0x2c0
[3] nvme_rdma_teardown_io_queues.part.0+0x73/0xf0 [nvme_rdma]
[4] nvme_rdma_error_recovery_work+0x41/0xe0 [nvme_rdma]
[5] process_one_work+0x133/0x380
[6] worker_thread+0x30d/0x420
[7] kthread+0xe5/0x110
[8] ret_from_fork+0x2d/0x50
[9] ret_from_fork_asm+0x11/0x20
(The dump CPU is C8; cpu=N inside each record is the producing
CPU. T came from cpu=8 with stack rooted at blk_mq_timeout_work;
C came from cpu=14 with stack rooted at
nvme_rdma_error_recovery_work. Two independent producers, no
re-entry, captured ~1us apart on the same rq.)
(2) Each chain proceeds through dm-rq's clone-end-io to call
blk_mq_complete_request() on the dm-mpath orig that owns this
clone via tio->clone. Our hook at blk_mq_complete_request()
captured *two* invocations on the same orig, one from each
upstream forced-completion chain (both age=4743us):
by-source[6] source=B age=4743us cpu=14 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(14):
[0] nvme_rdma_dbg_blk_complete_hook+0xcd/0x130 [nvme_rdma]
[1] blk_mq_complete_request+0x17/0x70
[2] end_clone_request+0x1f/0x30
[3] __blk_mq_end_request+0x60/0x150
[4] nvme_cancel_request+0x5a/0xb0 [nvme_core]
[5] bt_tags_iter+0x4b/0xa0
[6] blk_mq_tagset_busy_iter+0x186/0x2c0
[7] nvme_rdma_teardown_io_queues.part.0+0x73/0xf0 [nvme_rdma]
[8] nvme_rdma_error_recovery_work+0x41/0xe0 [nvme_rdma]
[9] process_one_work+0x133/0x380
[10] worker_thread+0x30d/0x420
[11] kthread+0xe5/0x110
[12] ret_from_fork+0x2d/0x50
[13] ret_from_fork_asm+0x11/0x20
ring[4] source=B age=4743us cpu=8 op=DRV_IN nr_bytes=32768 sts=0 ctx=ffff889589967d00 stack(14):
[0] nvme_rdma_dbg_blk_complete_hook+0xcd/0x130 [nvme_rdma]
[1] blk_mq_complete_request+0x17/0x70
[2] end_clone_request+0x1f/0x30
[3] __blk_mq_end_request+0x60/0x150
[4] nvme_rdma_timeout+0xff/0x110 [nvme_rdma]
[5] blk_mq_handle_expired+0x5d/0x90
[6] bt_iter+0x7e/0x90
[7] blk_mq_queue_tag_busy_iter+0x2b3/0x590
[8] blk_mq_timeout_work+0x15f/0x1b0
(frames [9..13] truncated in our log dump by interleaved console
output during the dump; the lower frames are the standard
process_one_work/worker_thread/kthread tail.)
Two B records on the same orig at the same age=4743us, with two
*different* upstream chains (frame [4] is nvme_cancel_request in
the first, nvme_rdma_timeout in the second), is the direct
observation that the orig has been double-completed.
(3) Each B schedules a per-cpu BLOCK_SOFTIRQ for the orig on its
mq_ctx->cpu. The first softirq drain runs dm_softirq_done ->
dm_done -> dm_requeue_original_request -> release_clone_rq,
which releases this clone via multipath_release_clone ->
blk_mq_free_request. req_ref_put_and_test sees refcount=1,
dec to 0, free runs; "R" source captures this (age=4736us,
7us after the two B's):
by-source[5] source=R age=4736us cpu=8 op=DRV_IN nr_bytes=0 sts=0 ctx=ffff889589967d00 stack(10):
[0] nvme_rdma_dbg_release_clone_hook+0x3f/0x120 [nvme_rdma]
[1] multipath_release_clone+0x1f/0x60
[2] dm_requeue_original_request+0x61/0xd0
[3] blk_complete_reqs+0x3a/0x50
[4] handle_softirqs+0x1ee/0x4a0
[5] run_ksoftirqd+0x3c/0x70
[6] smpboot_thread_fn+0xdf/0x1a0
[7] kthread+0xe5/0x110
[8] ret_from_fork+0x2d/0x50
[9] ret_from_fork_asm+0x11/0x20
(4) The second softirq drain on the same orig (queued by the other
chain's blk_mq_complete_request) runs the same path. tio->clone
still points to the just-freed clone (dm-rq does not clear it),
so dm_done -> release_clone_rq -> blk_mq_free_request is called
again. This is the double-free: refcount is already 0 entering
req_ref_put_and_test, which trips the WARN_ON_ONCE:
[ 7122.320733] ------------[ cut here ]------------
[ 7122.320742] WARNING: CPU: 3 PID: 77382 at block/blk.h:508 \
blk_mq_free_request+0x64/0x90
[ 7122.320811] CPU: 3 PID: 77382 Comm: foed
[ 7122.320821] RIP: 0010:blk_mq_free_request+0x64/0x90
[ 7122.320942] Call Trace:
[ 7122.320952] ? __warn+0xa3/0x1d0
[ 7122.320956] ? blk_mq_free_request+0x64/0x90
[ 7122.320962] ? report_bug+0x200/0x230
[ 7122.320966] ? handle_bug+0x57/0x90
[ 7122.320984] dm_requeue_original_request+0x61/0xd0
[ 7122.320989] blk_complete_reqs+0x3a/0x50
[ 7122.320993] handle_softirqs+0x1ee/0x4a0
[ 7122.321003] __irq_exit_rcu+0x79/0xa0
[ 7122.321007] irq_exit_rcu+0xa/0x20
[ 7122.321011] common_interrupt+0x43/0xa0
[ 7122.321014] asm_common_interrupt+0x22/0x40
[ 7122.321050] ---[ end trace 0000000000000000 ]---
The premise for the race is that nvme_rdma_teardown_io_queues calls
nvme_quiesce_io_queues -> nvme_sync_io_queues BEFORE
nvme_rdma_cancel_tagset, so after sync returns we expect no
blk_mq_timeout_work can fire concurrent with cancel_tagset. With the
unpatched blk_sync_queue(), that expectation breaks:
nvme_rdma_timeout() returning BLK_EH_RESET_TIMER causes a per-request
blk_add_timer() to re-arm q->timeout from inside the timeout work
that blk_sync_queue is in the middle of draining. After sync's
del_timer_sync (which ran first in the old ordering) and
cancel_work_sync return, q->timeout is once again armed. It fires
shortly afterwards, queueing a new blk_mq_timeout_work that races
nvme_rdma_cancel_tagset on the same rqs.
The race also corrupts wire ordering, since the clone is released
by dm-rq while its SEND/RECV WRs are still outstanding (the forced
completions bypassed end_request's refcount drain gate). A late
recv CQE then lands on the freed slot, dereferences NULL mq_hctx
inside blk_mq_complete_request_remote, and panics the kernel:
[ 3079.146243] BUG: kernel NULL pointer dereference, address: 0000000000000136
[ 3079.150802] #PF: supervisor read access in kernel mode
[ 3079.150811] PGD 0 P4D 0
[ 3079.150813] Oops: 0000 [#1] SMP
[ 3079.150819] CPU: 1 PID: 10069 Comm: foed
[ 3079.150834] RIP: 0010:blk_mq_complete_request_remote+0x15/0x130
[ 3079.155506] CR2: 0000000000000136
[ 3079.155516] Call Trace:
[ 3079.166933] ? blk_mq_complete_request_remote+0x15/0x130
[ 3079.166953] ? nvme_should_fail+0x3e/0x90 [nvme_core]
[ 3079.166958] nvme_rdma_end_request+0x55/0x80 [nvme_rdma]
[ 3079.166962] nvme_rdma_recv_done+0x1a4/0x5e0 [nvme_rdma]
[ 3079.185939] __ib_process_cq+0x79/0x150 [ib_core]
[ 3079.193425] ib_poll_handler+0x2c/0xb0 [ib_core]
(CR2=0x136 = offsetof(struct blk_mq_hw_ctx, nr_ctx); the deref is
rq->mq_hctx->nr_ctx with rq->mq_hctx already nulled by
__blk_mq_free_request from the racy double release above.)
We backported your patch to 6.6 (QUEUE_FLAG_NOTIMEOUT is still in
the #define-with-slot-number style there; allocated slot 2). With
the patch applied, both upstream signals went silent across
multiple-hour torture runs:
- WARN_ON_ONCE at block/blk.h:508 (double-free): 0 occurrences.
- NULL deref panic at blk_mq_complete_request_remote+0x15 (stale
CQE on freed slot): 0 occurrences.
In response to Keith's earlier comment about this being an nvme-fc
problem:
- The deadlock you described (sync called from within timeout work)
is specific to nvme-fc's call pattern, agreed.
- The race in blk_sync_queue() is not. nvme-rdma does not sync from
within timeout work; sync is called from nvme_rdma_error_recovery_work
(a separate workqueue) under the assumption that after sync returns
no further timeout work for that queue can run. That assumption is
what blk_sync_queue's name and existing callers imply, and what
Mohamed's patch makes true.
- Any driver returning BLK_EH_RESET_TIMER from .timeout and having a
teardown path that calls blk_sync_queue is subject to the same
race. nvme-fc and nvme-rdma both qualify; we expect nvme-tcp does
too (haven't reproduced there yet).
A driver-local fix for each consumer doesn't address the contract:
blk_sync_queue() should provide what its name says it provides.
Tested-by: Casey Chen <cachen@purestorage.com>
Casey
prev parent reply other threads:[~2026-06-09 23:12 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-29 21:49 [PATCH] block: Fix blk_sync_queue() to properly stop timeout timer Mohamed Khalfella
2025-05-29 22:13 ` Keith Busch
2025-05-29 22:33 ` Mohamed Khalfella
2025-06-20 18:34 ` Mohamed Khalfella
2026-06-09 23:11 ` Casey Chen [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260609231145.3712029-1-cachen@purestorage.com \
--to=cachen@purestorage.com \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=james.smart@broadcom.com \
--cc=kbusch@kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=mkhalfella@purestorage.com \
--cc=mliang@purestorage.com \
--cc=randyj@purestorage.com \
--cc=sagi@grimberg.me \
--cc=yzhong@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox