* [PATCH V2 1/2] block: move bio queue-transition flag fixups into blk_steal_bios()
2026-02-26 3:12 [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
@ 2026-02-26 3:12 ` Chaitanya Kulkarni
2026-02-26 15:32 ` Christoph Hellwig
2026-02-26 3:12 ` [PATCH V2 2/2] block: clear BIO_QOS flags in blk_steal_bios() Chaitanya Kulkarni
` (2 subsequent siblings)
3 siblings, 1 reply; 7+ messages in thread
From: Chaitanya Kulkarni @ 2026-02-26 3:12 UTC (permalink / raw)
To: kbusch, hch, sagi, wagi; +Cc: linux-block, linux-nvme, Chaitanya Kulkarni
blk_steal_bios() transfers bios from a request to a bio_list when the
request is requeued to a different queue. The NVMe multipath failover
path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
Move these fixups into blk_steal_bios() itself so that any caller
automatically gets correct flag state when bios cross queue boundaries.
Simplify nvme_failover_req() accordingly.
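
For readers without the kernel tree at hand, the idea can be sketched as a small userspace model. This is an illustration only, not the kernel code: the sketch_* types, the SKETCH_* flag values, and the tail handling are assumptions made up for this example. It shows one helper that both fixes up per-bio flags and splices the request's bio chain onto the destination list, so callers no longer need their own fixup loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SKETCH_REQ_POLLED  (1u << 0)
#define SKETCH_REQ_NOWAIT  (1u << 1)
#define SKETCH_QC_T_NONE   UINT32_MAX

/* Simplified stand-ins for struct bio, struct bio_list and struct request. */
struct sketch_bio {
	struct sketch_bio *next;
	unsigned int opf;
	uint32_t cookie;
};

struct sketch_bio_list {
	struct sketch_bio *head, *tail;
};

struct sketch_request {
	struct sketch_bio *bio, *biotail;
};

/* Fix up queue-specific flags, then move all bios from rq onto list. */
void sketch_steal_bios(struct sketch_bio_list *list, struct sketch_request *rq)
{
	struct sketch_bio *bio;

	for (bio = rq->bio; bio; bio = bio->next) {
		/* A polling cookie is only meaningful on the original queue. */
		if (bio->opf & SKETCH_REQ_POLLED) {
			bio->opf &= ~SKETCH_REQ_POLLED;
			bio->cookie = SKETCH_QC_T_NONE;
		}
		/* The resubmitter may block, so do not fail with EAGAIN. */
		bio->opf &= ~SKETCH_REQ_NOWAIT;
	}

	if (rq->bio) {
		if (list->tail)
			list->tail->next = rq->bio;
		else
			list->head = rq->bio;
		list->tail = rq->biotail;
		assert(list->tail->next == NULL);

		/* The request no longer owns any bios. */
		rq->bio = NULL;
		rq->biotail = NULL;
	}
}
```

In this model a caller only redirects each bio to its new device and then calls sketch_steal_bios(); the flag state is already correct for the destination queue when the bios are resubmitted.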
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
block/blk-mq.c | 17 +++++++++++++++++
drivers/nvme/host/multipath.c | 15 +--------------
2 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a29d8ac9d3e3..419b5c768af2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3412,6 +3412,23 @@ EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
*/
void blk_steal_bios(struct bio_list *list, struct request *rq)
{
+ struct bio *bio;
+
+ for (bio = rq->bio; bio; bio = bio->bi_next) {
+ if (bio->bi_opf & REQ_POLLED) {
+ bio->bi_opf &= ~REQ_POLLED;
+ bio->bi_cookie = BLK_QC_T_NONE;
+ }
+ /*
+ * The alternate request queue that we may end up submitting
+ * the bio to may be frozen temporarily, in this case REQ_NOWAIT
+ * will fail the I/O immediately with EAGAIN to the issuer.
+ * We are not in the issuer context which cannot block. Clear
+ * the flag to avoid spurious EAGAIN I/O failures.
+ */
+ bio->bi_opf &= ~REQ_NOWAIT;
+ }
+
if (rq->bio) {
if (list->tail)
list->tail->bi_next = rq->bio;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index bfcc5904e6a2..cda8a1e21f59 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -154,21 +154,8 @@ void nvme_failover_req(struct request *req)
}
spin_lock_irqsave(&ns->head->requeue_lock, flags);
- for (bio = req->bio; bio; bio = bio->bi_next) {
+ for (bio = req->bio; bio; bio = bio->bi_next)
bio_set_dev(bio, ns->head->disk->part0);
- if (bio->bi_opf & REQ_POLLED) {
- bio->bi_opf &= ~REQ_POLLED;
- bio->bi_cookie = BLK_QC_T_NONE;
- }
- /*
- * The alternate request queue that we may end up submitting
- * the bio to may be frozen temporarily, in this case REQ_NOWAIT
- * will fail the I/O immediately with EAGAIN to the issuer.
- * We are not in the issuer context which cannot block. Clear
- * the flag to avoid spurious EAGAIN I/O failures.
- */
- bio->bi_opf &= ~REQ_NOWAIT;
- }
blk_steal_bios(&ns->head->requeue_list, req);
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
--
2.39.5
* Re: [PATCH V2 1/2] block: move bio queue-transition flag fixups into blk_steal_bios()
2026-02-26 3:12 ` [PATCH V2 1/2] block: move bio queue-transition flag fixups into blk_steal_bios() Chaitanya Kulkarni
@ 2026-02-26 15:32 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2026-02-26 15:32 UTC (permalink / raw)
To: Chaitanya Kulkarni; +Cc: kbusch, hch, sagi, wagi, linux-block, linux-nvme
On Wed, Feb 25, 2026 at 07:12:42PM -0800, Chaitanya Kulkarni wrote:
> blk_steal_bios() transfers bios from a request to a bio_list when the
> request is requeued to a different queue. The NVMe multipath failover
> path (nvme_failover_req) currently open-codes clearing of REQ_POLLED,
> bi_cookie, and REQ_NOWAIT on each bio before calling blk_steal_bios().
>
> Move these fixups into blk_steal_bios() itself so that any caller
> automatically gets correct flag state when bios cross queue boundaries.
> Simplify nvme_failover_req() accordingly.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* [PATCH V2 2/2] block: clear BIO_QOS flags in blk_steal_bios()
2026-02-26 3:12 [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
2026-02-26 3:12 ` [PATCH V2 1/2] block: move bio queue-transition flag fixups into blk_steal_bios() Chaitanya Kulkarni
@ 2026-02-26 3:12 ` Chaitanya Kulkarni
2026-02-26 15:33 ` Christoph Hellwig
2026-03-10 5:43 ` [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
2026-03-10 13:11 ` Jens Axboe
3 siblings, 1 reply; 7+ messages in thread
From: Chaitanya Kulkarni @ 2026-02-26 3:12 UTC (permalink / raw)
To: kbusch, hch, sagi, wagi; +Cc: linux-block, linux-nvme, Chaitanya Kulkarni
When a bio goes through the rq_qos infrastructure on a path's request
queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
flags indicate that rq_qos_done_bio() should be called on completion
to update rq_qos accounting.
During path failover in nvme_failover_req(), the bio's bi_bdev is
redirected from the failed path's disk to the multipath head's disk
via bio_set_dev(). However, the BIO_QOS flags are not cleared.
When the bio eventually completes (either successfully via a new path
or with an error via bio_io_error()), rq_qos_done_bio() checks for
these flags and calls __rq_qos_done_bio(q->rq_qos, bio) where q is
obtained from the bio's current bi_bdev - which is now the multipath
head's queue, not the original path's queue.
The multipath head's queue does not have rq_qos enabled (q->rq_qos is
NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
must be valid.
This breaks when a bio is moved between queues during NVMe multipath
failover, leading to a NULL pointer dereference.
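
The failure condition described above can be modeled in a few lines of userspace C. This is a hedged sketch, not the kernel implementation: the model_* types, the MODEL_* flag values, and the helper names are hypothetical. It only captures the invariant the completion path relies on (a QoS flag set on a bio implies a non-NULL rq_qos on the bio's current queue) and how redirecting the bio without clearing the flags breaks it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define MODEL_QOS_THROTTLED (1u << 0)
#define MODEL_QOS_MERGED    (1u << 1)
#define MODEL_QOS_MASK      (MODEL_QOS_THROTTLED | MODEL_QOS_MERGED)

struct model_rq_qos {
	int done_calls;
};

struct model_queue {
	struct model_rq_qos *rq_qos;	/* NULL when rq_qos is not enabled */
};

struct model_bio {
	struct model_queue *q;		/* queue reached through bi_bdev */
	unsigned int flags;
};

/*
 * True if completing this bio would hand a NULL rq_qos to the
 * completion hook, i.e. the flags promise QoS state the queue lacks.
 */
bool model_done_bio_would_oops(const struct model_bio *bio)
{
	bool wants_qos = (bio->flags & MODEL_QOS_MASK) != 0;

	assert(bio->q != NULL);
	return wants_qos && bio->q->rq_qos == NULL;
}

/* Redirect the bio to the head queue, optionally dropping stale flags. */
void model_failover(struct model_bio *bio, struct model_queue *head_q,
		    bool clear_qos_flags)
{
	bio->q = head_q;
	if (clear_qos_flags)
		bio->flags &= ~MODEL_QOS_MASK;
}
```

With clear_qos_flags set to false the model reproduces the stale-flag state behind the crash; with it set to true the completion path sees no QoS flags and never looks at the head queue's rq_qos.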
Execution context timeline:
* =====> dd process context
[USER] dd process
[SYSCALL] write() - dd process context
submit_bio()
nvme_ns_head_submit_bio() - path selection
blk_mq_submit_bio() #### QOS FLAGS SET HERE
[USER] dd waits or returns
==== I/O in flight on NVMe hardware =====
===== End of submission path ====
------------------------------------------------------
* dd process context =====> Interrupt context
[IRQ] NVMe completion interrupt
nvme_irq()
nvme_complete_rq()
nvme_failover_req() ### BIO MOVED TO HEAD
spin_lock_irqsave (atomic section)
bio_set_dev() changes bi_bdev
### BUG: QOS flags NOT cleared
kblockd_schedule_work()
* Interrupt context =====> kblockd workqueue
[WQ] kblockd workqueue - kworker process
nvme_requeue_work()
submit_bio_noacct()
nvme_ns_head_submit_bio()
nvme_find_path() returns NULL
bio_io_error()
bio_endio()
rq_qos_done_bio() ### CRASH ###
KERNEL PANIC / OOPS
Crash from blktests nvme/058 (rapid namespace remapping):
[ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1339.641025] nvme nvme4: rescanning namespaces.
[ 1339.642064] #PF: supervisor read access in kernel mode
[ 1339.642067] #PF: error_code(0x0000) - not-present page
[ 1339.642070] PGD 0 P4D 0
[ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
[ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
Tainted: G O N 6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
[ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
[ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
[ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
[ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
48 89 df ff d0 0f 1f
[ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
[ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
[ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
[ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
[ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
[ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
[ 1339.729029] FS: 0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
[ 1339.734525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
[ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
[ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 1339.748488] Call Trace:
[ 1339.749512] <TASK>
[ 1339.750449] bio_endio+0x71/0x2e0
[ 1339.751833] nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
[ 1339.754073] __submit_bio+0x222/0x5e0
[ 1339.755623] ? rcu_is_watching+0xd/0x40
[ 1339.757201] ? submit_bio_noacct_nocheck+0x131/0x370
[ 1339.759210] submit_bio_noacct_nocheck+0x131/0x370
[ 1339.761189] ? submit_bio_noacct+0x20/0x620
[ 1339.762849] nvme_requeue_work+0x4b/0x60 [nvme_core]
[ 1339.764828] process_one_work+0x20e/0x630
[ 1339.766528] worker_thread+0x184/0x330
[ 1339.768129] ? __pfx_worker_thread+0x10/0x10
[ 1339.769942] kthread+0x10a/0x250
[ 1339.771263] ? __pfx_kthread+0x10/0x10
[ 1339.772776] ? __pfx_kthread+0x10/0x10
[ 1339.774381] ret_from_fork+0x273/0x2e0
[ 1339.775948] ? __pfx_kthread+0x10/0x10
[ 1339.777504] ret_from_fork_asm+0x1a/0x30
[ 1339.779163] </TASK>
Fix this by clearing both the BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
in blk_steal_bios(), which nvme_failover_req() calls when bios are
redirected to the multipath head. This is consistent with the existing
code there that clears the REQ_POLLED and REQ_NOWAIT flags when a bio
changes queues.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
---
block/blk-mq.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 419b5c768af2..fea1d46829d6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3427,6 +3427,8 @@ void blk_steal_bios(struct bio_list *list, struct request *rq)
* the flag to avoid spurious EAGAIN I/O failures.
*/
bio->bi_opf &= ~REQ_NOWAIT;
+ bio_clear_flag(bio, BIO_QOS_THROTTLED);
+ bio_clear_flag(bio, BIO_QOS_MERGED);
}
if (rq->bio) {
--
2.39.5
* Re: [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover
2026-02-26 3:12 [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
2026-02-26 3:12 ` [PATCH V2 1/2] block: move bio queue-transition flag fixups into blk_steal_bios() Chaitanya Kulkarni
2026-02-26 3:12 ` [PATCH V2 2/2] block: clear BIO_QOS flags in blk_steal_bios() Chaitanya Kulkarni
@ 2026-03-10 5:43 ` Chaitanya Kulkarni
2026-03-10 13:11 ` Jens Axboe
3 siblings, 0 replies; 7+ messages in thread
From: Chaitanya Kulkarni @ 2026-03-10 5:43 UTC (permalink / raw)
To: Chaitanya Kulkarni, kbusch@kernel.org, hch@lst.de,
sagi@grimberg.me, wagi@monom.org, Jens Axboe
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
On 2/25/26 19:12, Chaitanya Kulkarni wrote:
> Hi,
>
> When a bio is processed on a path's request queue with rq_qos enabled,
> it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. During NVMe
> multipath failover, nvme_failover_req() redirects the bio's bi_bdev to
> the multipath head's disk via bio_set_dev(), but the BIO_QOS flags are
> left intact.
>
> This series moves bio queue transition code into blk_steal_bios()
> and adds a patch to clear the BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
> in blk_steal_bios().
>
> -ck
>
> v1->v2 https://lore.kernel.org/all/20251124070142.GA17632@lst.de/:
>
> * Add a new patch to move the bio flag fixup loop from nvme_failover_req()
> into blk_steal_bios() rather than adding it only in the NVMe multipath
> path. (Christoph)
>
> Chaitanya Kulkarni (2):
> block: move bio queue-transition flag fixups into blk_steal_bios()
> block: clear BIO_QOS flags in blk_steal_bios()
>
> block/blk-mq.c | 19 +++++++++++++++++++
> drivers/nvme/host/multipath.c | 15 +--------------
> 2 files changed, 20 insertions(+), 14 deletions(-)
>
Gentle ping on this. How should we merge it: through the nvme tree or the block tree?
-ck
* Re: [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover
2026-02-26 3:12 [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
` (2 preceding siblings ...)
2026-03-10 5:43 ` [PATCH v2 0/2] blk/nvme: fix NULL deref in rq_qos_done_bio() on multipath failover Chaitanya Kulkarni
@ 2026-03-10 13:11 ` Jens Axboe
3 siblings, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2026-03-10 13:11 UTC (permalink / raw)
To: kbusch, hch, sagi, wagi, Chaitanya Kulkarni; +Cc: linux-block, linux-nvme
On Wed, 25 Feb 2026 19:12:41 -0800, Chaitanya Kulkarni wrote:
> When a bio is processed on a path's request queue with rq_qos enabled,
> it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. During NVMe
> multipath failover, nvme_failover_req() redirects the bio's bi_bdev to
> the multipath head's disk via bio_set_dev(), but the BIO_QOS flags are
> left intact.
>
> This series moves bio queue transition code into blk_steal_bios()
> and adds a patch to clear the BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
> in blk_steal_bios().
>
> [...]
Applied, thanks!
[1/2] block: move bio queue-transition flag fixups into blk_steal_bios()
commit: b2c45ced591e6cf947560d2d290a51855926b774
[2/2] block: clear BIO_QOS flags in blk_steal_bios()
commit: daa6c79858e9ca75c548452bf71db8a9e61bde42
Best regards,
--
Jens Axboe