Linux block layer
* [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
@ 2026-05-18 14:52 Achkinazi, Igor
  2026-05-18 19:23 ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Achkinazi, Igor @ 2026-05-18 14:52 UTC
  To: kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, axboe@kernel.dk
  Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	Achkinazi, Igor

When nvme_ns_head_submit_bio() remaps a block IO from the multipath
head to a per-path namespace (one path of the multipath device),
bio_set_dev() clears BIO_REMAPPED.  Before commit a7c7f7b2b641 ("nvme:
use bio_set_dev to assign ->bi_bdev"), the code assigned bio->bi_bdev
directly, which left BIO_REMAPPED untouched.  The block IO is then
queued on current->bio_list (deferred, not processed inline) and the
SRCU read lock is released.
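
As a rough illustration of the difference between the two assignment
styles, here is a minimal userspace sketch.  The model_* types and
helpers and the MODEL_BIO_REMAPPED bit are hypothetical stand-ins, not
the kernel definitions; only the "clear the flag, then retarget" shape
is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's BIO_REMAPPED flag bit. */
#define MODEL_BIO_REMAPPED (1u << 0)

struct model_bdev {
	int id;
};

struct model_bio {
	unsigned int flags;
	const struct model_bdev *bdev;
};

/* Old style: plain pointer assignment, flags are left untouched. */
static void model_assign_bdev(struct model_bio *bio,
			      const struct model_bdev *bdev)
{
	bio->bdev = bdev;
}

/* bio_set_dev() style: drop the remapped marker, then retarget. */
static void model_bio_set_dev(struct model_bio *bio,
			      const struct model_bdev *bdev)
{
	bio->flags &= ~MODEL_BIO_REMAPPED;
	bio->bdev = bdev;
}

static bool model_is_remapped(const struct model_bio *bio)
{
	return (bio->flags & MODEL_BIO_REMAPPED) != 0;
}
```

Starting from a bio that carries the marker, model_assign_bdev() keeps
it while model_bio_set_dev() drops it, which is the behavior change
this patch compensates for.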

The deferred block IO itself is dispatched directly to
blk_mq_submit_bio() without re-entering submit_bio_noacct(), so it
would be fine on its own.  The problem is when the block IO size
exceeds queue limits and blk_mq_submit_bio() needs to split it using
__bio_split_to_limits().

The split remainder is resubmitted through submit_bio_noacct() which
calls bio_check_eod() again because BIO_REMAPPED is not set.  This
sometimes races with nvme_ns_remove() zeroing the capacity after
synchronize_srcu().  Result: bio_check_eod() sees zeroed capacity and
fails the IO with "attempt to access beyond end of device" instead of
letting it fail over to another path.
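
The race outcome can be sketched in userspace as follows.  This is a
simplified model, not the kernel code: the struct layouts, the
MODEL_BIO_REMAPPED bit and the model_* helpers are assumptions made for
illustration, and the range test only follows the general shape of an
end-of-device check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's BIO_REMAPPED flag bit. */
#define MODEL_BIO_REMAPPED (1u << 0)

struct model_disk {
	uint64_t capacity;	/* in 512-byte sectors */
};

struct model_bio {
	unsigned int flags;
	uint64_t sector;
	uint32_t nr_sectors;
	const struct model_disk *disk;
};

/* True when the bio's sector range does not fit inside the disk. */
static bool model_beyond_eod(const struct model_bio *bio)
{
	uint64_t max = bio->disk->capacity;
	uint32_t nr = bio->nr_sectors;

	return nr && (nr > max || bio->sector > max - nr);
}

/*
 * Model of resubmitting a split remainder: the end-of-device check
 * runs only when the bio is not marked as remapped.
 * Returns 0 if the bio would proceed to the driver, -1 if rejected.
 */
static int model_resubmit(const struct model_bio *bio)
{
	if (!(bio->flags & MODEL_BIO_REMAPPED) && model_beyond_eod(bio))
		return -1;
	return 0;
}
```

With the capacity set to 0 (the state after namespace removal), an
unmarked remainder is rejected by model_resubmit(), while one carrying
the marker proceeds and can fail later in the driver.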

Observed failure scenario during tests:

  Setup: NVMe multipath with multiple paths (e.g., controllers nvme0,
  nvme1) to the same namespace, exposed as a single multipath block
  device (e.g., nvme0n1).

  Steps to reproduce:
    1. Run sustained IO against the multipath head device (e.g., with
       vdbench).
    2. Delete the namespace on one of the paths (e.g., detach the
       namespace from the NVMe controller in the target-side
       subsystem).
    3. The IO that was remapped to the removed path and requires
       splitting (exceeds queue limits) hits the race.

  Expected behavior:
    The IO should fail on the removed path and nvme_failover_req()
    should retry it on the remaining healthy path.  IO continues
    without errors visible to the application.

  Actual behavior:
    The kernel reports IO errors to the application.  dmesg shows:

      IO_task /dev/di: attempt to access beyond end of device
      nvme1c9n1: rw=33556480, sector=476160, nr_sectors=256 limit=0

    The testing application stops on these errors even though a
    healthy path is available.  bio_check_eod() rejects the split
    remainder before it ever reaches the NVMe driver, so
    nvme_failover_req() never gets a chance to fail over to the
    other path.
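
    For readers decoding the dmesg line above, a small userspace
    sketch follows.  It assumes the operation code lives in the low 8
    bits of the rw value (REQ_OP_BITS in include/linux/blk_types.h)
    and that sectors are 512 bytes; the eod_report struct and the
    report_* helpers are hypothetical.  The remaining flag bits are
    left undecoded because their meaning depends on the kernel
    version.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror REQ_OP_BITS; check blk_types.h for your kernel. */
#define MODEL_REQ_OP_BITS	8
#define MODEL_REQ_OP_MASK	((1u << MODEL_REQ_OP_BITS) - 1)
#define MODEL_SECTOR_SHIFT	9	/* 512-byte sectors */

/* Fields as printed in the "beyond end of device" message. */
struct eod_report {
	uint32_t rw;
	uint64_t sector;
	uint32_t nr_sectors;
	uint64_t limit;
};

/* Operation code: the low bits of rw. */
static uint32_t report_op(const struct eod_report *r)
{
	return r->rw & MODEL_REQ_OP_MASK;
}

/* Everything above the operation code: request flag bits. */
static uint32_t report_flags(const struct eod_report *r)
{
	return r->rw & ~MODEL_REQ_OP_MASK;
}

static uint64_t report_byte_offset(const struct eod_report *r)
{
	return r->sector << MODEL_SECTOR_SHIFT;
}

static uint64_t report_byte_len(const struct eod_report *r)
{
	return (uint64_t)r->nr_sectors << MODEL_SECTOR_SHIFT;
}

/* First sector past the end of the request. */
static uint64_t report_end_sector(const struct eod_report *r)
{
	return r->sector + r->nr_sectors;
}
```

    Fed with rw=33556480, sector=476160, nr_sectors=256 and limit=0,
    this gives operation code 0 (REQ_OP_READ), flag bits 0x02000800,
    and a 128 KiB request whose end sector 476416 lies past the
    reported limit of 0.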

We observed this failure during NVMe multipath failover testing at
Dell, for example, on kernel 5.14.0-570.23.1.el9_6.x86_64 (Red Hat
9.7), kernel 6.4.0-150600.23.53-default (SLES 15.6), and others.

The fix is setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio().  This skips bio_check_eod() on resubmission
for both the remapped IO and any split clones derived from it.  The
EOD check already passed on the multipath head.
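
A minimal userspace sketch of why marking the bio once covers the split
pieces as well.  The model_* names, the MODEL_BIO_REMAPPED bit and the
sector numbers used with it are illustrative assumptions; the sketch
only models "the front piece is a clone that inherits the marker, the
original bio becomes the remainder".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's BIO_REMAPPED flag bit. */
#define MODEL_BIO_REMAPPED (1u << 0)

struct model_bio {
	unsigned int flags;
	uint64_t sector;
	uint32_t nr_sectors;
};

/* Model of the fix: mark the bio right after retargeting it. */
static void model_remap_to_path(struct model_bio *bio)
{
	bio->flags |= MODEL_BIO_REMAPPED;
}

/*
 * Split off the first max_sectors into *front; *bio is advanced and
 * becomes the remainder that gets resubmitted.  The clone inherits
 * the remapped marker from the original.
 * Returns true when a split happened.
 */
static bool model_split(struct model_bio *bio, uint32_t max_sectors,
			struct model_bio *front)
{
	if (max_sectors == 0 || bio->nr_sectors <= max_sectors)
		return false;

	front->flags = bio->flags & MODEL_BIO_REMAPPED;
	front->sector = bio->sector;
	front->nr_sectors = max_sectors;

	bio->sector += max_sectors;
	bio->nr_sectors -= max_sectors;
	return true;
}
```

For example, a marked 512-sector bio at sector 475904 split at 256
sectors yields a front piece at 475904 and a remainder at 476160, both
still marked, so neither is re-checked against the per-path capacity.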

This is safe because an NVMe per-path namespace device is always a
whole disk (bd_partno=0), never a partition, so skipping
blk_partition_remap() (also gated by BIO_REMAPPED) has no effect: there
is no partition-relative sector offset to convert to a
whole-disk-relative one.  If the per-path queue is dead, the IO fails
via the GD_DEAD check in bio_queue_enter().  If the per-path queue is
still alive, the request completes with an Invalid Namespace error from
the NVMe target and nvme_failover_req() handles path failover.
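
The whole-disk argument can be illustrated with a small userspace
sketch.  The model_bdev struct, model_partition_remap() and the
start offset of 2048 used with it are hypothetical; the point is only
that a device with partition number 0 has no offset to apply.

```c
#include <assert.h>
#include <stdint.h>

struct model_bdev {
	uint8_t partno;		/* 0 means whole disk */
	uint64_t start_sect;	/* partition start on the whole disk */
};

/*
 * Translate a device-relative sector to a whole-disk sector.
 * For a whole disk (partno == 0) the sector is returned unchanged.
 */
static uint64_t model_partition_remap(const struct model_bdev *bdev,
				      uint64_t sector)
{
	if (bdev->partno == 0)
		return sector;
	return sector + bdev->start_sect;
}
```

A per-path namespace device behaves like the partno == 0 case, so
skipping the remap leaves the sector exactly as it was; only a real
partition (e.g. one starting at sector 2048) would see a shift.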

Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios") which solved the same problem for partition
remaps resubmitted after bio splitting.

Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
---
 drivers/nvme/host/multipath.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac0..04f7c7e59945 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
        ns = nvme_find_path(head);
        if (likely(ns)) {
                bio_set_dev(bio, ns->disk->part0);
+               /*
+                * Mark the bio as remapped to the per-path namespace disk so
+                * that bio_check_eod() is skipped on resubmission (e.g. from
+                * bio splitting in blk_mq_submit_bio).  The EOD check already
+                * passed on the multipath head disk.
+                */
+               bio_set_flag(bio, BIO_REMAPPED);
                bio->bi_opf |= REQ_NVME_MPATH;
                trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                      bio->bi_iter.bi_sector);
--
2.43.0



Thread overview: 7+ messages
2026-05-18 14:52 [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks Achkinazi, Igor
2026-05-18 19:23 ` Keith Busch
2026-05-18 20:59   ` Achkinazi, Igor
2026-05-18 21:52     ` Keith Busch
2026-05-20 20:27       ` Achkinazi, Igor
2026-05-19  6:53   ` hch
2026-05-19 19:10     ` Keith Busch
