Linux block layer
 help / color / mirror / Atom feed
* [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
@ 2026-05-18 14:52 Achkinazi, Igor
  2026-05-18 19:23 ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Achkinazi, Igor @ 2026-05-18 14:52 UTC (permalink / raw)
  To: kbusch@kernel.org, hch@lst.de, sagi@grimberg.me, axboe@kernel.dk
  Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org,
	Achkinazi, Igor

When nvme_ns_head_submit_bio() remaps a block IO from the multipath
head to a per-path namespace (path of multipath), bio_set_dev() clears
BIO_REMAPPED.  Before commit a7c7f7b2b641 ("nvme: use bio_set_dev to
assign ->bi_bdev"), the code used a direct bio->bi_bdev assignment
which did not clear BIO_REMAPPED.  The block IO is then
queued on current->bio_list (deferred, not processed inline) and SRCU
read lock is released.

The deferred block IO itself is dispatched directly to
blk_mq_submit_bio() without re-entering submit_bio_noacct(), so it
would be fine on its own.  The problem is when the block IO size
exceeds queue limits and blk_mq_submit_bio() needs to split it using
__bio_split_to_limits().

The split remainder is resubmitted through submit_bio_noacct() which
calls bio_check_eod() again because BIO_REMAPPED is not set.  This
sometimes races with nvme_ns_remove() zeroing the capacity after
synchronize_srcu().  Result: bio_check_eod() sees zeroed capacity and
fails the IO with "attempt to access beyond end of device" instead of
letting it fail over to another path.

Observed failure scenario during tests:

  Setup: NVMe multipath with multiple paths (e.g., controllers nvme0,
  nvme1) to the same namespace, exposed as a single multipath block
  device (e.g., nvme0n1).

  Steps to reproduce:
    1. Run sustained IO against the multipath head device (e.g. vdbench)
    2. Delete the namespace on one of the paths (e.g., detach the
       namespace from NVMe controller on a subsystem on the
       target side).
    3. The IO that was remapped to the removed path and requires
       splitting (exceeds queue limits) hits the race.

  Expected behavior:
    The IO should fail on the removed path and nvme_failover_req()
    should retry it on the remaining healthy path.  IO continues
    without errors visible to the application.

  Actual behavior:
    The kernel reports IO errors to the application.  dmesg shows:

      IO_task /dev/di: attempt to access beyond end of device
      nvme1c9n1: rw=33556480, sector=476160, nr_sectors=256 limit=0

    IO errors were reported to the testing application, causing
    it to stop, despite a healthy path being available.  The IO is
    rejected by bio_check_eod() on the split remainder before it
    ever reaches the NVMe driver, so nvme_failover_req() never gets
    a chance to fail over to the other path.

We observed this failure during NVMe multipath failover testing at
Dell, for example, on kernel 5.14.0-570.23.1.el9_6.x86_64 (Red Hat
9.7), kernel 6.4.0-150600.23.53-default (SLES 15.6), and others.

The fix is setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio().  This skips bio_check_eod() on resubmission
for both the remapped IO and any split clones derived from it.  The
EOD check already passed on the multipath head.

This is safe because the individual path for nvme has bd_partno=0
(NVMe per-path namespace device is always a whole disk, not a
partition), so skipping blk_partition_remap() (also gated by
BIO_REMAPPED) has no effect: adjusting bio sector offsets from
partition-relative to whole-disk-relative is not necessary.  If the
per-path queue is dead, it fails via GD_DEAD check in
bio_queue_enter().  If the per-path queue is still alive, the
request completes with error Invalid-Namespace coming from nvme target
and nvme_failover_req() handles path failover.

Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios") which solved the same problem for partition
remaps resubmitted after bio splitting.

Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
---
 drivers/nvme/host/multipath.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 263161cb8ac0..04f7c7e59945 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
        ns = nvme_find_path(head);
        if (likely(ns)) {
                bio_set_dev(bio, ns->disk->part0);
+               /*
+                * Mark the bio as remapped to the per-path namespace disk so
+                * that bio_check_eod() is skipped on resubmission (e.g. from
+                * bio splitting in blk_mq_submit_bio).  The EOD check already
+                * passed on the multipath head disk.
+                */
+               bio_set_flag(bio, BIO_REMAPPED);
                bio->bi_opf |= REQ_NVME_MPATH;
                trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                      bio->bi_iter.bi_sector);
--
2.43.0


Internal Use - Confidential

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-18 14:52 [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks Achkinazi, Igor
@ 2026-05-18 19:23 ` Keith Busch
  2026-05-18 20:59   ` Achkinazi, Igor
  2026-05-19  6:53   ` hch
  0 siblings, 2 replies; 7+ messages in thread
From: Keith Busch @ 2026-05-18 19:23 UTC (permalink / raw)
  To: Achkinazi, Igor
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On Mon, May 18, 2026 at 02:52:56PM +0000, Achkinazi, Igor wrote:
> @@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>         ns = nvme_find_path(head);
>         if (likely(ns)) {
>                 bio_set_dev(bio, ns->disk->part0);
> +               /*
> +                * Mark the bio as remapped to the per-path namespace disk so
> +                * that bio_check_eod() is skipped on resubmission (e.g. from
> +                * bio splitting in blk_mq_submit_bio).  The EOD check already
> +                * passed on the multipath head disk.
> +                */
> +               bio_set_flag(bio, BIO_REMAPPED);

Any reason nvme multipath can't call submit_bio_noacct_nocheck()
directly instead? If it's safe to skip the eod check here, then it
looks safe to skip everything else too.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-18 19:23 ` Keith Busch
@ 2026-05-18 20:59   ` Achkinazi, Igor
  2026-05-18 21:52     ` Keith Busch
  2026-05-19  6:53   ` hch
  1 sibling, 1 reply; 7+ messages in thread
From: Achkinazi, Igor @ 2026-05-18 20:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

Keith Busch wrote :
> Any reason nvme multipath can't call submit_bio_noacct_nocheck()
> directly instead? If it's safe to skip the eod check here, then it
> looks safe to skip everything else too.

I'd prefer to keep this internal to nvme and use BIO_REMAPPED rather than
switching to submit_bio_noacct_nocheck:
- submit_bio_noacct_nocheck is block-internal and not exported, so using it
from NVMe would require a block-layer API change just for this.
- it bypasses more checks than I see we need here (throttling, RO, crypto,
op-type), I prefer bypassing only the EOD check.
- BIO_REMAPPED propagates to split clones, so it covers all resubmissions,
not just the initial one.


Internal Use - Confidential

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-18 20:59   ` Achkinazi, Igor
@ 2026-05-18 21:52     ` Keith Busch
  2026-05-20 20:27       ` Achkinazi, Igor
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2026-05-18 21:52 UTC (permalink / raw)
  To: Achkinazi, Igor
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On Mon, May 18, 2026 at 08:59:09PM +0000, Achkinazi, Igor wrote:
> - submit_bio_noacct_nocheck is block-internal and not exported, so using it
> from NVMe would require a block-layer API change just for this.

That is not a valid reason to not do this. We often export and unexport
APIs as needs evolve. I'm not saying you have to use this API, but I'm
just not yet seeing why we shouldn't.

> - it bypasses more checks than I see we need here (throttling, RO, crypto,
> op-type), I prefer bypassing only the EOD check.

I do not think we need any of those on the hidden device either. The
stacked limits should have caught any problems or handled any policy on
the first pass.

Though I am aware of certain limit checks that go through here are not
done under the queue's entered reference count so there's some racy
things possible, but that's a pre-existing issue and doing
submit_bio_noacct a 2nd time doesn't help.

> - BIO_REMAPPED propagates to split clones, so it covers all resubmissions,
> not just the initial one.

Submitting split clones already uses submit_bio_noacct_nocheck() so the
BIO_REMAPPED flag doesn't come into play there either.

And BIO_REMAPPED applies to when the bio's bd_part is a partition. The
multipath layer overrides the block device with the part0 of the path,
so there is no partition (and we may not have been dealing with a
partition in first place), so this patch is introducing a new implicit
expectation on what this flag means. Consider a real failover going
through the requeue_list using the submit_bio_noacct() again. If you've
already set BIO_REMAPPED, then it skips the checks, but that could have
happened across a controller format change that modified all the limits,
so you probably want the eod checks to happen again on this scenario.
Your suggestion would skip it because you're using BIO_REMAPPED as a
proxy to skip "eod" checks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-18 19:23 ` Keith Busch
  2026-05-18 20:59   ` Achkinazi, Igor
@ 2026-05-19  6:53   ` hch
  2026-05-19 19:10     ` Keith Busch
  1 sibling, 1 reply; 7+ messages in thread
From: hch @ 2026-05-19  6:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Achkinazi, Igor, hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On Mon, May 18, 2026 at 01:23:17PM -0600, Keith Busch wrote:
> On Mon, May 18, 2026 at 02:52:56PM +0000, Achkinazi, Igor wrote:
> > @@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
> >         ns = nvme_find_path(head);
> >         if (likely(ns)) {
> >                 bio_set_dev(bio, ns->disk->part0);
> > +               /*
> > +                * Mark the bio as remapped to the per-path namespace disk so
> > +                * that bio_check_eod() is skipped on resubmission (e.g. from
> > +                * bio splitting in blk_mq_submit_bio).  The EOD check already
> > +                * passed on the multipath head disk.
> > +                */
> > +               bio_set_flag(bio, BIO_REMAPPED);
> 
> Any reason nvme multipath can't call submit_bio_noacct_nocheck()
> directly instead? If it's safe to skip the eod check here, then it
> looks safe to skip everything else too.

We really shouldn't expose that, and it doesn't quite to the
right checks here.

But I think the proper fix is to stop using bio_set_dev entirely.
We do not want to associate the bio with a new blkg as those are
basically not going to exist for the underlying devices.  I'm also
not sure we want to allow another round of BPF throttling either.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-19  6:53   ` hch
@ 2026-05-19 19:10     ` Keith Busch
  0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2026-05-19 19:10 UTC (permalink / raw)
  To: hch@lst.de
  Cc: Achkinazi, Igor, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org

On Tue, May 19, 2026 at 08:53:03AM +0200, hch@lst.de wrote:
> On Mon, May 18, 2026 at 01:23:17PM -0600, Keith Busch wrote:
> > 
> > Any reason nvme multipath can't call submit_bio_noacct_nocheck()
> > directly instead? If it's safe to skip the eod check here, then it
> > looks safe to skip everything else too.
> 
> We really shouldn't expose that, and it doesn't quite to the
> right checks here.
> 
> But I think the proper fix is to stop using bio_set_dev entirely.
> We do not want to associate the bio with a new blkg as those are
> basically not going to exist for the underlying devices.  I'm also
> not sure we want to allow another round of BPF throttling either.

Simply dropping the bio_set_dev usage doesn't really fix the problem
here. Sure, bio_set_dev clears the BIO_REMAPPED flag, but it's not
guaranteed that flag was set in the first place.

I have an alternate proposal that's part of a larger series. This can
notify the driver of an early error so it can decide how to deal with
the bio:

  https://lore.kernel.org/linux-block/20260519172326.3462354-6-kbusch@meta.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
  2026-05-18 21:52     ` Keith Busch
@ 2026-05-20 20:27       ` Achkinazi, Igor
  0 siblings, 0 replies; 7+ messages in thread
From: Achkinazi, Igor @ 2026-05-20 20:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: hch@lst.de, sagi@grimberg.me, axboe@kernel.dk,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org, stable@vger.kernel.org


Internal Use - Confidential
On Mon, May 18, 2026 at 03:52:09PM -0600, Keith Busch wrote:
> On Mon, May 18, 2026 at 08:59:09PM +0000, Achkinazi, Igor wrote:
> > - submit_bio_noacct_nocheck is block-internal and not exported, so using it
> > from NVMe would require a block-layer API change just for this.
>
> That is not a valid reason to not do this. We often export and unexport
> APIs as needs evolve. I'm not saying you have to use this API, but I'm
> just not yet seeing why we shouldn't.

Fair point

> > - it bypasses more checks than I see we need here (throttling, RO, crypto,
> > op-type), I prefer bypassing only the EOD check.
>
> I do not think we need any of those on the hidden device either. The
> stacked limits should have caught any problems or handled any policy on
> the first pass.

Agreed, however I want to point out that EOD might be special since the
capacity can change asynchronously between passes because set_capacity(0)
in nvme_ns_remove() runs before synchronize_srcu(), not after.

> > - BIO_REMAPPED propagates to split clones, so it covers all resubmissions,
> > not just the initial one.
>
> Submitting split clones already uses submit_bio_noacct_nocheck() so the
> BIO_REMAPPED flag doesn't come into play there either.

You are right, on current mainline (since commit 0b64682e78f7 "block:
skip unnecessary checks for split bio") split remainders go through
submit_bio_noacct_nocheck().  The split path was the trigger on the
older production kernels where we observed the failures (5.14, 6.4),
where splits still went through submit_bio_noacct().

Looking at this more carefully, the race still exists on current mainline
through the initial submit_bio_noacct() call inside
nvme_ns_head_submit_bio() itself, not in splits:

  nvme_find_path(head)
              -- returns ns, NVME_NS_READY was set
              -- nvme_ns_remove() races in here:
              --  clear_bit(NVME_NS_READY)
              --  set_capacity(ns->disk, 0)
              --  synchronize_srcu() blocks (we hold it)
  bio_set_dev(bio, ns->disk->part0)
              -- clears BIO_REMAPPED
  submit_bio_noacct(bio)
              -- bio_check_eod() sees capacity=0, fails

The SRCU read lock prevents synchronize_srcu() from completing, but it
does not prevent set_capacity(0) from executing.  So the bio can fail
the EOD check on the per-path device while we're still inside the SRCU
read-side critical section.


> And BIO_REMAPPED applies to when the bio's bd_part is a partition. The
> multipath layer overrides the block device with the part0 of the path,
> so there is no partition (and we may not have been dealing with a
> partition in first place), so this patch is introducing a new implicit
> expectation on what this flag means.

I agree this sounds like overloading of the BIO_REMAPPED flag that was
introduced to skip partition remaps.  However skipping bio_check_eod
and blk_partition_remap is what is needed: the per-path device is always
a whole disk so skipping blk_partition_remap is a no-op, and skipping
bio_check_eod is intentional to avoid the false IO error.  This is the
same pattern as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios") which used BIO_REMAPPED to skip EOD on
resubmission after remapping.

I did not want to add a new block-layer flag for a single case that
needs the exact same behavior the existing flag provides.

> Consider a real failover going through the requeue_list using the
> submit_bio_noacct() again. If you've already set BIO_REMAPPED, then it
> skips the checks, but that could have happened across a controller
> format change that modified all the limits, so you probably want the eod
> checks to happen again on this scenario. Your suggestion would skip it
> because you're using BIO_REMAPPED as a proxy to skip "eod" checks.

I see that the failover path clears BIO_REMAPPED before the bio reaches
the requeue list.  nvme_failover_req() calls bio_set_dev() on each bio
to redirect it back to the multipath head:

  void nvme_failover_req(struct request *req)
  {
      ...
      spin_lock_irqsave(&ns->head->requeue_lock, flags);
      for (bio = req->bio; bio; bio = bio->bi_next)
          bio_set_dev(bio, ns->head->disk->part0);  /* clears BIO_REMAPPED */
      blk_steal_bios(&ns->head->requeue_list, req);
      spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
      ...
  }

Then nvme_requeue_work() calls submit_bio_noacct(bio) with BIO_REMAPPED
already cleared, so bio_check_eod() runs normally on the multipath head.
If a controller format changed the capacity in between, the EOD check on
the multipath head catches it.

I see from your "validate bios against queue limits" discussion that
Christoph Hellwig suggested eliminating the unconditional set_capacity(0)
entirely and you are exploring whether the GD_DEAD checks are sufficient
to replace it. That might be the root cause fix.

You also noted in the mentioned thread that BIO_REMAPPED does not cover
bio_queue_enter() -> GD_DEAD -> bio_io_error() path.  That is true, but
I think it is a separate race.  The error we saw in our tests is
specifically bio_check_eod():

    "attempt to access beyond end of device"
    "nvme1c9n1: rw=33556480, sector=476160, nr_sectors=256 limit=0"

BIO_REMAPPED addresses this reported failure.  If removing
set_capacity(0) from nvme_ns_remove() goes in as part of your RFC
series, it fixes both races and this patch is not needed.  Until then,
this patch provides a minimal fix for the bio_check_eod() case that is
backportable to stable kernels affected today.

I will send a v2 with an updated commit message that clarifies the
primary race and drops the arguments you rightly pointed out aren’t
valid.

Thanks,
Igor

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-05-20 20:27 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-18 14:52 [PATCH] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks Achkinazi, Igor
2026-05-18 19:23 ` Keith Busch
2026-05-18 20:59   ` Achkinazi, Igor
2026-05-18 21:52     ` Keith Busch
2026-05-20 20:27       ` Achkinazi, Igor
2026-05-19  6:53   ` hch
2026-05-19 19:10     ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox