Linux-NVME Archive on lore.kernel.org
* [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
@ 2026-04-27  0:34 Chao Shi
  2026-04-27  0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Chao Shi @ 2026-04-27  0:34 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
	Dave Tian, Weidong Zhu

When an NVMe namespace is configured with embedded metadata (flbas bit 4
set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on
any request that reaches it with REQ_INTEGRITY unset.  The WARN was
observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
that performs semantic mutation of Identify Namespace responses.
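
For concreteness, the mutated Identify Namespace state that produces
this combination looks roughly as follows (field names per
include/linux/nvme.h; the values are illustrative, not taken from the
fuzzer corpus):

  struct nvme_id_ns *id = ...;     /* controller's Identify response */

  id->flbas      = 0x10;           /* NVME_NS_FLBAS_META_EXT, LBA format 0 */
  id->dps        = 0;              /* no Protection Information */
  id->lbaf[0].ms = cpu_to_le16(8); /* 8 bytes of extended metadata per LBA */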

The trigger requires three conditions to align: (a) a namespace
transitions through the EXT_LBAS non-PI state (head->ms != 0,
features & NVME_NS_EXT_LBAS, !(features & NVME_NS_METADATA_SUPPORTED)),
(b) nvme_init_integrity() returns false through the early-exit branch
at core.c:1834 without populating bi->metadata_size, leaving the disk
without an integrity profile (blk_get_integrity() returns NULL), and
(c) a request that was admitted to the block layer before the namespace
update reaches nvme_setup_rw() after it.
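
Condition (a) is established by the PCIe branch of
nvme_configure_metadata(); sketched below with mainline names, details
may differ in this tree:

  if (id->flbas & NVME_NS_FLBAS_META_EXT) {
          head->features |= NVME_NS_EXT_LBAS;
          /*
           * NVME_NS_METADATA_SUPPORTED is only added when the
           * controller can carry separate integrity segments; without
           * it the namespace stays in the EXT_LBAS non-PI state above.
           */
          if (ctrl->max_integrity_segments)
                  head->features |= NVME_NS_METADATA_SUPPORTED;
  }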

The admission gap arises in two places.  First, the plug-list flush
path: a process with writeback requests queued in a plug before the
namespace update flushes them on file close (blk_finish_plug ->
blk_mq_flush_plug_list -> nvme_setup_rw), bypassing any capacity-zero
gate.  Second, the cached-rq path: blk_mq_submit_bio() at blk-mq.c:3155
may find a cached request; if so, the bio_queue_enter()
freeze-serialization guard at blk-mq.c:3174-3176 is skipped and the bio
is dispatched immediately.
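
The cached-rq short-circuit is visible in the shape of
blk_mq_submit_bio(); a sketch with mainline names (the line numbers
cited above are from this tree):

  rq = blk_mq_peek_cached_request(plug, q, bio->bi_opf);
  if (!rq) {
          /* Only the non-cached path takes the freeze-aware guard. */
          if (unlikely(bio_queue_enter(bio)))
                  return;
  }
  /*
   * A cached rq already holds a q_usage_counter reference from its
   * original allocation and proceeds to dispatch without re-entering
   * the queue.
   */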

In both cases the bio was submitted without REQ_INTEGRITY (because
blk_get_integrity() returned NULL at dispatch time, so
bio_integrity_action() returned 0 and bio_integrity_prep() was not
called), and it reaches nvme_setup_rw() for a namespace where
head->ms != 0.  The existing BLK_STS_NOTSUPP return correctly handles
this dispatch; the WARN_ON_ONCE is a false positive.

The WARN was reproduced six times over four days of fuzzing (April
2026).  A representative crash shows the plug-flush path:

  nvme0n1: detected capacity change from 2097152 to 0
  WARNING: drivers/nvme/host/core.c:1042 at nvme_setup_rw+0x768/0xfd0
  PID: 785 (systemd-udevd)
  Call Trace:
   nvme_setup_cmd / nvme_queue_rq / blk_mq_dispatch_rq_list
   blk_mq_flush_plug_list / blk_finish_plug / blkdev_writepages
   sync_blockdev / bdev_release / __fput / sys_close

Replace WARN_ON_ONCE with pr_debug_ratelimited so the condition is
logged at debug level without a splat.  The BLK_STS_NOTSUPP return is
preserved; I/O to the transitioning namespace is still rejected.

An alternative approach that addresses the root cause at the
integrity-profile level is proposed in patch 2/2: populate
bi->metadata_size for EXT_LBAS non-PI namespaces in nvme_init_integrity()
so that bio_integrity_action() returns non-zero, bio_integrity_prep()
sets REQ_INTEGRITY, and nvme_setup_rw() never reaches this branch.
Both patches are sent as RFC for maintainer guidance on the preferred
direction.

Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_MALICIOUS_RESPONDER=1
NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500
rescan_controller cycles.  No WARN, BUG, or Oops observed.

Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
 drivers/nvme/host/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d1711ef59fb..4e20c8f08e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1039,8 +1039,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		 * namespace capacity to zero to prevent any I/O.
 		 */
 		if (!blk_integrity_rq(req)) {
-			if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
+			if (!nvme_ns_has_pi(ns->head)) {
+				pr_debug_ratelimited("nvme: %s: metadata (ms=%u) without PI or integrity request, returning NOTSUPP\n",
+						     ns->disk->disk_name,
+						     ns->head->ms);
 				return BLK_STS_NOTSUPP;
+			}
 			control |= NVME_RW_PRINFO_PRACT;
 			nvme_set_ref_tag(ns, cmnd, req);
 		}
-- 
2.43.0




* [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
  2026-04-27  0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
@ 2026-04-27  0:34 ` Chao Shi
  2026-05-07  5:49   ` Christoph Hellwig
  2026-05-07  5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
  2026-05-07 18:12 ` Keith Busch
  2 siblings, 1 reply; 6+ messages in thread
From: Chao Shi @ 2026-04-27  0:34 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
	Dave Tian, Weidong Zhu

This patch is an alternative to patch 1/2: instead of downgrading the
assertion in nvme_setup_rw(), it addresses the root cause at the
integrity-profile level so that the assertion is never reached.

For PCIe namespaces with extended LBAs (NVME_NS_EXT_LBAS set, flbas
bit 4) but without PI and without NVME_NS_METADATA_SUPPORTED, the early-
exit branch of nvme_init_integrity() at core.c:1834 returns false
without populating bi->metadata_size.  As a result blk_get_integrity()
returns NULL (it checks q->limits.integrity.metadata_size via
blk_integrity_queue_supports_integrity()), bio_integrity_action() returns
0, bio_integrity_prep() is never called, and REQ_INTEGRITY is never set
on bios dispatched to the namespace.  Any such bio that reaches
nvme_setup_rw() triggers WARN_ON_ONCE because head->ms != 0 but
blk_integrity_rq() returns false.
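
The chain bottoms out in blk_get_integrity(); a sketch of the
predicate consistent with the description above (metadata_size naming
per this tree):

  static inline struct blk_integrity *blk_get_integrity(struct gendisk *disk)
  {
          struct blk_integrity *bi = &disk->queue->limits.integrity;

          /* checks metadata_size != 0; 0 on the early-exit path */
          if (!blk_integrity_queue_supports_integrity(disk->queue))
                  return NULL;
          return bi;
  }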

Populate bi->metadata_size = head->ms in the early-exit path for the
EXT_LBAS non-PI case.  This is sufficient to make blk_get_integrity()
return non-NULL, so bio_integrity_action() returns non-zero,
bio_integrity_prep() runs, and REQ_INTEGRITY is set on any bio
submitted to the namespace.  Requests that reach nvme_setup_rw() then
satisfy blk_integrity_rq() and the assertion is never reached.

blk_validate_integrity_limits() accepts this configuration: with
csum_type=BLK_INTEGRITY_CSUM_NONE, pi_tuple_size=0, and pi_offset=0,
all checks pass (pi_offset + pi_tuple_size <= metadata_size, pi_tuple_size
must be 0 for CSUM_NONE), and interval_exp is auto-filled to
ilog2(logical_block_size).  No generate/verify callbacks are configured,
so no actual integrity computation occurs; only the blk_integrity_rq()
predicate is satisfied.  Capacity is still forced to 0 by
set_capacity_and_notify(), so new bios are rejected by bio_check_eod()
before queue entry.
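
Spelled out, the relevant checks reduce to the following for this
configuration (a sketch of the argument, not the verbatim function):

  struct blk_integrity *bi = &lim->integrity;

  /* here: csum_type == BLK_INTEGRITY_CSUM_NONE, pi_tuple_size == 0,
   * pi_offset == 0, metadata_size == head->ms
   */
  if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE && bi->pi_tuple_size)
          return -EINVAL;                /* passes: pi_tuple_size is 0 */
  if (bi->pi_offset + bi->pi_tuple_size > bi->metadata_size)
          return -EINVAL;                /* passes: 0 + 0 <= head->ms */
  if (!bi->interval_exp)
          bi->interval_exp = ilog2(lim->logical_block_size); /* auto-fill */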

Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_SEMANTIC_DATA_MUTATOR=1; ran 4
concurrent dd processes plus 500 rescan_controller cycles with no WARN,
BUG, or Oops.  The EXT_LBAS + ms!=0 + !PI combination was not triggered
during testing (FEMU's mutator varies flbas and lbaf[0].ms independently;
flbas=0x10 with lbaf_idx=0 was not produced in this run).  The
bi->metadata_size assignment path was not exercised in testing;
correctness of blk_validate_integrity_limits() for this configuration
was verified by code inspection.  Provided as RFC.

Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
 drivers/nvme/host/core.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4e20c8f08e4..76fb788024f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1836,8 +1836,29 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
 	 * insert/strip it, which is not possible for other kinds of metadata.
 	 */
 	if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ||
-	    !(head->features & NVME_NS_METADATA_SUPPORTED))
-		return nvme_ns_has_pi(head);
+	    !(head->features & NVME_NS_METADATA_SUPPORTED)) {
+		bool has_pi = nvme_ns_has_pi(head);
+
+		/*
+		 * For PCIe EXT_LBAS non-PI namespaces the block layer sets
+		 * capacity to 0 (we return false) to prevent block I/O, but a
+		 * cached-rq bio may bypass bio_queue_enter freeze serialisation
+		 * and reach nvme_setup_rw() with head->ms != 0 and no
+		 * REQ_INTEGRITY set.  Populate bi->metadata_size so that
+		 * bio_integrity_action() returns non-zero and bio_integrity_prep()
+		 * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
+		 * at nvme_setup_rw() (addressed by patch 1/2).
+		 *
+		 * NOTE: only metadata_size is populated; no csum or PI profile is
+		 * configured.  Actual data integrity for EXT_LBAS non-PI workloads
+		 * is untested; this patch is RFC for direction discussion.
+		 */
+		if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
+		    (head->features & NVME_NS_EXT_LBAS) &&
+		    head->ms && !has_pi)
+			bi->metadata_size = head->ms;
+		return has_pi;
+	}
 
 	switch (head->pi_type) {
 	case NVME_NS_DPS_PI_TYPE3:
-- 
2.43.0




* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
  2026-04-27  0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
  2026-04-27  0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
@ 2026-05-07  5:48 ` Christoph Hellwig
  2026-05-07 18:12 ` Keith Busch
  2 siblings, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2026-05-07  5:48 UTC (permalink / raw)
  To: Chao Shi
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, Sungwoo Kim,
	Dave Tian, Weidong Zhu

On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> When an NVMe namespace is configured with embedded metadata (flbas bit 4
> set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
> no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on
> any request that reaches it with REQ_INTEGRITY unset.  The WARN was
> observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
> that performs semantic mutation of Identify Namespace responses.

What is "semantic mutation of Identify Namespace responses" supposed to
mean?

> In both cases the bio was submitted without REQ_INTEGRITY (because
> blk_get_integrity() returned NULL at dispatch time, so
> bio_integrity_action() returned 0 and bio_integrity_prep() was not
> called), and it reaches nvme_setup_rw() for a namespace where
> head->ms != 0.  The existing BLK_STS_NOTSUPP return correctly handles
> this dispatch; the WARN_ON_ONCE is a false positive.

That means we fail to properly freeze and quiesce the queue over
updates, which has much worse results than just a WARN_ON.  So if we
care about this rather theoretical case we'll need to fix that.




* Re: [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
  2026-04-27  0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
@ 2026-05-07  5:49   ` Christoph Hellwig
  2026-05-07  8:05     ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2026-05-07  5:49 UTC (permalink / raw)
  To: Chao Shi
  Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, Sungwoo Kim,
	Dave Tian, Weidong Zhu

On Sun, Apr 26, 2026 at 08:34:57PM -0400, Chao Shi wrote:
> +		/*
> +		 * For PCIe EXT_LBAS non-PI namespaces the block layer sets
> +		 * capacity to 0 (we return false) to prevent block I/O, but a
> +		 * cached-rq bio may bypass bio_queue_enter freeze serialisation
> +		 * and reach nvme_setup_rw() with head->ms != 0 and no
> +		 * REQ_INTEGRITY set.  Populate bi->metadata_size so that
> +		 * bio_integrity_action() returns non-zero and bio_integrity_prep()
> +		 * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
> +		 * at nvme_setup_rw() (addressed by patch 1/2).

This sounds like the bug Keith is trying to fix in the block layer
("blk-mq: check for stale cached request in blk_mq_submit_bio")?




* Re: [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
  2026-05-07  5:49   ` Christoph Hellwig
@ 2026-05-07  8:05     ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2026-05-07  8:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chao Shi, linux-nvme, linux-block, sagi, axboe, Sungwoo Kim,
	Dave Tian, Weidong Zhu

On Thu, May 07, 2026 at 07:49:44AM +0200, Christoph Hellwig wrote:
> On Sun, Apr 26, 2026 at 08:34:57PM -0400, Chao Shi wrote:
> > +		/*
> > +		 * For PCIe EXT_LBAS non-PI namespaces the block layer sets
> > +		 * capacity to 0 (we return false) to prevent block I/O, but a
> > +		 * cached-rq bio may bypass bio_queue_enter freeze serialisation
> > +		 * and reach nvme_setup_rw() with head->ms != 0 and no
> > +		 * REQ_INTEGRITY set.  Populate bi->metadata_size so that
> > +		 * bio_integrity_action() returns non-zero and bio_integrity_prep()
> > +		 * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
> > +		 * at nvme_setup_rw() (addressed by patch 1/2).
> 
> This sounds like the Bug Keith is trying to fix in the block layer
> ("blk-mq: check for stale cached request in blk_mq_submit_bio") ?

Both issues deal with cached-request corner cases, but they're not
really related. My fix is specifically about the point where those
requests are freed, while this one is just racing with them.

For this patch's issue, how is the host being triggered to rescan? If
you're sending a "Format NVM" command, the driver would have frozen the
queues first, which would have waited for any cached requests to flush
out and prevent new ones from being allocated until after the format has
been updated in the driver.

It's possible to format the namespace from a different controller the
host doesn't know about, so we've always had a race where the actual
format is different than what the host knows about. The rescan would
have to be triggered some other way in that case (either through AEN or
manual sysfs/ioctl trigger).

We always ensure the queue is frozen when we update the queue limits in
this path too, so the driver and block layer should always be in sync
even if it's not in sync with the device. That in itself can be pretty
nasty, but we'd need a new NVMe TP to define a way to fix that. But
specifically on the driver warning here, I'm not sure how it is getting
triggered due to the existing freeze semantics around the queue limits
update.
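
For reference, the update path pairs the freeze with the limits
commit, roughly like this (a sketch, not the exact
nvme_update_ns_info_block() call sites):

  lim = queue_limits_start_update(ns->disk->queue);
  /* ... recompute lim from the new Identify Namespace data ... */
  blk_mq_freeze_queue(ns->disk->queue);
  ret = queue_limits_commit_update(ns->disk->queue, &lim);
  blk_mq_unfreeze_queue(ns->disk->queue);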



* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
  2026-04-27  0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
  2026-04-27  0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
  2026-05-07  5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
@ 2026-05-07 18:12 ` Keith Busch
  2 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2026-05-07 18:12 UTC (permalink / raw)
  To: Chao Shi
  Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
	Weidong Zhu

On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> In both cases the bio was submitted without REQ_INTEGRITY (because
> blk_get_integrity() returned NULL at dispatch time, so
> bio_integrity_action() returned 0 and bio_integrity_prep() was not
> called), and it reaches nvme_setup_rw() for a namespace where
> head->ms != 0.  The existing BLK_STS_NOTSUPP return correctly handles
> this dispatch; the WARN_ON_ONCE is a false positive.

This is what I'm not really following. The cached request holds a
reference on the queue that prevents the queue freeze from proceeding.
This driver freezes the queue along with the queue limits update. As I
mentioned in the other patch, that was supposed to ensure the block
layer had the updated limits before it could allocate a request, so I
think we need to understand how that was defeated to get to a real
solution.
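
The serialisation that should hold here, as a sketch:

  /*
   *   T0: request allocated (plug cache) -> takes a q_usage_counter ref
   *   T1: blk_mq_freeze_queue(q)         -> waits for all refs to drain
   *   T0: cached request issued or freed -> ref dropped, freeze completes
   *   T1: limits updated, queue unfrozen
   *   T0: the next allocation re-enters the queue and sees the update
   */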

