* [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
@ 2026-04-27 0:34 Chao Shi
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
0 siblings, 1 reply; 2+ messages in thread
From: Chao Shi @ 2026-04-27 0:34 UTC (permalink / raw)
To: linux-nvme
Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
Dave Tian, Weidong Zhu
When an NVMe namespace is configured with embedded metadata (flbas bit 4
set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires its WARN_ON_ONCE
for any request that reaches it with REQ_INTEGRITY unset. The WARN was
observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
that performs semantic mutation of Identify Namespace responses.
The trigger requires three conditions to align: (a) a namespace
transitions through the EXT_LBAS non-PI state (head->ms != 0,
features & NVME_NS_EXT_LBAS, !(features & NVME_NS_METADATA_SUPPORTED)),
(b) nvme_init_integrity() returns false through the early-exit branch
at core.c:1834 without populating bi->metadata_size, leaving the disk
without an integrity profile (blk_get_integrity() returns NULL), and
(c) a request that was admitted to the block layer before the namespace
update reaches nvme_setup_rw() after it.
The admission gap arises in two places. First, the plug-list flush
path: a process with dirty pages queued in a plug before the namespace
update flushes them on file close (blk_finish_plug ->
blk_mq_flush_plug_list -> blk_mq_dispatch_rq_list -> nvme_setup_rw),
bypassing any capacity-zero gate. Second, the
cached-rq path: blk_mq_submit_bio() at blk-mq.c:3155 may find a cached
request; if so, the bio_queue_enter() freeze-serialization guard at
blk-mq.c:3174-3176 is skipped and the bio is dispatched immediately.
In both cases the bio was submitted without REQ_INTEGRITY (because
blk_get_integrity() returned NULL at dispatch time, so
bio_integrity_action() returned 0 and bio_integrity_prep() was not
called), and it reaches nvme_setup_rw() for a namespace where
head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
this dispatch; the WARN_ON_ONCE is a false positive.
The WARN was reproduced six times over four days of fuzzing (April
2026). A representative crash shows the plug-flush path:
nvme0n1: detected capacity change from 2097152 to 0
WARNING: drivers/nvme/host/core.c:1042 at nvme_setup_rw+0x768/0xfd0
PID: 785 (systemd-udevd)
Call Trace:
 nvme_setup_cmd
 nvme_queue_rq
 blk_mq_dispatch_rq_list
 blk_mq_flush_plug_list
 blk_finish_plug
 blkdev_writepages
 sync_blockdev
 bdev_release
 __fput
 sys_close
Replace WARN_ON_ONCE with pr_debug_ratelimited so the condition is
logged at debug level without a splat. The BLK_STS_NOTSUPP return is
preserved; I/O to the transitioning namespace is still rejected.
An alternative approach that addresses the root cause at the
integrity-profile level is proposed in patch 2/2: populate
bi->metadata_size for EXT_LBAS non-PI namespaces in nvme_init_integrity()
so that bio_integrity_action() returns non-zero, bio_integrity_prep()
sets REQ_INTEGRITY, and nvme_setup_rw() never reaches this branch.
Both patches are sent as RFC for maintainer guidance on the preferred
direction.
Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_MALICIOUS_RESPONDER=1
NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500
rescan_controller cycles. No WARN, BUG, or Oops observed.
Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
drivers/nvme/host/core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d1711ef59fb..4e20c8f08e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1039,8 +1039,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
* namespace capacity to zero to prevent any I/O.
*/
if (!blk_integrity_rq(req)) {
- if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
+ if (!nvme_ns_has_pi(ns->head)) {
+ pr_debug_ratelimited("nvme: %s: metadata (ms=%u) without PI or integrity request, returning NOTSUPP\n",
+ ns->disk->disk_name,
+ ns->head->ms);
return BLK_STS_NOTSUPP;
+ }
control |= NVME_RW_PRINFO_PRACT;
nvme_set_ref_tag(ns, cmnd, req);
}
--
2.43.0
* [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
2026-04-27 0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
@ 2026-04-27 0:34 ` Chao Shi
0 siblings, 0 replies; 2+ messages in thread
From: Chao Shi @ 2026-04-27 0:34 UTC (permalink / raw)
To: linux-nvme
Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
Dave Tian, Weidong Zhu
This patch is an alternative to patch 1/2: instead of downgrading the
assertion in nvme_setup_rw(), it addresses the root cause at the
integrity-profile level so that the assertion is never reached.
For PCIe namespaces with extended LBAs (NVME_NS_EXT_LBAS set, flbas
bit 4) but without PI and without NVME_NS_METADATA_SUPPORTED, the early-
exit branch of nvme_init_integrity() at core.c:1834 returns false
without populating bi->metadata_size. As a result blk_get_integrity()
returns NULL (it checks q->limits.integrity.metadata_size via
blk_integrity_queue_supports_integrity()), bio_integrity_action() returns
0, bio_integrity_prep() is never called, and REQ_INTEGRITY is never set
on bios dispatched to the namespace. Any such bio that reaches
nvme_setup_rw() triggers WARN_ON_ONCE because head->ms != 0 but
blk_integrity_rq() returns false.
Populate bi->metadata_size = head->ms in the early-exit path for the
EXT_LBAS non-PI case. This is sufficient to make blk_get_integrity()
return non-NULL, which causes bio_integrity_action() to return non-zero,
which causes bio_integrity_prep() to run and set REQ_INTEGRITY on any
bio submitted to the namespace. Requests that reach nvme_setup_rw()
then satisfy blk_integrity_rq() and the assertion is not reached.
blk_validate_integrity_limits() accepts this configuration: with
csum_type=BLK_INTEGRITY_CSUM_NONE, pi_tuple_size=0, and pi_offset=0,
all checks pass (pi_offset + pi_tuple_size <= metadata_size, pi_tuple_size
must be 0 for CSUM_NONE), and interval_exp is auto-filled to
ilog2(logical_block_size). No generate/verify callbacks are configured,
so no actual integrity computation occurs; only the blk_integrity_rq()
predicate is satisfied. Capacity is still forced to 0 by
set_capacity_and_notify(), so new bios are rejected by bio_check_eod()
before queue entry.
Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_SEMANTIC_DATA_MUTATOR=1; ran 4
concurrent dd processes plus 500 rescan_controller cycles with no WARN,
BUG, or Oops. The EXT_LBAS + ms!=0 + !PI combination was not triggered
during testing (FEMU's mutator varies flbas and lbaf[0].ms independently;
flbas=0x10 with lbaf_idx=0 was not produced in this run). The
bi->metadata_size assignment path was not exercised in testing;
correctness of blk_validate_integrity_limits() for this configuration
was verified by code inspection. Provided as RFC.
Found by FuzzNvme (Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
drivers/nvme/host/core.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4e20c8f08e4..76fb788024f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1836,8 +1836,29 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
* insert/strip it, which is not possible for other kinds of metadata.
*/
if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ||
- !(head->features & NVME_NS_METADATA_SUPPORTED))
- return nvme_ns_has_pi(head);
+ !(head->features & NVME_NS_METADATA_SUPPORTED)) {
+ bool has_pi = nvme_ns_has_pi(head);
+
+ /*
+ * For PCIe EXT_LBAS non-PI namespaces the block layer sets
+ * capacity to 0 (we return false) to prevent block I/O, but a
+ * cached-rq bio may bypass bio_queue_enter freeze serialization
+ * and reach nvme_setup_rw() with head->ms != 0 and no
+ * REQ_INTEGRITY set. Populate bi->metadata_size so that
+ * bio_integrity_action() returns non-zero and bio_integrity_prep()
+ * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
+ * at nvme_setup_rw() (addressed by patch 1/2).
+ *
+ * NOTE: only metadata_size is populated; no csum or PI profile is
+ * configured. Actual data integrity for EXT_LBAS non-PI workloads
+ * is untested; this patch is RFC for direction discussion.
+ */
+ if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
+ (head->features & NVME_NS_EXT_LBAS) &&
+ head->ms && !has_pi)
+ bi->metadata_size = head->ms;
+ return has_pi;
+ }
switch (head->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
--
2.43.0