* [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
@ 2026-04-27 0:34 Chao Shi
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Chao Shi @ 2026-04-27 0:34 UTC (permalink / raw)
To: linux-nvme
Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
Dave Tian, Weidong Zhu
When an NVMe namespace is configured with embedded metadata (flbas bit 4
set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on
any request that reaches it with REQ_INTEGRITY unset. The WARN was
observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
that performs semantic mutation of Identify Namespace responses.
The trigger requires three conditions to align: (a) a namespace
transitions through the EXT_LBAS non-PI state (head->ms != 0,
features & NVME_NS_EXT_LBAS, !(features & NVME_NS_METADATA_SUPPORTED)),
(b) nvme_init_integrity() returns false through the early-exit branch
at core.c:1834 without populating bi->metadata_size, leaving the disk
without an integrity profile (blk_get_integrity() returns NULL), and
(c) a request that was admitted to the block layer before the namespace
update reaches nvme_setup_rw() after it.
The admission gap arises in two places. First, the plug-list flush
path: a process with dirty pages queued in a plug before the namespace
update flushes them on file close (blk_finish_plug -> blk_mq_dispatch
-> nvme_setup_rw), bypassing any capacity-zero gate. Second, the
cached-rq path: blk_mq_submit_bio() at blk-mq.c:3155 may find a cached
request; if so, the bio_queue_enter() freeze-serialization guard at
blk-mq.c:3174-3176 is skipped and the bio is dispatched immediately.
In both cases the bio was submitted without REQ_INTEGRITY (because
blk_get_integrity() returned NULL at dispatch time, so
bio_integrity_action() returned 0 and bio_integrity_prep() was not
called), and it reaches nvme_setup_rw() for a namespace where
head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
this dispatch; the WARN_ON_ONCE is a false positive.
The WARN was reproduced six times over four days of fuzzing (April
2026). A representative crash shows the plug-flush path:
nvme0n1: detected capacity change from 2097152 to 0
WARNING: drivers/nvme/host/core.c:1042 at nvme_setup_rw+0x768/0xfd0
PID: 785 (systemd-udevd)
Call Trace:
nvme_setup_cmd / nvme_queue_rq / blk_mq_dispatch_rq_list
blk_mq_flush_plug_list / blk_finish_plug / blkdev_writepages
sync_blockdev / bdev_release / __fput / sys_close
Replace WARN_ON_ONCE with pr_debug_ratelimited so the condition is
logged at debug level without splat. The BLK_STS_NOTSUPP return is
preserved; I/O to the transitioning namespace is still rejected.
An alternative approach that addresses the root cause at the
integrity-profile level is proposed in patch 2/2: populate
bi->metadata_size for EXT_LBAS non-PI namespaces in nvme_init_integrity()
so that bio_integrity_action() returns non-zero, bio_integrity_prep()
sets REQ_INTEGRITY, and nvme_setup_rw() never reaches this branch.
Both patches are sent as RFC for maintainer guidance on the preferred
direction.
Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_MALICIOUS_RESPONDER=1
NVME_SEMANTIC_DATA_MUTATOR=1; ran 4 concurrent dd processes plus 500
rescan_controller cycles. No WARN, BUG, or Oops observed.
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
drivers/nvme/host/core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d1711ef59fb..4e20c8f08e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1039,8 +1039,12 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
* namespace capacity to zero to prevent any I/O.
*/
if (!blk_integrity_rq(req)) {
- if (WARN_ON_ONCE(!nvme_ns_has_pi(ns->head)))
+ if (!nvme_ns_has_pi(ns->head)) {
+ pr_debug_ratelimited("nvme: %s: metadata (ms=%u) without PI or integrity request, returning NOTSUPP\n",
+ ns->disk->disk_name,
+ ns->head->ms);
return BLK_STS_NOTSUPP;
+ }
control |= NVME_RW_PRINFO_PRACT;
nvme_set_ref_tag(ns, cmnd, req);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
2026-04-27 0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
@ 2026-04-27 0:34 ` Chao Shi
2026-05-07 5:49 ` Christoph Hellwig
2026-05-07 5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
2026-05-07 18:12 ` Keith Busch
2 siblings, 1 reply; 12+ messages in thread
From: Chao Shi @ 2026-04-27 0:34 UTC (permalink / raw)
To: linux-nvme
Cc: linux-block, hch, kbusch, sagi, axboe, Chao Shi, Sungwoo Kim,
Dave Tian, Weidong Zhu
This patch is an alternative to patch 1/2: instead of downgrading the
assertion in nvme_setup_rw(), it addresses the root cause at the
integrity-profile level so that the assertion is never reached.
For PCIe namespaces with extended LBAs (NVME_NS_EXT_LBAS set, flbas
bit 4) but without PI and without NVME_NS_METADATA_SUPPORTED, the early-
exit branch of nvme_init_integrity() at core.c:1834 returns false
without populating bi->metadata_size. As a result blk_get_integrity()
returns NULL (it checks q->limits.integrity.metadata_size via
blk_integrity_queue_supports_integrity()), bio_integrity_action() returns
0, bio_integrity_prep() is never called, and REQ_INTEGRITY is never set
on bios dispatched to the namespace. Any such bio that reaches
nvme_setup_rw() triggers WARN_ON_ONCE because head->ms != 0 but
blk_integrity_rq() returns false.
Populate bi->metadata_size = head->ms in the early-exit path for the
EXT_LBAS non-PI case. This is sufficient to make blk_get_integrity()
return non-NULL, which causes bio_integrity_action() to return non-zero,
which causes bio_integrity_prep() to run and set REQ_INTEGRITY on any
bio submitted to the namespace. Requests that reach nvme_setup_rw()
then satisfy blk_integrity_rq() and the assertion is not reached.
blk_validate_integrity_limits() accepts this configuration: with
csum_type=BLK_INTEGRITY_CSUM_NONE, pi_tuple_size=0, and pi_offset=0,
all checks pass (pi_offset + pi_tuple_size <= metadata_size, pi_tuple_size
must be 0 for CSUM_NONE), and interval_exp is auto-filled to
ilog2(logical_block_size). No generate/verify callbacks are configured,
so no actual integrity computation occurs; only the blk_integrity_rq()
predicate is satisfied. Capacity is still forced to 0 by
set_capacity_and_notify(), so new bios are rejected by bio_check_eod()
before queue entry.
Tested: Compiled on linux-kcov-debug (6.19.0+, KASAN/DEBUG_LIST).
Boot-tested under FEMU with NVME_SEMANTIC_DATA_MUTATOR=1; ran 4
concurrent dd processes plus 500 rescan_controller cycles with no WARN,
BUG, or Oops. The EXT_LBAS + ms!=0 + !PI combination was not triggered
during testing (FEMU's mutator varies flbas and lbaf[0].ms independently;
flbas=0x10 with lbaf_idx=0 was not produced in this run). The
bi->metadata_size assignment path was not exercised in testing;
correctness of blk_validate_integrity_limits() for this configuration
was verified by code inspection. Provided as RFC.
Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---
drivers/nvme/host/core.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4e20c8f08e4..76fb788024f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1836,8 +1836,29 @@ static bool nvme_init_integrity(struct nvme_ns_head *head,
* insert/strip it, which is not possible for other kinds of metadata.
*/
if (!IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) ||
- !(head->features & NVME_NS_METADATA_SUPPORTED))
- return nvme_ns_has_pi(head);
+ !(head->features & NVME_NS_METADATA_SUPPORTED)) {
+ bool has_pi = nvme_ns_has_pi(head);
+
+ /*
+ * For PCIe EXT_LBAS non-PI namespaces the block layer sets
+ * capacity to 0 (we return false) to prevent block I/O, but a
+ * cached-rq bio may bypass bio_queue_enter freeze serialisation
+ * and reach nvme_setup_rw() with head->ms != 0 and no
+ * REQ_INTEGRITY set. Populate bi->metadata_size so that
+ * bio_integrity_action() returns non-zero and bio_integrity_prep()
+ * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
+ * at nvme_setup_rw() (addressed by patch 1/2).
+ *
+ * NOTE: only metadata_size is populated; no csum or PI profile is
+ * configured. Actual data integrity for EXT_LBAS non-PI workloads
+ * is untested; this patch is RFC for direction discussion.
+ */
+ if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) &&
+ (head->features & NVME_NS_EXT_LBAS) &&
+ head->ms && !has_pi)
+ bi->metadata_size = head->ms;
+ return has_pi;
+ }
switch (head->pi_type) {
case NVME_NS_DPS_PI_TYPE3:
--
2.43.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-04-27 0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
@ 2026-05-07 5:48 ` Christoph Hellwig
2026-05-17 3:54 ` Chao S
2026-05-07 18:12 ` Keith Busch
2 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2026-05-07 5:48 UTC (permalink / raw)
To: Chao Shi
Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> When an NVMe namespace is configured with embedded metadata (flbas bit 4
> set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
> no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on
> any request that reaches it with REQ_INTEGRITY unset. The WARN was
> observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
> that performs semantic mutation of Identify Namespace responses.
What is "semantic mutation of Identify Namespace responses" supposed to
mean?
> In both cases the bio was submitted without REQ_INTEGRITY (because
> blk_get_integrity() returned NULL at dispatch time, so
> bio_integrity_action() returned 0 and bio_integrity_prep() was not
> called), and it reaches nvme_setup_rw() for a namespace where
> head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
> this dispatch; the WARN_ON_ONCE is a false positive.
That means we fail to properaly freeze and quiesce the queue over
updateѕm which has much worse results than just a WARN_ON. So if we
care about this rather theoretical case we'll need to fix that.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
@ 2026-05-07 5:49 ` Christoph Hellwig
2026-05-07 8:05 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2026-05-07 5:49 UTC (permalink / raw)
To: Chao Shi
Cc: linux-nvme, linux-block, hch, kbusch, sagi, axboe, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Sun, Apr 26, 2026 at 08:34:57PM -0400, Chao Shi wrote:
> + /*
> + * For PCIe EXT_LBAS non-PI namespaces the block layer sets
> + * capacity to 0 (we return false) to prevent block I/O, but a
> + * cached-rq bio may bypass bio_queue_enter freeze serialisation
> + * and reach nvme_setup_rw() with head->ms != 0 and no
> + * REQ_INTEGRITY set. Populate bi->metadata_size so that
> + * bio_integrity_action() returns non-zero and bio_integrity_prep()
> + * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
> + * at nvme_setup_rw() (addressed by patch 1/2).
This sounds like the Bug Keith is trying to fix in the block layer
("blk-mq: check for stale cached request in blk_mq_submit_bio") ?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace
2026-05-07 5:49 ` Christoph Hellwig
@ 2026-05-07 8:05 ` Keith Busch
0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2026-05-07 8:05 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Chao Shi, linux-nvme, linux-block, sagi, axboe, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Thu, May 07, 2026 at 07:49:44AM +0200, Christoph Hellwig wrote:
> On Sun, Apr 26, 2026 at 08:34:57PM -0400, Chao Shi wrote:
> > + /*
> > + * For PCIe EXT_LBAS non-PI namespaces the block layer sets
> > + * capacity to 0 (we return false) to prevent block I/O, but a
> > + * cached-rq bio may bypass bio_queue_enter freeze serialisation
> > + * and reach nvme_setup_rw() with head->ms != 0 and no
> > + * REQ_INTEGRITY set. Populate bi->metadata_size so that
> > + * bio_integrity_action() returns non-zero and bio_integrity_prep()
> > + * sets REQ_INTEGRITY on any such bio, preventing the WARN_ON_ONCE
> > + * at nvme_setup_rw() (addressed by patch 1/2).
>
> This sounds like the Bug Keith is trying to fix in the block layer
> ("blk-mq: check for stale cached request in blk_mq_submit_bio") ?
Both issues are dealing with cached request corner cases, but they're
not really related. My bug fix is specifically when those requests are
freed, while this one is just racing with them.
For this patch's issue, how is the host being triggered to rescan? If
you're sending a "Format NVM" command, the driver would have frozen the
queues first, which would have waited for any cached requests to flush
out and prevent new ones from being allocated until after the format has
been updated in the driver.
It's possible to format the namespace from a different controller the
host doesn't know about, so we've always had a race where the actual
format is different than what the host knows about. The rescan would
have to be triggered some other way in that case (either through AEN or
manual sysfs/ioctl trigger).
We always ensure the queue is frozen when we update the queue limits in
this path too, so the driver and block layer should always be in sync
even if it's not in sync with the device. That in itself can be pretty
nasty, but we'd need a new NVMe TP to define a way to fix that. But
specifically on the driver warning here, I'm not sure how it is getting
triggered due to the existing freeze semantics around the queue limits
update.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-04-27 0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
2026-05-07 5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
@ 2026-05-07 18:12 ` Keith Busch
2026-05-17 3:53 ` Chao S
2 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-05-07 18:12 UTC (permalink / raw)
To: Chao Shi
Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> In both cases the bio was submitted without REQ_INTEGRITY (because
> blk_get_integrity() returned NULL at dispatch time, so
> bio_integrity_action() returned 0 and bio_integrity_prep() was not
> called), and it reaches nvme_setup_rw() for a namespace where
> head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
> this dispatch; the WARN_ON_ONCE is a false positive.
This is what I'm not really following. The cached request holds a
reference on the queue that prevents the queue freeze from proceeding.
This driver freezes the queue along with the queue limits update. As I
mentioned in the other patch, that was supposed to ensure the block
layer had the updated limits before it could allocate a request, so I
think we need to understand how that was defeated to get to a real
solution.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-07 18:12 ` Keith Busch
@ 2026-05-17 3:53 ` Chao S
2026-05-17 22:05 ` Keith Busch
2026-05-18 22:41 ` Keith Busch
0 siblings, 2 replies; 12+ messages in thread
From: Chao S @ 2026-05-17 3:53 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Thu, May 07, 2026 at 07:12:26PM +0100, Keith Busch wrote:
> [...] how that was defeated [...]
Hi Keith,
Not the freeze. The WARN does not depend on q->limits, but on
ns->head->ms (read live at dispatch, set inside the freeze window) and
on REQ_INTEGRITY, never set for EXT_LBAS-non-PI. capacity==0 only
gates submission (bio_check_eod()), not dispatch: a writeback bio that
passed bio_check_eod() under the old capacity sits on the task plug
holding no q_usage_counter ref, so it does not block the freeze;
blk_finish_plug() flushes it after the update committed head->ms != 0
(dmesg: the capacity-change line prints before the WARN).
So it is reachable -- the host-unaware geometry change you described,
unrelated to your block fix. The deeper fencing gap is the separate
TP-level issue; v2 does not attempt it, it only stops a
device-reachable, already-safely-rejected dispatch from being a WARN
(a panic under panic_on_warn).
Thanks,
Chao
On Thu, May 7, 2026 at 2:12 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> > In both cases the bio was submitted without REQ_INTEGRITY (because
> > blk_get_integrity() returned NULL at dispatch time, so
> > bio_integrity_action() returned 0 and bio_integrity_prep() was not
> > called), and it reaches nvme_setup_rw() for a namespace where
> > head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
> > this dispatch; the WARN_ON_ONCE is a false positive.
>
> This is what I'm not really following. The cached request holds a
> reference on the queue that prevents the queue freeze from proceeding.
> This driver freezes the queue along with the queue limits update. As I
> mentioned in the other patch, that was supposed to ensure the block
> layer had the updated limits before it could allocate a request, so I
> think we need to understand how that was defeated to get to a real
> solution.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-07 5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
@ 2026-05-17 3:54 ` Chao S
2026-05-18 5:56 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Chao S @ 2026-05-17 3:54 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-nvme, linux-block, kbusch, sagi, axboe, Sungwoo Kim,
Dave Tian, Weidong Zhu
On Thu, May 07, 2026 at 07:48:09AM +0200, Christoph Hellwig wrote:
> What is "semantic mutation of Identify Namespace responses" [...]
The emulated controller returns a mutated Identify Namespace on
rescan (flbas META_EXT, lbaf.ms != 0, dps == 0) -- the host-unaware
geometry change Keith described (out-of-band format on a shared
namespace, or a non-compliant device), learned via rescan/AEN, not
the host's own Format NVM.
> [...] rather theoretical case [...]
Agreed v1 was wrong (1/2 hid the signal, 2/2 unrelated; 2/2 dropped).
Not fuzzer-only though: the out-of-band-format path Keith acknowledged
is real, and the branch is reachable from device data, so under
panic_on_warn this is a device-triggerable panic. The deeper fencing
fix is the bigger TP-level work; v2 does not attempt it -- it only
adds the missing nvme_setup_rw() case for an unusable (metadata, no
PI, no profile) namespace whose I/O is already rejected, replacing the
assertion with one dev_warn_once(). Worth the bigger fix, or leave it
documented?
Thanks,
Chao
On Thu, May 7, 2026 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Sun, Apr 26, 2026 at 08:34:56PM -0400, Chao Shi wrote:
> > When an NVMe namespace is configured with embedded metadata (flbas bit 4
> > set, NVME_NS_FLBAS_META_EXT) but no Protection Information (dps=0) and
> > no NVME_NS_METADATA_SUPPORTED, nvme_setup_rw() fires WARN_ON_ONCE on
> > any request that reaches it with REQ_INTEGRITY unset. The WARN was
> > observed repeatedly during NVMe fuzz testing with a FEMU-based fuzzer
> > that performs semantic mutation of Identify Namespace responses.
>
> What is "semantic mutation of Identify Namespace responses" supposed to
> mean?
>
> > In both cases the bio was submitted without REQ_INTEGRITY (because
> > blk_get_integrity() returned NULL at dispatch time, so
> > bio_integrity_action() returned 0 and bio_integrity_prep() was not
> > called), and it reaches nvme_setup_rw() for a namespace where
> > head->ms != 0. The existing BLK_STS_NOTSUPP return correctly handles
> > this dispatch; the WARN_ON_ONCE is a false positive.
>
> That means we fail to properaly freeze and quiesce the queue over
> updateѕm which has much worse results than just a WARN_ON. So if we
> care about this rather theoretical case we'll need to fix that.
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-17 3:53 ` Chao S
@ 2026-05-17 22:05 ` Keith Busch
2026-05-17 22:42 ` Keith Busch
2026-05-18 22:41 ` Keith Busch
1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-05-17 22:05 UTC (permalink / raw)
To: Chao S
Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Sat, May 16, 2026 at 11:53:54PM -0400, Chao S wrote:
> On Thu, May 07, 2026 at 07:12:26PM +0100, Keith Busch wrote:
> > [...] how that was defeated [...]
>
> Hi Keith,
>
> Not the freeze. The WARN does not depend on q->limits, but on
> ns->head->ms (read live at dispatch, set inside the freeze window) and
> on REQ_INTEGRITY, never set for EXT_LBAS-non-PI. capacity==0 only
> gates submission (bio_check_eod()), not dispatch: a writeback bio that
> passed bio_check_eod() under the old capacity sits on the task plug
> holding no q_usage_counter ref, so it does not block the freeze;
> blk_finish_plug() flushes it after the update committed head->ms != 0
> (dmesg: the capacity-change line prints before the WARN).
>
> So it is reachable -- the host-unaware geometry change you described,
> unrelated to your block fix. The deeper fencing gap is the separate
> TP-level issue; v2 does not attempt it, it only stops a
> device-reachable, already-safely-rejected dispatch from being a WARN
> (a panic under panic_on_warn).
I think tHe WARN is serving it's intendeded purpose: the block layer
shouldn't have submitted this request. You can't do generic read/write
with extended metadatate as the DMA is going to corrupt memory with
respect to what the block layer expects.
This driver is depending on the capacity constraint to prevent this
scenario, so I think The "end-of-device" check needs to happen within
the entered queue context. If there's a scenario that escapes that
check, then I think that's what needs fixing, not the driver.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-17 22:05 ` Keith Busch
@ 2026-05-17 22:42 ` Keith Busch
0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2026-05-17 22:42 UTC (permalink / raw)
To: Chao S
Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Sun, May 17, 2026 at 04:05:07PM -0600, Keith Busch wrote:
> On Sat, May 16, 2026 at 11:53:54PM -0400, Chao S wrote:
> > On Thu, May 07, 2026 at 07:12:26PM +0100, Keith Busch wrote:
> > > [...] how that was defeated [...]
> >
> > Hi Keith,
> >
> > Not the freeze. The WARN does not depend on q->limits, but on
> > ns->head->ms (read live at dispatch, set inside the freeze window) and
> > on REQ_INTEGRITY, never set for EXT_LBAS-non-PI. capacity==0 only
> > gates submission (bio_check_eod()), not dispatch: a writeback bio that
> > passed bio_check_eod() under the old capacity sits on the task plug
> > holding no q_usage_counter ref, so it does not block the freeze;
> > blk_finish_plug() flushes it after the update committed head->ms != 0
> > (dmesg: the capacity-change line prints before the WARN).
> >
> > So it is reachable -- the host-unaware geometry change you described,
> > unrelated to your block fix. The deeper fencing gap is the separate
> > TP-level issue; v2 does not attempt it, it only stops a
> > device-reachable, already-safely-rejected dispatch from being a WARN
> > (a panic under panic_on_warn).
>
> I think tHe WARN is serving it's intendeded purpose: the block layer
> shouldn't have submitted this request. You can't do generic read/write
> with extended metadatate as the DMA is going to corrupt memory with
> respect to what the block layer expects.
>
> This driver is depending on the capacity constraint to prevent this
> scenario, so I think The "end-of-device" check needs to happen within
> the entered queue context. If there's a scenario that escapes that
> check, then I think that's what needs fixing, not the driver.
Does this fix it? I don't necessarily like having yet another check in
the hotpath, but should be exactly the check that drivers expected to be
done, so should be cache hot in the normal case.
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d8..4b5fb32a7d6f8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -553,7 +553,7 @@ ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
* This may well happen - the kernel calls bread() without checking the size of
* the device, e.g., when mounting a file system.
*/
-static inline int bio_check_eod(struct bio *bio)
+int bio_check_eod(struct bio *bio)
{
sector_t maxsector = bdev_nr_sectors(bio->bi_bdev);
unsigned int nr_sectors = bio_sectors(bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8f..b75117ec5c988 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3186,6 +3186,11 @@ void blk_mq_submit_bio(struct bio *bio)
goto queue_exit;
}
+ if (unlikely(bio_check_eod(bio))) {
+ bio_io_error(bio);
+ goto queue_exit;
+ }
+
if ((bio->bi_opf & REQ_POLLED) && !blk_mq_can_poll(q)) {
bio->bi_status = BLK_STS_NOTSUPP;
bio_endio(bio);
diff --git a/block/blk.h b/block/blk.h
index b998a7761faf3..84515bb75485d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -54,6 +54,7 @@ bool blk_queue_start_drain(struct request_queue *q);
bool __blk_freeze_queue_start(struct request_queue *q,
struct task_struct *owner);
int __bio_queue_enter(struct request_queue *q, struct bio *bio);
+int bio_check_eod(struct bio *bio);
void submit_bio_noacct_nocheck(struct bio *bio, bool split);
int bio_submit_or_kill(struct bio *bio, unsigned int flags);
--
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-17 3:54 ` Chao S
@ 2026-05-18 5:56 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-05-18 5:56 UTC (permalink / raw)
To: Chao S
Cc: Christoph Hellwig, linux-nvme, linux-block, kbusch, sagi, axboe,
Sungwoo Kim, Dave Tian, Weidong Zhu
On Sat, May 16, 2026 at 11:54:49PM -0400, Chao S wrote:
> On Thu, May 07, 2026 at 07:48:09AM +0200, Christoph Hellwig wrote:
> > What is "semantic mutation of Identify Namespace responses" [...]
>
> The emulated controller returns a mutated Identify Namespace on
> rescan (flbas META_EXT, lbaf.ms != 0, dps == 0) -- the host-unaware
What does "mutated" mean here?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug
2026-05-17 3:53 ` Chao S
2026-05-17 22:05 ` Keith Busch
@ 2026-05-18 22:41 ` Keith Busch
1 sibling, 0 replies; 12+ messages in thread
From: Keith Busch @ 2026-05-18 22:41 UTC (permalink / raw)
To: Chao S
Cc: linux-nvme, linux-block, hch, sagi, axboe, Sungwoo Kim, Dave Tian,
Weidong Zhu
On Sat, May 16, 2026 at 11:53:54PM -0400, Chao S wrote:
> Not the freeze. The WARN does not depend on q->limits, but on
> ns->head->ms (read live at dispatch, set inside the freeze window) and
> on REQ_INTEGRITY, never set for EXT_LBAS-non-PI. capacity==0 only
> gates submission (bio_check_eod()), not dispatch: a writeback bio that
> passed bio_check_eod() under the old capacity sits on the task plug
> holding no q_usage_counter ref, so it does not block the freeze;
> blk_finish_plug() flushes it after the update committed head->ms != 0
> (dmesg: the capacity-change line prints before the WARN).
I'm also not what you mean about the task plug here. The plug list holds
requests, which hold queue references. The plug does not hold bios. If
you're not using a preallocated cached request, then the bio just gets
stuck on "bio_queue_enter" until the freeze completes with the new
limits, though it does proceed with potentially outdated assumtions.
So I think the problem is that the early checks are done outside the
queue entered context, and some of the following code depends on those
checks being valid.
My suggestion in the other reply isn't quite right for a couple reasons
(partition handling and issues with nvme multipath path failure), so
don't bother trying that.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2026-05-18 22:41 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-27 0:34 [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Chao Shi
2026-04-27 0:34 ` [PATCH RFC 2/2] nvme: set integrity metadata size for EXT_LBAS non-PI namespace Chao Shi
2026-05-07 5:49 ` Christoph Hellwig
2026-05-07 8:05 ` Keith Busch
2026-05-07 5:48 ` [PATCH RFC 1/2] nvme: downgrade WARN in nvme_setup_rw to pr_debug Christoph Hellwig
2026-05-17 3:54 ` Chao S
2026-05-18 5:56 ` Christoph Hellwig
2026-05-07 18:12 ` Keith Busch
2026-05-17 3:53 ` Chao S
2026-05-17 22:05 ` Keith Busch
2026-05-17 22:42 ` Keith Busch
2026-05-18 22:41 ` Keith Busch
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.