public inbox for linux-nvme@lists.infradead.org
* [PATCH AUTOSEL 6.19-5.10] nvme-pci: ensure we're polling a polled queue
       [not found] <20260310090145.2709021-1-sashal@kernel.org>
@ 2026-03-10  9:01 ` Sasha Levin
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-5.15] nvme-pci: cap queue creation to used queues Sasha Levin
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-6.1] nvme-fabrics: use kfree_sensitive() for DHCHAP secrets Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-10  9:01 UTC (permalink / raw)
  To: patches, stable
  Cc: Keith Busch, Christoph Hellwig, Kanchan Joshi, Sasha Levin, sagi,
	linux-nvme, linux-kernel

From: Keith Busch <kbusch@kernel.org>

[ Upstream commit 166e31d7dbf6aa44829b98aa446bda5c9580f12a ]

A user can change the polled queue count at run time. There's a brief
window during a reset where a hipri task may try to poll that queue
before the block layer has updated the queue maps, which would race with
the now interrupt driven queue and may cause double completions.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:


## Analysis

### 1. Commit Message Analysis

The commit message clearly describes a **race condition** that can cause
**double completions**:
- A user can change the polled queue count at runtime
- During a controller reset, there's a window where a task issuing
  high-priority polled (hipri) I/O may poll a queue that has
  transitioned from polled to interrupt-driven
- Both the polling path and the interrupt handler could complete the
  same I/O, causing double completions

Double completions in the block layer are a serious bug — they can lead
to use-after-free (the request is freed after the first completion, then
accessed again during the second completion), data corruption, or kernel
crashes.

### 2. Code Change Analysis

The fix is minimal — a single additional check in `nvme_poll()`:

```diff
-	if (!nvme_cqe_pending(nvmeq))
+	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
+	    !nvme_cqe_pending(nvmeq))
 		return 0;
```

Before polling the completion queue, verify the queue is actually still
a polled queue via `test_bit(NVMEQ_POLLED, ...)`. If the queue has
transitioned to interrupt-driven mode (the NVMEQ_POLLED bit was cleared
in `nvme_suspend_queue()`), skip the poll and return 0.

This is a tiny diff (two insertions, one deletion) in a single file.
The risk of regression is extremely low.

### 3. Classification

- **Race condition fix** — prevents concurrent polling and interrupt-
  driven completion of the same queue
- **Prevents double completions** — which are a serious kernel bug
  (potential UAF, crashes)
- Authored by Keith Busch (NVMe maintainer), reviewed by Christoph
  Hellwig and Kanchan Joshi

### 4. Scope and Risk

- **1 file changed, 2 insertions, 1 deletion** — extremely small and
  contained
- The added check is a fast bit test that simply returns early if the
  queue isn't polled
- No functional changes to the normal code path — polled queues still
  poll as before
- Only affects the edge case of runtime queue configuration changes

### 5. Dependencies

The `NVMEQ_POLLED` flag has existed since kernel v5.2 (commit
7c349dde26b7, March 2019). The `test_bit(NVMEQ_POLLED, ...)` pattern is
already used elsewhere in the same file (e.g., `nvme_poll_irqdisable()`,
`nvme_suspend_queue()`). This fix is fully self-contained and applies
cleanly without any prerequisites.

The commit appears alongside two related commits (`44b798a6fd680` "cap
queue creation" and `02bfc966463a7` "do not try to add queue maps at
runtime"), but this fix is independent — it guards against an already-
existing race in `nvme_poll()` regardless of the other changes.

### 6. User Impact

NVMe is one of the most widely used storage interfaces in modern
systems. Any system using polled I/O queues (common for high-performance
storage workloads) and changing queue configuration at runtime could hit
this race. Double completions can cause kernel panics, data corruption,
or use-after-free.

### Verification

- Verified upstream commit is `166e31d7dbf6` which appeared in
  `v7.0-rc3`
- Confirmed `NVMEQ_POLLED` flag was introduced in commit `7c349dde26b7`
  (v5.2), so it exists in all current stable trees
- Confirmed the fix is a 2-line change modifying only
  `drivers/nvme/host/pci.c`
- Confirmed the code pattern (`test_bit(NVMEQ_POLLED, ...)`) is already
  used elsewhere in the same file, particularly in
  `nvme_poll_irqdisable()` at line 1631
- Verified the commit is reviewed by Christoph Hellwig and Kanchan Joshi
- Confirmed the fix is self-contained — no dependency on the two
  companion commits

### Conclusion

This is a textbook stable backport candidate: a tiny, surgical fix for a
race condition that can cause double completions (potential UAF/crash),
in a critical subsystem (NVMe storage), authored and reviewed by
subsystem maintainers, with zero risk of regression.

**YES**

 drivers/nvme/host/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 74aca8846fffe..3f4af341474e1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1499,7 +1499,8 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 	struct nvme_queue *nvmeq = hctx->driver_data;
 	bool found;
 
-	if (!nvme_cqe_pending(nvmeq))
+	if (!test_bit(NVMEQ_POLLED, &nvmeq->flags) ||
+	    !nvme_cqe_pending(nvmeq))
 		return 0;
 
 	spin_lock(&nvmeq->cq_poll_lock);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 3+ messages in thread

* [PATCH AUTOSEL 6.19-5.15] nvme-pci: cap queue creation to used queues
       [not found] <20260310090145.2709021-1-sashal@kernel.org>
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-5.10] nvme-pci: ensure we're polling a polled queue Sasha Levin
@ 2026-03-10  9:01 ` Sasha Levin
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-6.1] nvme-fabrics: use kfree_sensitive() for DHCHAP secrets Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-10  9:01 UTC (permalink / raw)
  To: patches, stable
  Cc: Keith Busch, Kanchan Joshi, Christoph Hellwig, Sasha Levin, sagi,
	linux-nvme, linux-kernel

From: Keith Busch <kbusch@kernel.org>

[ Upstream commit 4735b510a00fb2d4ac9e8d21a8c9552cb281f585 ]

If the user reduces the special queue count at runtime and resets the
controller, we need to reduce the number of queues and interrupts
requested accordingly rather than start with the pre-allocated queue
count.

Tested-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:


## Analysis

### What the commit does

This is a one-line functional change (replacing
`dev->nr_allocated_queues - 1` with `min(nvme_max_io_queues(dev),
dev->nr_allocated_queues - 1)`) in the NVMe PCI driver's
`nvme_setup_io_queues()` function.

### The bug

The `write_queues` and `poll_queues` module parameters are writable at
runtime (permission 0644). At probe time, `nr_allocated_queues` is set
to `nvme_max_io_queues(dev) + 1`, based on the CPU count plus write/poll
queue counts at that time. If the user later **reduces** these
parameters and then triggers a controller reset,
`nvme_setup_io_queues()` would still use the old, larger
`nr_allocated_queues - 1` as the starting queue count, requesting more
interrupts and queues than actually needed.

### Historical context

Interestingly, there **was** a `min()` cap here originally. Commit
`ff4e5fbad06f7` (v5.11) removed it with the reasoning that
`nr_allocated_queues` equals `nvme_max_io_queues() + 1` so the min is
redundant. That reasoning is correct **at probe time** but breaks when
module parameters change at runtime. This commit effectively restores
the `min()` to handle the runtime modification case.

### Consequences of the bug

Without this fix, after reducing queue parameters and resetting:
1. Too many interrupt vectors are requested via
   `pci_alloc_irq_vectors_affinity()`
2. The retry loop at lines 3022-3030 may be triggered unnecessarily when
   the controller can't create all requested queues
3. Unnecessary resource waste (interrupt vectors, queue slots)
4. Potential allocation failures on systems with limited interrupt
   resources

### Stable criteria assessment

- **Fixes a real bug**: Yes - incorrect behavior when runtime-modifiable
  parameters change
- **Obviously correct**: Yes - it's restoring a `min()` that was
  previously there, with well-understood semantics
- **Small and contained**: Yes - 1 functional line change (7 lines with
  comment)
- **No new features**: Correct - purely a bug fix
- **Tested and reviewed**: Has Tested-by, Reviewed-by from Kanchan Joshi
  and Reviewed-by from Christoph Hellwig (NVMe subsystem maintainer)
- **Author**: Keith Busch is the NVMe maintainer himself
- **Risk**: Very low - `min()` can only reduce the initial queue count,
  never increase it; the previous value was always valid, so this can
  only make things better

### Dependencies

This commit is standalone. The `nvme_max_io_queues()` function has
existed since v5.10 (commit `e3aef0950a30e`). The related commit
`02bfc966463a7` ("do not try to add queue maps at runtime") was applied
to the same stable tree but is independent - it handles a different
aspect (preventing addition of queue map types at runtime), while this
commit handles queue **count** capping.

The code at lines 2923-2933 (tagset nr_maps check from the "do not try
to add queue maps" commit) executes before this line and can further
reduce `dev->nr_write_queues`/`dev->nr_poll_queues`. While this commit
works fine without that prerequisite (both independently reduce the
queue count for different reasons), having the prerequisite makes the
`min()` even more effective since `nvme_max_io_queues()` returns the
properly capped value.

### Verification

- **git show ff4e5fbad06f7**: Confirmed that a `min()` with
  `nvme_max_io_queues()` existed before v5.11 and was removed because it
  was considered redundant at the time
- **git tag --contains ff4e5fbad06f7**: Confirmed the `min()` removal is
  in stable trees from v5.11 onward, meaning all actively maintained
  stable trees (5.15.y, 6.1.y, 6.6.y, etc.) have the bug
- **Module parameter permissions**: Confirmed `write_queues` and
  `poll_queues` are 0644 (runtime writable) at line 265 and 271 of pci.c
- **nvme_max_io_queues()**: Confirmed it exists since v5.10 (commit
  e3aef0950a30e), so the function is available in all stable trees
- **git show 4735b510a00fb**: Confirmed upstream commit exists and the
  diff matches exactly
- **git show 02bfc966463a7**: Verified the related "do not try to add
  queue maps" commit is independent; it adds the tagset check (lines
  2923-2933) which is additive and doesn't conflict
- **git log --author="Keith Busch"**: Confirmed Keith Busch is the NVMe
  maintainer and author of related fixes
- Could not independently verify whether the retry loop causes user-
  visible failures or is merely wasteful; either way, requesting excess
  interrupts is clearly wrong behavior

### Conclusion

This is a small, surgical, well-reviewed bug fix by the NVMe subsystem
maintainer that restores a necessary safety cap for a realistic user
scenario (modifying module parameters at runtime). It has minimal risk -
`min()` can only reduce the queue count. The affected code exists in all
maintained stable trees. It meets all stable kernel criteria.

**YES**

 drivers/nvme/host/pci.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d86f2565a92ca..74aca8846fffe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2776,7 +2776,13 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	dev->nr_write_queues = write_queues;
 	dev->nr_poll_queues = poll_queues;
 
-	nr_io_queues = dev->nr_allocated_queues - 1;
+	/*
+	 * The initial number of allocated queue slots may be too large if the
+	 * user reduced the special queue parameters. Cap the value to the
+	 * number we need for this round.
+	 */
+	nr_io_queues = min(nvme_max_io_queues(dev),
+			   dev->nr_allocated_queues - 1);
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
 	if (result < 0)
 		return result;
-- 
2.51.0




* [PATCH AUTOSEL 6.19-6.1] nvme-fabrics: use kfree_sensitive() for DHCHAP secrets
       [not found] <20260310090145.2709021-1-sashal@kernel.org>
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-5.10] nvme-pci: ensure we're polling a polled queue Sasha Levin
  2026-03-10  9:01 ` [PATCH AUTOSEL 6.19-5.15] nvme-pci: cap queue creation to used queues Sasha Levin
@ 2026-03-10  9:01 ` Sasha Levin
  2 siblings, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2026-03-10  9:01 UTC (permalink / raw)
  To: patches, stable
  Cc: Daniel Hodges, Christoph Hellwig, Keith Busch, Sasha Levin, sagi,
	linux-nvme, linux-kernel

From: Daniel Hodges <hodgesd@meta.com>

[ Upstream commit 0a1fc2f301529ac75aec0ce80d5ab9d9e4dc4b16 ]

The DHCHAP secrets (dhchap_secret and dhchap_ctrl_secret) contain
authentication key material for NVMe-oF. Use kfree_sensitive() instead
of kfree() in nvmf_free_options() to ensure secrets are zeroed before
the memory is freed, preventing recovery from freed pages.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Hodges <hodgesd@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit does
This is a two-line change replacing `kfree()` with `kfree_sensitive()`
for two fields (`dhchap_secret` and `dhchap_ctrl_secret`) in
`nvmf_free_options()`. `kfree_sensitive()` zeroes memory before freeing
it, preventing authentication key material from being recoverable from
freed kernel pages.

### Bug classification: Security hardening
This is a security hygiene fix for sensitive cryptographic material. The
DHCHAP (Diffie-Hellman Hash-based Authentication Challenge Handshake
Protocol) secrets are authentication keys for NVMe-over-Fabrics
connections. Without zeroing, these keys could potentially be recovered
from freed memory by an attacker with kernel memory read access (e.g.,
via `/dev/mem`, `/proc/kcore`, crash dump, cold boot attacks, or another
kernel vulnerability).

### Consistency with existing codebase
The NVMe subsystem already uses `kfree_sensitive()` extensively for
similar authentication material:
- `drivers/nvme/host/auth.c`: Uses it for `host_key`, `ctrl_key`,
  `sess_key`, `tls_psk`, etc.
- `drivers/nvme/common/auth.c`: Uses it for `key`, `hashed_key`, `psk`,
  etc.
- `drivers/nvme/target/auth.c`: Uses it for `dh_key`, `tls_psk`, etc.

The two fields changed here (`dhchap_secret` and `dhchap_ctrl_secret`)
were an oversight - they contain the same type of sensitive
authentication material but were using plain `kfree()`.

### Stable criteria assessment
- **Obviously correct**: Yes - `kfree_sensitive()` is a drop-in
  replacement for `kfree()` with added zeroing. No behavioral change.
- **Fixes a real bug**: Yes - leaking cryptographic key material in
  freed memory is a security issue.
- **Small and contained**: Yes - exactly 2 lines changed.
- **No new features**: Correct - no new APIs or behavior.
- **Risk**: Extremely low - `kfree_sensitive()` just adds `memset(0)`
  before `kfree()`.

### Remaining inconsistencies
Note that there are still other places using plain `kfree()` for dhchap
secrets (fabrics.c lines 1034/1048 for option parsing, sysfs.c,
target/configfs.c, target/auth.c) - but this commit fixes the main
cleanup path and is self-contained.

### Verification
- Read `fabrics.c:1282-1296` to confirm the current code already has
  `kfree_sensitive()` (the commit has already been applied to this
  tree).
- Verified `dhchap_secret` and `dhchap_ctrl_secret` are defined as `char
  *` in `fabrics.h:130-131` and documented as "DH-HMAC-CHAP secret" /
  "DH-HMAC-CHAP controller secret".
- Confirmed via grep that `kfree_sensitive()` is used extensively for
  similar authentication keys throughout `drivers/nvme/host/auth.c` and
  `drivers/nvme/common/auth.c` (18+ instances).
- Confirmed via grep that other dhchap_secret free paths still use plain
  `kfree()` (sysfs.c, fabrics.c option parsing, target side) - this
  commit is incomplete coverage but still valuable.
- Reviewed-by: Christoph Hellwig (well-known kernel developer and NVMe
  maintainer) provides strong confidence.
- The DHCHAP feature was introduced in commit `f50fff73d620` ("nvme:
  implement In-Band authentication") which was in v6.0 cycle, so this is
  relevant to stable trees 6.1+.

### Conclusion
This is a minimal, zero-risk security fix for sensitive cryptographic
material. It follows established patterns in the same subsystem, is
reviewed by a senior maintainer, and meets all stable criteria. The
security benefit (preventing key material leakage) clearly outweighs the
negligible risk.

**YES**

 drivers/nvme/host/fabrics.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 55a8afd2efd50..d37cb140d8323 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -1290,8 +1290,8 @@ void nvmf_free_options(struct nvmf_ctrl_options *opts)
 	kfree(opts->subsysnqn);
 	kfree(opts->host_traddr);
 	kfree(opts->host_iface);
-	kfree(opts->dhchap_secret);
-	kfree(opts->dhchap_ctrl_secret);
+	kfree_sensitive(opts->dhchap_secret);
+	kfree_sensitive(opts->dhchap_ctrl_secret);
 	kfree(opts);
 }
 EXPORT_SYMBOL_GPL(nvmf_free_options);
-- 
2.51.0


