* [PATCH AUTOSEL 6.15 11/34] nvme-pci: try function level reset on init failure
[not found] <20250803211836.3546094-1-sashal@kernel.org>
@ 2025-08-03 21:18 ` Sasha Levin
2025-08-03 21:18 ` [PATCH AUTOSEL 6.15 14/34] nvme-tcp: log TLS handshake failures at error level Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-03 21:18 UTC (permalink / raw)
To: patches, stable
Cc: Keith Busch, Chaitanya Kulkarni, Nitesh Shetty, Christoph Hellwig,
Sasha Levin, sagi, linux-nvme
From: Keith Busch <kbusch@kernel.org>
[ Upstream commit 5b2c214a95942f7997d1916a4c44017becbc3cac ]
NVMe devices from multiple vendors appear to get stuck in a reset state
that we can't get out of with an NVMe level Controller Reset. The kernel
would report these with messages that look like:
Device not ready; aborting reset, CSTS=0x1
These have historically required a power cycle to make them usable
again, but in many cases, a PCIe FLR is sufficient to restart operation
without a power cycle. Try it if the initial controller reset fails
during any nvme reset attempt.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees based on the
following analysis:
## 1. Fixes a Real User-Impacting Bug
The commit addresses a significant hardware issue where NVMe devices
from multiple vendors get stuck in a reset state showing "Device not
ready; aborting reset, CSTS=0x1". This is a real-world problem that:
- Affects multiple vendors' devices (not vendor-specific)
- Historically required a full power cycle to recover
- Renders the storage device completely unusable
## 2. Minimal and Contained Fix
The code change is very focused and minimal:
- Only adds ~20 lines of code in a single location
- The fix is contained within the `nvme_pci_configure_admin_queue()`
function
- Only executes when the initial controller reset fails (error path)
- No architectural changes or new features
## 3. Clear Recovery Mechanism
The fix implements a well-understood recovery mechanism:
```c
if (result < 0) {
struct pci_dev *pdev = to_pci_dev(dev->dev);
result = pcie_reset_flr(pdev, false);
if (result < 0)
return result;
pci_restore_state(pdev);
result = nvme_disable_ctrl(&dev->ctrl, false);
```
It uses PCIe Function Level Reset (FLR) as a "bigger hammer" when the
NVMe-level controller reset fails, which is a standard PCIe recovery
mechanism.
## 4. Low Risk of Regression
- Only triggers in error conditions (when `nvme_disable_ctrl` fails)
- Falls back gracefully if FLR also fails
- Uses existing, well-tested PCIe APIs (`pcie_reset_flr`,
`pci_restore_state`)
- Adds informative logging when recovery succeeds
## 5. Follows Stable Tree Rules
- Fixes a bug that prevents device initialization
- No new features or functionality
- Small, self-contained change
- Clear benefit (avoiding power cycles) with minimal risk
## 6. Similar Issues Fixed in Stable
The git history shows multiple commits addressing "stuck reset" issues
in the NVMe driver (e.g., `3f674e7b670b`, `ebef7368571d`), indicating
this is an ongoing class of problems that stable kernels need to handle.
The commit message explicitly states this affects devices from "multiple
vendors" and provides a solution that avoids requiring power cycles,
making it a clear candidate for stable backporting to improve system
reliability.
drivers/nvme/host/pci.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 776c867fb64d..5396282015a2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1888,8 +1888,28 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
* might be pointing at!
*/
result = nvme_disable_ctrl(&dev->ctrl, false);
- if (result < 0)
- return result;
+ if (result < 0) {
+ struct pci_dev *pdev = to_pci_dev(dev->dev);
+
+ /*
+ * The NVMe Controller Reset method did not get an expected
+ * CSTS.RDY transition, so something with the device appears to
+ * be stuck. Use the lower level and bigger hammer PCIe
+ * Function Level Reset to attempt restoring the device to its
+ * initial state, and try again.
+ */
+ result = pcie_reset_flr(pdev, false);
+ if (result < 0)
+ return result;
+
+ pci_restore_state(pdev);
+ result = nvme_disable_ctrl(&dev->ctrl, false);
+ if (result < 0)
+ return result;
+
+ dev_info(dev->ctrl.device,
+ "controller reset completed after pcie flr\n");
+ }
result = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH);
if (result)
--
2.39.5
^ permalink raw reply related [flat|nested] 2+ messages in thread
* [PATCH AUTOSEL 6.15 14/34] nvme-tcp: log TLS handshake failures at error level
[not found] <20250803211836.3546094-1-sashal@kernel.org>
2025-08-03 21:18 ` [PATCH AUTOSEL 6.15 11/34] nvme-pci: try function level reset on init failure Sasha Levin
@ 2025-08-03 21:18 ` Sasha Levin
1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2025-08-03 21:18 UTC (permalink / raw)
To: patches, stable
Cc: Maurizio Lombardi, Laurence Oberman, Hannes Reinecke,
Christoph Hellwig, Sasha Levin, kbusch, sagi, linux-nvme
From: Maurizio Lombardi <mlombard@redhat.com>
[ Upstream commit 5a58ac9bfc412a58c3cf26c6a7e54d4308e9d109 ]
Update the nvme_tcp_start_tls() function to use dev_err() instead of
dev_dbg() when a TLS error is detected. This ensures that handshake
failures are visible by default, aiding in debugging.
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the
following reasons:
1. **Fixes a Real Debugging Issue**: The commit addresses a genuine
problem where TLS handshake failures were being logged at debug level
(`dev_dbg()`), making them invisible in production environments
unless debug logging is explicitly enabled. This change promotes
error visibility by using `dev_err()` for actual error conditions.
2. **Small and Contained Change**: The patch is minimal - it only
changes logging behavior by:
- Adding a conditional check `if (queue->tls_err)`
- Moving the error case from `dev_dbg()` to `dev_err()`
- Keeping successful handshakes at debug level
The diff shows only about 10 lines changed with no functional
modifications.
3. **No Architectural Changes**: This is purely a logging improvement
that doesn't alter any control flow, data structures, or protocol
behavior. It simply makes existing errors more visible.
4. **Important for Production Debugging**: TLS handshake failures in
NVMe-TCP can be caused by various issues (certificate problems, key
mismatches, network issues) that are critical to diagnose in
production. Having these errors hidden at debug level severely
hampers troubleshooting.
5. **Mature Feature Area**: NVMe-TCP TLS support was introduced in
kernel 6.5 (around August 2023 based on commit be8e82caa685), making
it a relatively mature feature that's likely deployed in production
systems requiring proper error visibility.
6. **No Risk of Regression**: The change only affects logging output and
cannot introduce functional regressions. The worst case is slightly
more verbose kernel logs when TLS errors occur, which is the intended
behavior.
7. **Follows Stable Rules**: This meets the stable kernel criteria as
it's a small fix that improves debuggability of an existing feature
without introducing new functionality or risks.
drivers/nvme/host/tcp.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b882ee6ef40f..d0ee9e2a3294 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1776,9 +1776,14 @@ static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
qid, ret);
tls_handshake_cancel(queue->sock->sk);
} else {
- dev_dbg(nctrl->device,
- "queue %d: TLS handshake complete, error %d\n",
- qid, queue->tls_err);
+ if (queue->tls_err) {
+ dev_err(nctrl->device,
+ "queue %d: TLS handshake complete, error %d\n",
+ qid, queue->tls_err);
+ } else {
+ dev_dbg(nctrl->device,
+ "queue %d: TLS handshake complete\n", qid);
+ }
ret = queue->tls_err;
}
return ret;
--
2.39.5
^ permalink raw reply related [flat|nested] 2+ messages in thread
end of thread, other threads:[~2025-08-03 21:23 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20250803211836.3546094-1-sashal@kernel.org>
2025-08-03 21:18 ` [PATCH AUTOSEL 6.15 11/34] nvme-pci: try function level reset on init failure Sasha Levin
2025-08-03 21:18 ` [PATCH AUTOSEL 6.15 14/34] nvme-tcp: log TLS handshake failures at error level Sasha Levin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).