* [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery
@ 2026-02-14 4:25 Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 01/21] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
` (21 more replies)
0 siblings, 22 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
This patchset adds support for TP8028 Rapid Path Failure Recovery for
both the nvme target and initiator. Rapid Path Failure Recovery brings
Cross-Controller Reset (CCR) functionality to nvme. It allows an nvme
host to send an nvme command to a source nvme controller to reset an
impacted nvme controller, provided that both the source and impacted
controllers are in the same nvme subsystem.
The main use of CCR is when one path to the nvme subsystem fails.
Inflight IOs on the impacted nvme controller need to be terminated
before they can be retried on another path, otherwise data corruption
may happen. CCR provides a quick way to terminate these IOs on the
unreachable nvme controller, allowing recovery to move forward without
unnecessary delays. In case CCR is not possible, inflight requests are
held for the duration defined by TP4129 KATO Corrections and
Clarifications before they are allowed to be retried.
On the target side:
* New struct members have been added to support CCR. struct nvme_id_ctrl
has been updated with CIU (Controller Instance Uniquifier), CIRN
(Controller Instance Random Number), and CQT (Command Quiesce Time).
The combination of CIU, CNTLID, and CIRN is used to identify the
impacted controller in a CCR command.
* The CCR nvme command implemented on the target causes the impacted
controller to fail and drop its connections to the host.
* The CCR logpage contains the status of pending CCR requests. An entry
is added to the logpage after a CCR request is validated. Completed CCR
requests are removed from the logpage when the controller becomes ready
or when requested in a get logpage command.
* An AEN is sent when CCR completes to let the host know that it is safe
to retry inflight requests.
On the host side:
* CIU, CIRN, and CQT have been added to struct nvme_ctrl. CIU and CIRN
have been added to sysfs to make the values visible to the user.
CIU and CIRN can be used to construct and manually send admin-passthru
CCR commands.
* New controller states FENCING and FENCED have been added to make sure
that inflight requests do not get canceled if they time out during the
fencing process. FENCED exists so that the controller state machine does
not have a FENCING -> RESETTING transition; instead it goes FENCING ->
FENCED -> RESETTING. This prevents a controller that is being fenced
from getting reset. Only after fencing finishes is the impacted
controller reset.
* Controller recovery in nvme_fence_ctrl() is invoked when a LIVE
controller hits an error or when a request times out. CCR is attempted
first to reset the impacted controller. If it fails, inflight requests
are held until it is safe to retry them.
* Updated nvme fabric transports nvme-tcp, nvme-rdma, and nvme-fc to
use CCR recovery.
Ideally all inflight requests should be held during controller recovery
and only retried after recovery is done. However, there are known
situations where that is not the case in this implementation. These gaps
will be addressed in future patches:
* Manual controller reset from sysfs results in the controller going to
the RESETTING state and all inflight requests being canceled
immediately; they may be retried on another path.
* Manual controller delete from sysfs also results in all inflight
requests being canceled immediately; they may be retried on another
path.
* In nvme-fc, the nvme controller is deleted if the remote port
disappears and no timeout is specified. This results in immediate
cancellation of requests, which may then be retried on another path.
* In nvme-rdma, if the HCA is removed, all nvme controllers are deleted.
This results in inflight IOs being canceled; they may then be retried on
another path.
Changes from v2:
- nvmet: Implement CCR nvme command
- Minor changes addressing review comments on v2.
- nvme: Rapid Path Failure Recovery read controller identify fields
Addressed a security concern that CCR could be used to cause denial
of service. Changed the permissions of the CIU and CIRN sysfs
attributes from S_IRUGO to S_IRUSR. This makes sure only the root
user can read these attributes.
- nvme: Introduce FENCING and FENCED controller states
Addressed code review comments. Minor changes.
- nvme: Implement cross-controller reset recovery
- Refactored nvme_find_ctrl_ccr(), more idiomatic code.
- Updated nvme_issue_wait_ccr() to return:
  - 0 on success.
  - EIO on failure to submit the CCR command.
  - ETIMEDOUT when waiting for the CCR operation timed out.
  - EREMOTEIO when the CCR operation failed.
- Updated nvme_fence_ctrl() such that the CCR operation is tried
  on at most one source controller.
- nvme-tcp: Use CCR to recover controller that hits an error
- Dropped ctrl->fenced_work. Moved to CQT patches.
- nvme_tcp_fencing_work() resets controller regardless of
CCR success or failure.
- nvme-rdma: Use CCR to recover controller that hits an error
- Similar to nvme-tcp
- nvme-fc: Decouple error recovery from controller reset
- nvme_fc_start_ioerr_recovery() queues ctrl->ioerr_work in the case of
CONNECTING, DELETING, and DELETING_NOIO without changing controller
state. For CONNECTING this addresses an issue raised during
code review. For DELETING{_NOIO} it addresses an issue observed
during testing where a controller is deleted with inflight IOs.
- Updated nvme_fc_ctrl_ioerr_work() to handle the CONNECTING state in
a special way, just aborting outstanding IO. This change addresses an
issue raised during code review.
- nvme_fc_error_recovery() has been updated to flush
ctrl->ctrl.async_event_work as mentioned in code review.
- nvme-fc: Use CCR to recover controller that hits an error
- Changes similar to nvme-rdma and nvme-tcp
- nvme-fc: Do not cancel requests in io target before it is initialized
- A new patch added to address an issue observed during testing.
- CQT changes have been pulled to separate patches
- nvmet: Add support for CQT to nvme target
- nvme: Add support for CQT to nvme host
- nvme: Update CCR completion wait timeout to consider CQT
- nvme-tcp: Extend FENCING state per TP4129 on CCR failure
- nvme-rdma: Extend FENCING state per TP4129 on CCR failure
- nvme-fc: Extend FENCING state per TP4129 on CCR failure
v2: https://lore.kernel.org/all/20260130223531.2478849-1-mkhalfella@purestorage.com/
Mohamed Khalfella (21):
nvmet: Rapid Path Failure Recovery set controller identify fields
nvmet/debugfs: Export controller CIU and CIRN via debugfs
nvmet: Implement CCR nvme command
nvmet: Implement CCR logpage
nvmet: Send an AEN on CCR completion
nvme: Rapid Path Failure Recovery read controller identify fields
nvme: Introduce FENCING and FENCED controller states
nvme: Implement cross-controller reset recovery
nvme: Implement cross-controller reset completion
nvme-tcp: Use CCR to recover controller that hits an error
nvme-rdma: Use CCR to recover controller that hits an error
nvme-fc: Decouple error recovery from controller reset
nvme-fc: Use CCR to recover controller that hits an error
nvme-fc: Hold inflight requests while in FENCING state
nvme-fc: Do not cancel requests in io target before it is initialized
nvmet: Add support for CQT to nvme target
nvme: Add support for CQT to nvme host
nvme: Update CCR completion wait timeout to consider CQT
nvme-tcp: Extend FENCING state per TP4129 on CCR failure
nvme-rdma: Extend FENCING state per TP4129 on CCR failure
nvme-fc: Extend FENCING state per TP4129 on CCR failure
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 222 +++++++++++++++++++++++++++-
drivers/nvme/host/fc.c | 249 ++++++++++++++++++++++++--------
drivers/nvme/host/nvme.h | 25 ++++
drivers/nvme/host/rdma.c | 63 +++++++-
drivers/nvme/host/sysfs.c | 27 ++++
drivers/nvme/host/tcp.c | 63 +++++++-
drivers/nvme/target/admin-cmd.c | 124 ++++++++++++++++
drivers/nvme/target/configfs.c | 31 ++++
drivers/nvme/target/core.c | 113 ++++++++++++++-
drivers/nvme/target/debugfs.c | 21 +++
drivers/nvme/target/nvmet.h | 20 ++-
include/linux/nvme.h | 70 ++++++++-
13 files changed, 953 insertions(+), 76 deletions(-)
base-commit: cd7a5651db263b5384aef1950898e5e889425134
--
2.52.0
* [PATCH v3 01/21] nvmet: Rapid Path Failure Recovery set controller identify fields
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 02/21] nvmet/debugfs: Export controller CIU and CIRN via debugfs Mohamed Khalfella
` (20 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
TP8028 Rapid Path Failure Recovery defined new fields in the controller
identify response. The newly defined fields are:
- CIU (Controller Instance Uniquifier): an 8-bit non-zero value that is
assigned a random value when the controller is first created. The value
is incremented every time the RDY bit in the CSTS register is asserted.
- CIRN (Controller Instance Random Number): a 64-bit random value that
is generated when the controller is created. CIRN is regenerated every
time the RDY bit in the CSTS register is asserted.
- CCRL (Cross-Controller Reset Limit): an 8-bit value that defines the
maximum number of in-progress controller reset operations. CCRL is
hardcoded to 4 as recommended by TP8028.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 5 +++++
drivers/nvme/target/core.c | 9 +++++++++
drivers/nvme/target/nvmet.h | 2 ++
include/linux/nvme.h | 10 ++++++++--
4 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 5e366502fb75..368e36362ac5 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -696,6 +696,11 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->cntlid = cpu_to_le16(ctrl->cntlid);
id->ver = cpu_to_le32(ctrl->subsys->ver);
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ id->ciu = ctrl->ciu;
+ id->cirn = cpu_to_le64(ctrl->cirn);
+ id->ccrl = NVMF_CCR_LIMIT;
+ }
/* XXX: figure out what to do about RTD3R/RTD3 */
id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index eab3e4fc0f74..e5f413405604 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1394,6 +1394,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
return;
}
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
+ ctrl->cirn = get_random_u64();
+ }
ctrl->csts = NVME_CSTS_RDY;
/*
@@ -1662,6 +1666,11 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
ctrl->cntlid = ret;
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ ctrl->ciu = get_random_u8() ? : 1;
+ ctrl->cirn = get_random_u64();
+ }
+
/*
* Discovery controllers may use some arbitrary high value
* in order to cleanup stale discovery sessions
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index b664b584fdc8..a36daa5d3a57 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -264,7 +264,9 @@ struct nvmet_ctrl {
uuid_t hostid;
u16 cntlid;
+ u8 ciu;
u32 kato;
+ u64 cirn;
struct nvmet_port *port;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 655d194f8e72..7746b6d30349 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -21,6 +21,8 @@
#define NVMF_TRADDR_SIZE 256
#define NVMF_TSAS_SIZE 256
+#define NVMF_CCR_LIMIT 4
+
#define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"
#define NVME_NSID_ALL 0xffffffff
@@ -328,7 +330,10 @@ struct nvme_id_ctrl {
__le16 crdt1;
__le16 crdt2;
__le16 crdt3;
- __u8 rsvd134[122];
+ __u8 rsvd134[1];
+ __u8 ciu;
+ __le64 cirn;
+ __u8 rsvd144[112];
__le16 oacs;
__u8 acl;
__u8 aerl;
@@ -389,7 +394,8 @@ struct nvme_id_ctrl {
__u8 msdbd;
__u8 rsvd1804[2];
__u8 dctype;
- __u8 rsvd1807[241];
+ __u8 ccrl;
+ __u8 rsvd1808[240];
struct nvme_id_power_state psd[32];
__u8 vs[1024];
};
--
2.52.0
* [PATCH v3 02/21] nvmet/debugfs: Export controller CIU and CIRN via debugfs
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 01/21] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 03/21] nvmet: Implement CCR nvme command Mohamed Khalfella
` (19 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
Export ctrl->ciu and ctrl->cirn as debugfs files under controller
debugfs directory.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/target/debugfs.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/drivers/nvme/target/debugfs.c b/drivers/nvme/target/debugfs.c
index 5dcbd5aa86e1..1300adf6c1fb 100644
--- a/drivers/nvme/target/debugfs.c
+++ b/drivers/nvme/target/debugfs.c
@@ -152,6 +152,23 @@ static int nvmet_ctrl_tls_concat_show(struct seq_file *m, void *p)
}
NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_concat);
#endif
+static int nvmet_ctrl_instance_ciu_show(struct seq_file *m, void *p)
+{
+ struct nvmet_ctrl *ctrl = m->private;
+
+ seq_printf(m, "%02x\n", ctrl->ciu);
+ return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_ciu);
+
+static int nvmet_ctrl_instance_cirn_show(struct seq_file *m, void *p)
+{
+ struct nvmet_ctrl *ctrl = m->private;
+
+ seq_printf(m, "%016llx\n", ctrl->cirn);
+ return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_cirn);
int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
{
@@ -184,6 +201,10 @@ int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
debugfs_create_file("tls_key", S_IRUSR, ctrl->debugfs_dir, ctrl,
&nvmet_ctrl_tls_key_fops);
#endif
+ debugfs_create_file("ciu", S_IRUSR, ctrl->debugfs_dir, ctrl,
+ &nvmet_ctrl_instance_ciu_fops);
+ debugfs_create_file("cirn", S_IRUSR, ctrl->debugfs_dir, ctrl,
+ &nvmet_ctrl_instance_cirn_fops);
return 0;
}
--
2.52.0
* [PATCH v3 03/21] nvmet: Implement CCR nvme command
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 01/21] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 02/21] nvmet/debugfs: Export controller CIU and CIRN via debugfs Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-27 16:30 ` Maurizio Lombardi
2026-02-14 4:25 ` [PATCH v3 04/21] nvmet: Implement CCR logpage Mohamed Khalfella
` (18 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
Defined by TP8028 Rapid Path Failure Recovery, the CCR (Cross-Controller
Reset) command is an nvme command issued by the initiator to a source
controller to reset an impacted controller. Implement the CCR command
for the linux nvme target.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 74 ++++++++++++++++++++++++++++++++
drivers/nvme/target/core.c | 76 +++++++++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 13 ++++++
include/linux/nvme.h | 23 ++++++++++
4 files changed, 186 insertions(+)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 368e36362ac5..65ed772babb8 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
log->acs[nvme_admin_get_features] =
log->acs[nvme_admin_async_event] =
log->acs[nvme_admin_keep_alive] =
+ log->acs[nvme_admin_cross_ctrl_reset] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
+
}
static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
@@ -1614,6 +1616,75 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
nvmet_req_complete(req, status);
}
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
+{
+ struct nvmet_ctrl *ictrl, *sctrl = req->sq->ctrl;
+ struct nvme_command *cmd = req->cmd;
+ struct nvmet_ccr *ccr, *new_ccr;
+ int ccr_active, ccr_total;
+ u16 cntlid, status = NVME_SC_SUCCESS;
+
+ cntlid = le16_to_cpu(cmd->ccr.icid);
+ if (sctrl->cntlid == cntlid) {
+ req->error_loc =
+ offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
+ status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+ goto out;
+ }
+
+ /* Find and get impacted controller */
+ ictrl = nvmet_ctrl_find_get_ccr(sctrl->subsys, sctrl->hostnqn,
+ cmd->ccr.ciu, cntlid,
+ le64_to_cpu(cmd->ccr.cirn));
+ if (!ictrl) {
+ /* Immediate Reset Successful */
+ nvmet_set_result(req, 1);
+ status = NVME_SC_SUCCESS;
+ goto out;
+ }
+
+ ccr_total = ccr_active = 0;
+ mutex_lock(&sctrl->lock);
+ list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
+ if (ccr->ctrl == ictrl) {
+ status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
+ goto out_unlock;
+ }
+
+ ccr_total++;
+ if (ccr->ctrl)
+ ccr_active++;
+ }
+
+ if (ccr_active >= NVMF_CCR_LIMIT) {
+ status = NVME_SC_CCR_LIMIT_EXCEEDED;
+ goto out_unlock;
+ }
+ if (ccr_total >= NVMF_CCR_PER_PAGE) {
+ status = NVME_SC_CCR_LOGPAGE_FULL;
+ goto out_unlock;
+ }
+
+ new_ccr = kmalloc(sizeof(*new_ccr), GFP_KERNEL);
+ if (!new_ccr) {
+ status = NVME_SC_INTERNAL;
+ goto out_unlock;
+ }
+
+ new_ccr->ciu = cmd->ccr.ciu;
+ new_ccr->icid = cntlid;
+ new_ccr->ctrl = ictrl;
+ list_add_tail(&new_ccr->entry, &sctrl->ccr_list);
+
+out_unlock:
+ mutex_unlock(&sctrl->lock);
+ if (status == NVME_SC_SUCCESS)
+ nvmet_ctrl_fatal_error(ictrl);
+ nvmet_ctrl_put(ictrl);
+out:
+ nvmet_req_complete(req, status);
+}
+
u32 nvmet_admin_cmd_data_len(struct nvmet_req *req)
{
struct nvme_command *cmd = req->cmd;
@@ -1691,6 +1762,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
case nvme_admin_keep_alive:
req->execute = nvmet_execute_keep_alive;
return 0;
+ case nvme_admin_cross_ctrl_reset:
+ req->execute = nvmet_execute_cross_ctrl_reset;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index e5f413405604..38f71e1a1b8e 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -115,6 +115,20 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
return 0;
}
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all)
+{
+ struct nvmet_ccr *ccr, *tmp;
+
+ lockdep_assert_held(&ctrl->lock);
+
+ list_for_each_entry_safe(ccr, tmp, &ctrl->ccr_list, entry) {
+ if (all || ccr->ctrl == NULL) {
+ list_del(&ccr->entry);
+ kfree(ccr);
+ }
+ }
+}
+
static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
{
struct nvmet_ns *cur;
@@ -1397,6 +1411,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
if (!nvmet_is_disc_subsys(ctrl->subsys)) {
ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
ctrl->cirn = get_random_u64();
+ nvmet_ctrl_cleanup_ccrs(ctrl, false);
}
ctrl->csts = NVME_CSTS_RDY;
@@ -1502,6 +1517,35 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
return ctrl;
}
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+ const char *hostnqn, u8 ciu,
+ u16 cntlid, u64 cirn)
+{
+ struct nvmet_ctrl *ctrl, *ictrl = NULL;
+ bool found = false;
+
+ mutex_lock(&subsys->lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->cntlid != cntlid)
+ continue;
+
+ /* Avoid racing with a controller that is becoming ready */
+ mutex_lock(&ctrl->lock);
+ if (ctrl->ciu == ciu && ctrl->cirn == cirn)
+ found = true;
+ mutex_unlock(&ctrl->lock);
+
+ if (found) {
+ if (kref_get_unless_zero(&ctrl->ref))
+ ictrl = ctrl;
+ break;
+ }
+ };
+ mutex_unlock(&subsys->lock);
+
+ return ictrl;
+}
+
u16 nvmet_check_ctrl_status(struct nvmet_req *req)
{
if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) {
@@ -1627,6 +1671,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
subsys->clear_ids = 1;
#endif
+ INIT_LIST_HEAD(&ctrl->ccr_list);
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
@@ -1740,12 +1785,43 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
+static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
+{
+ struct nvmet_subsys *subsys = ctrl->subsys;
+ struct nvmet_ctrl *sctrl;
+ struct nvmet_ccr *ccr;
+
+ lockdep_assert_held(&subsys->lock);
+
+ /* Cleanup all CCRs issued by ctrl as source controller */
+ mutex_lock(&ctrl->lock);
+ nvmet_ctrl_cleanup_ccrs(ctrl, true);
+ mutex_unlock(&ctrl->lock);
+
+ /*
+ * Find all CCRs targeting ctrl as impacted controller and
+ * set ccr->ctrl to NULL. This tells the source controller
+ * that CCR completed successfully.
+ */
+ list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
+ mutex_lock(&sctrl->lock);
+ list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
+ if (ccr->ctrl == ctrl) {
+ ccr->ctrl = NULL;
+ break;
+ }
+ }
+ mutex_unlock(&sctrl->lock);
+ }
+}
+
static void nvmet_ctrl_free(struct kref *ref)
{
struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
struct nvmet_subsys *subsys = ctrl->subsys;
mutex_lock(&subsys->lock);
+ nvmet_ctrl_complete_pending_ccr(ctrl);
nvmet_ctrl_destroy_pr(ctrl);
nvmet_release_p2p_ns_map(ctrl);
list_del(&ctrl->subsys_entry);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index a36daa5d3a57..b06d905c08c8 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -268,6 +268,7 @@ struct nvmet_ctrl {
u32 kato;
u64 cirn;
+ struct list_head ccr_list;
struct nvmet_port *port;
u32 aen_enabled;
@@ -314,6 +315,13 @@ struct nvmet_ctrl {
struct nvmet_pr_log_mgr pr_log_mgr;
};
+struct nvmet_ccr {
+ struct nvmet_ctrl *ctrl;
+ struct list_head entry;
+ u16 icid;
+ u8 ciu;
+};
+
struct nvmet_subsys {
enum nvme_subsys_type type;
@@ -576,6 +584,7 @@ void nvmet_req_free_sgls(struct nvmet_req *req);
void nvmet_execute_set_features(struct nvmet_req *req);
void nvmet_execute_get_features(struct nvmet_req *req);
void nvmet_execute_keep_alive(struct nvmet_req *req);
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req);
u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
u16 nvmet_check_io_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
@@ -618,6 +627,10 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args);
struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
const char *hostnqn, u16 cntlid,
struct nvmet_req *req);
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+ const char *hostnqn, u8 ciu,
+ u16 cntlid, u64 cirn);
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all);
void nvmet_ctrl_put(struct nvmet_ctrl *ctrl);
u16 nvmet_check_ctrl_status(struct nvmet_req *req);
ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl,
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 7746b6d30349..d9b421dc1ef3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -22,6 +22,7 @@
#define NVMF_TSAS_SIZE 256
#define NVMF_CCR_LIMIT 4
+#define NVMF_CCR_PER_PAGE 511
#define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"
@@ -1222,6 +1223,22 @@ struct nvme_zone_mgmt_recv_cmd {
__le32 cdw14[2];
};
+struct nvme_cross_ctrl_reset_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __u8 rsvd10;
+ __u8 ciu;
+ __le16 icid;
+ __le32 cdw11;
+ __le64 cirn;
+ __le32 cdw14;
+ __le32 cdw15;
+};
+
struct nvme_io_mgmt_recv_cmd {
__u8 opcode;
__u8 flags;
@@ -1320,6 +1337,7 @@ enum nvme_admin_opcode {
nvme_admin_virtual_mgmt = 0x1c,
nvme_admin_nvme_mi_send = 0x1d,
nvme_admin_nvme_mi_recv = 0x1e,
+ nvme_admin_cross_ctrl_reset = 0x38,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -1353,6 +1371,7 @@ enum nvme_admin_opcode {
nvme_admin_opcode_name(nvme_admin_virtual_mgmt), \
nvme_admin_opcode_name(nvme_admin_nvme_mi_send), \
nvme_admin_opcode_name(nvme_admin_nvme_mi_recv), \
+ nvme_admin_opcode_name(nvme_admin_cross_ctrl_reset), \
nvme_admin_opcode_name(nvme_admin_dbbuf), \
nvme_admin_opcode_name(nvme_admin_format_nvm), \
nvme_admin_opcode_name(nvme_admin_security_send), \
@@ -2006,6 +2025,7 @@ struct nvme_command {
struct nvme_dbbuf dbbuf;
struct nvme_directive_cmd directive;
struct nvme_io_mgmt_recv_cmd imr;
+ struct nvme_cross_ctrl_reset_cmd ccr;
};
};
@@ -2170,6 +2190,9 @@ enum {
NVME_SC_PMR_SAN_PROHIBITED = 0x123,
NVME_SC_ANA_GROUP_ID_INVALID = 0x124,
NVME_SC_ANA_ATTACH_FAILED = 0x125,
+ NVME_SC_CCR_IN_PROGRESS = 0x13f,
+ NVME_SC_CCR_LOGPAGE_FULL = 0x140,
+ NVME_SC_CCR_LIMIT_EXCEEDED = 0x141,
/*
* I/O Command Set Specific - NVM commands:
--
2.52.0
* [PATCH v3 04/21] nvmet: Implement CCR logpage
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (2 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 03/21] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 05/21] nvmet: Send an AEN on CCR completion Mohamed Khalfella
` (17 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
Defined by TP8028 Rapid Path Failure Recovery, the CCR (Cross-Controller
Reset) logpage contains an entry for each CCR request submitted to a
source controller. Implement the CCR logpage for the linux nvme target.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/target/admin-cmd.c | 44 +++++++++++++++++++++++++++++++++
include/linux/nvme.h | 29 ++++++++++++++++++++++
2 files changed, 73 insertions(+)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 65ed772babb8..925a81979278 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -220,6 +220,7 @@ static void nvmet_execute_get_supported_log_pages(struct nvmet_req *req)
logs->lids[NVME_LOG_FEATURES] = cpu_to_le32(NVME_LIDS_LSUPP);
logs->lids[NVME_LOG_RMI] = cpu_to_le32(NVME_LIDS_LSUPP);
logs->lids[NVME_LOG_RESERVATION] = cpu_to_le32(NVME_LIDS_LSUPP);
+ logs->lids[NVME_LOG_CCR] = cpu_to_le32(NVME_LIDS_LSUPP);
status = nvmet_copy_to_sgl(req, 0, logs, sizeof(*logs));
kfree(logs);
@@ -608,6 +609,47 @@ static void nvmet_execute_get_log_page_features(struct nvmet_req *req)
nvmet_req_complete(req, status);
}
+static void nvmet_execute_get_log_page_ccr(struct nvmet_req *req)
+{
+ struct nvmet_ctrl *ctrl = req->sq->ctrl;
+ struct nvmet_ccr *ccr;
+ struct nvme_ccr_log *log;
+ int index = 0;
+ u16 status;
+
+ log = kzalloc(sizeof(*log), GFP_KERNEL);
+ if (!log) {
+ status = NVME_SC_INTERNAL;
+ goto out;
+ }
+
+ mutex_lock(&ctrl->lock);
+ list_for_each_entry(ccr, &ctrl->ccr_list, entry) {
+ u8 flags = NVME_CCR_FLAGS_VALIDATED | NVME_CCR_FLAGS_INITIATED;
+ u8 status = ccr->ctrl ? NVME_CCR_STATUS_IN_PROGRESS :
+ NVME_CCR_STATUS_SUCCESS;
+
+ log->entries[index].icid = cpu_to_le16(ccr->icid);
+ log->entries[index].ciu = ccr->ciu;
+ log->entries[index].acid = cpu_to_le16(0xffff);
+ log->entries[index].ccrs = status;
+ log->entries[index].ccrf = flags;
+ index++;
+ }
+
+ /* Cleanup completed CCRs if requested */
+ if (req->cmd->get_log_page.lsp & 0x1)
+ nvmet_ctrl_cleanup_ccrs(ctrl, false);
+ mutex_unlock(&ctrl->lock);
+
+ log->ne = cpu_to_le16(index);
+ nvmet_clear_aen_bit(req, NVME_AEN_BIT_CCR_COMPLETE);
+ status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log));
+ kfree(log);
+out:
+ nvmet_req_complete(req, status);
+}
+
static void nvmet_execute_get_log_page(struct nvmet_req *req)
{
if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd)))
@@ -641,6 +683,8 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req)
return nvmet_execute_get_log_page_rmi(req);
case NVME_LOG_RESERVATION:
return nvmet_execute_get_log_page_resv(req);
+ case NVME_LOG_CCR:
+ return nvmet_execute_get_log_page_ccr(req);
}
pr_debug("unhandled lid %d on qid %d\n",
req->cmd->get_log_page.lid, req->sq->qid);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d9b421dc1ef3..9b6d93270c59 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1432,6 +1432,7 @@ enum {
NVME_LOG_FDP_CONFIGS = 0x20,
NVME_LOG_DISC = 0x70,
NVME_LOG_RESERVATION = 0x80,
+ NVME_LOG_CCR = 0x1E,
NVME_FWACT_REPL = (0 << 3),
NVME_FWACT_REPL_ACTV = (1 << 3),
NVME_FWACT_ACTV = (2 << 3),
@@ -1455,6 +1456,34 @@ enum {
NVME_FIS_CSCPE = 1 << 21,
};
+/* NVMe Cross-Controller Reset Status */
+enum {
+ NVME_CCR_STATUS_IN_PROGRESS,
+ NVME_CCR_STATUS_SUCCESS,
+ NVME_CCR_STATUS_FAILED,
+};
+
+/* NVMe Cross-Controller Reset Flags */
+enum {
+ NVME_CCR_FLAGS_VALIDATED = 0x01,
+ NVME_CCR_FLAGS_INITIATED = 0x02,
+};
+
+struct nvme_ccr_log_entry {
+ __le16 icid;
+ __u8 ciu;
+ __u8 rsvd3;
+ __le16 acid;
+ __u8 ccrs;
+ __u8 ccrf;
+};
+
+struct nvme_ccr_log {
+ __le16 ne;
+ __u8 rsvd2[6];
+ struct nvme_ccr_log_entry entries[NVMF_CCR_PER_PAGE];
+};
+
/* NVMe Namespace Write Protect State */
enum {
NVME_NS_NO_WRITE_PROTECT = 0,
--
2.52.0
* [PATCH v3 05/21] nvmet: Send an AEN on CCR completion
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (3 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 04/21] nvmet: Implement CCR logpage Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 06/21] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
` (16 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
Send an AEN to the initiator when the impacted controller exits. The
notification points to the CCR logpage that the initiator can read to
check which CCR operation completed.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
---
drivers/nvme/target/core.c | 25 ++++++++++++++++++++++---
drivers/nvme/target/nvmet.h | 3 ++-
include/linux/nvme.h | 3 +++
3 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 38f71e1a1b8e..a9f8a2242703 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -203,7 +203,7 @@ static void nvmet_async_event_work(struct work_struct *work)
nvmet_async_events_process(ctrl);
}
-void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
u8 event_info, u8 log_page)
{
struct nvmet_async_event *aen;
@@ -216,13 +216,19 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
aen->event_info = event_info;
aen->log_page = log_page;
- mutex_lock(&ctrl->lock);
list_add_tail(&aen->entry, &ctrl->async_events);
- mutex_unlock(&ctrl->lock);
queue_work(nvmet_wq, &ctrl->async_event_work);
}
+void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+ u8 event_info, u8 log_page)
+{
+ mutex_lock(&ctrl->lock);
+ nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
+ mutex_unlock(&ctrl->lock);
+}
+
static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
{
u32 i;
@@ -1785,6 +1791,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
+static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
+{
+ lockdep_assert_held(&ctrl->lock);
+
+ if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
+ return;
+
+ nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
+ NVME_AER_NOTICE_CCR_COMPLETED,
+ NVME_LOG_CCR);
+}
+
static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
{
struct nvmet_subsys *subsys = ctrl->subsys;
@@ -1808,6 +1826,7 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
if (ccr->ctrl == ctrl) {
ccr->ctrl = NULL;
+ nvmet_ctrl_notify_ccr(sctrl);
break;
}
}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index b06d905c08c8..0ed41a3d0562 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -44,7 +44,8 @@
* Supported optional AENs:
*/
#define NVMET_AEN_CFG_OPTIONAL \
- (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE)
+ (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE | \
+ NVME_AEN_CFG_CCR_COMPLETE)
#define NVMET_DISC_AEN_CFG_OPTIONAL \
(NVME_AEN_CFG_DISC_CHANGE)
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 9b6d93270c59..fc33ae48d149 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -860,12 +860,14 @@ enum {
NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
NVME_AER_NOTICE_ANA = 0x03,
NVME_AER_NOTICE_DISC_CHANGED = 0xf0,
+ NVME_AER_NOTICE_CCR_COMPLETED = 0xf4,
};
enum {
NVME_AEN_BIT_NS_ATTR = 8,
NVME_AEN_BIT_FW_ACT = 9,
NVME_AEN_BIT_ANA_CHANGE = 11,
+ NVME_AEN_BIT_CCR_COMPLETE = 20,
NVME_AEN_BIT_DISC_CHANGE = 31,
};
@@ -873,6 +875,7 @@ enum {
NVME_AEN_CFG_NS_ATTR = 1 << NVME_AEN_BIT_NS_ATTR,
NVME_AEN_CFG_FW_ACT = 1 << NVME_AEN_BIT_FW_ACT,
NVME_AEN_CFG_ANA_CHANGE = 1 << NVME_AEN_BIT_ANA_CHANGE,
+ NVME_AEN_CFG_CCR_COMPLETE = 1 << NVME_AEN_BIT_CCR_COMPLETE,
NVME_AEN_CFG_DISC_CHANGE = 1 << NVME_AEN_BIT_DISC_CHANGE,
};
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 06/21] nvme: Rapid Path Failure Recovery read controller identify fields
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (4 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 05/21] nvmet: Send an AEN on CCR completion Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
` (15 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
TP8028 Rapid Path Failure Recovery added new fields to the controller
identify response. Read CIU (Controller Instance Uniquifier), CIRN
(Controller Instance Random Number), and CCRL (Cross-Controller Reset
Limit) from the controller identify response. Expose CIU and CIRN as
sysfs attributes so the values can be used directly by userspace if
needed.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
drivers/nvme/host/core.c | 4 ++++
drivers/nvme/host/nvme.h | 10 ++++++++++
drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
3 files changed, 37 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 19b67cf5d550..8d26e27992fc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3572,6 +3572,10 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->crdt[1] = le16_to_cpu(id->crdt2);
ctrl->crdt[2] = le16_to_cpu(id->crdt3);
+ ctrl->ciu = id->ciu;
+ ctrl->cirn = le64_to_cpu(id->cirn);
+ atomic_set(&ctrl->ccr_limit, id->ccrl);
+
ctrl->oacs = le16_to_cpu(id->oacs);
ctrl->oncs = le16_to_cpu(id->oncs);
ctrl->mtfa = le16_to_cpu(id->mtfa);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a5f28c5103c..6984950b9aa8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -328,11 +328,14 @@ struct nvme_ctrl {
u16 crdt[3];
u16 oncs;
u8 dmrl;
+ u8 ciu;
u32 dmrsl;
+ u64 cirn;
u16 oacs;
u16 sqsize;
u32 max_namespaces;
atomic_t abort_limit;
+ atomic_t ccr_limit;
u8 vwc;
u32 vs;
u32 sgls;
@@ -1225,4 +1228,11 @@ static inline bool nvme_multi_css(struct nvme_ctrl *ctrl)
return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
}
+static inline unsigned long nvme_fence_timeout_ms(struct nvme_ctrl *ctrl)
+{
+ if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
+ return 3 * ctrl->kato * 1000;
+ return 2 * ctrl->kato * 1000;
+}
+
#endif /* _NVME_H */
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..cd835dd2377f 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
nvme_show_int_function(sqsize);
nvme_show_int_function(kato);
+static ssize_t nvme_sysfs_ciu_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%02x\n", ctrl->ciu);
+}
+static DEVICE_ATTR(ciu, S_IRUSR, nvme_sysfs_ciu_show, NULL);
+
+static ssize_t nvme_sysfs_cirn_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
+}
+static DEVICE_ATTR(cirn, S_IRUSR, nvme_sysfs_cirn_show, NULL);
+
+
static ssize_t nvme_sysfs_delete(struct device *dev,
struct device_attribute *attr, const char *buf,
size_t count)
@@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_numa_node.attr,
&dev_attr_queue_count.attr,
&dev_attr_sqsize.attr,
+ &dev_attr_ciu.attr,
+ &dev_attr_cirn.attr,
&dev_attr_hostnqn.attr,
&dev_attr_hostid.attr,
&dev_attr_ctrl_loss_tmo.attr,
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (5 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 06/21] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:33 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 08/21] nvme: Implement cross-controller reset recovery Mohamed Khalfella
` (14 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
FENCING is a new controller state that a LIVE controller enters when an
error is encountered. While in the FENCING state, inflight IOs that
time out are not canceled because they must be held until either CCR
succeeds or time-based recovery completes. Although the queues remain
alive, requests are not allowed to be sent in this state and the
controller cannot be reset or deleted. This is intentional because
resetting or deleting the controller would cancel inflight IOs.
FENCED is a short-term state the controller enters before it is reset.
It exists only to prevent manual resets from happening while the
controller is in the FENCING state.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 27 +++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 4 ++++
drivers/nvme/host/sysfs.c | 2 ++
3 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8d26e27992fc..231d402e9bfb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -574,10 +574,29 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
+ case NVME_CTRL_FENCING:
+ switch (old_state) {
+ case NVME_CTRL_LIVE:
+ changed = true;
+ fallthrough;
+ default:
+ break;
+ }
+ break;
+ case NVME_CTRL_FENCED:
+ switch (old_state) {
+ case NVME_CTRL_FENCING:
+ changed = true;
+ fallthrough;
+ default:
+ break;
+ }
+ break;
case NVME_CTRL_RESETTING:
switch (old_state) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
+ case NVME_CTRL_FENCED:
changed = true;
fallthrough;
default:
@@ -760,6 +779,8 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
if (state != NVME_CTRL_DELETING_NOIO &&
state != NVME_CTRL_DELETING &&
+ state != NVME_CTRL_FENCING &&
+ state != NVME_CTRL_FENCED &&
state != NVME_CTRL_DEAD &&
!test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
!blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
@@ -802,10 +823,12 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
req->cmd->fabrics.fctype == nvme_fabrics_type_auth_receive))
return true;
break;
- default:
- break;
+ case NVME_CTRL_FENCING:
+ case NVME_CTRL_FENCED:
case NVME_CTRL_DEAD:
return false;
+ default:
+ break;
}
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 6984950b9aa8..b1c37eb3379e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -251,6 +251,8 @@ static inline u16 nvme_req_qid(struct request *req)
enum nvme_ctrl_state {
NVME_CTRL_NEW,
NVME_CTRL_LIVE,
+ NVME_CTRL_FENCING,
+ NVME_CTRL_FENCED,
NVME_CTRL_RESETTING,
NVME_CTRL_CONNECTING,
NVME_CTRL_DELETING,
@@ -776,6 +778,8 @@ static inline bool nvme_state_terminal(struct nvme_ctrl *ctrl)
switch (nvme_ctrl_state(ctrl)) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
+ case NVME_CTRL_FENCING:
+ case NVME_CTRL_FENCED:
case NVME_CTRL_RESETTING:
case NVME_CTRL_CONNECTING:
return false;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index cd835dd2377f..1e4261144933 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -443,6 +443,8 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
static const char *const state_name[] = {
[NVME_CTRL_NEW] = "new",
[NVME_CTRL_LIVE] = "live",
+ [NVME_CTRL_FENCING] = "fencing",
+ [NVME_CTRL_FENCED] = "fenced",
[NVME_CTRL_RESETTING] = "resetting",
[NVME_CTRL_CONNECTING] = "connecting",
[NVME_CTRL_DELETING] = "deleting",
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 08/21] nvme: Implement cross-controller reset recovery
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (6 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:41 ` Hannes Reinecke
2026-02-26 2:37 ` Randy Jennings
2026-02-14 4:25 ` [PATCH v3 09/21] nvme: Implement cross-controller reset completion Mohamed Khalfella
` (13 subsequent siblings)
21 siblings, 2 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
A host that has more than one path connecting to an nvme subsystem
typically has an nvme controller associated with every path. This is
mostly applicable to nvmeof. If one path goes down, inflight IOs on that
path should not be retried immediately on another path because this
could lead to data corruption as described in TP4129. TP8028 defines a
cross-controller reset mechanism that the host can use to terminate
IOs on the failed path through one of the remaining healthy paths.
Inflight IOs should be retried on another path only after they are
terminated, or after enough time passes as defined by TP4129. Implement
the core cross-controller reset logic shared by the transports.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 141 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 9 +++
3 files changed, 151 insertions(+)
diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index dc90df9e13a2..f679efd5110e 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
[nvme_admin_virtual_mgmt] = "Virtual Management",
[nvme_admin_nvme_mi_send] = "NVMe Send MI",
[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
+ [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
[nvme_admin_dbbuf] = "Doorbell Buffer Config",
[nvme_admin_format_nvm] = "Format NVM",
[nvme_admin_security_send] = "Security Send",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 231d402e9bfb..765b1524b3ed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -554,6 +554,146 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
+static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
+ u32 min_cntlid)
+{
+ struct nvme_subsystem *subsys = ictrl->subsys;
+ struct nvme_ctrl *ctrl, *sctrl = NULL;
+ unsigned long flags;
+
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->cntlid < min_cntlid)
+ continue;
+
+ if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
+ continue;
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ if (ctrl->state != NVME_CTRL_LIVE) {
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ atomic_inc(&ctrl->ccr_limit);
+ continue;
+ }
+
+ /*
+ * We got a good candidate source controller that is locked and
+ * LIVE. However, no guarantee ctrl will not be deleted after
+ * ctrl->lock is released. Get a ref of both ctrl and admin_q
+ * so they do not disappear until we are done with them.
+ */
+ WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
+ nvme_get_ctrl(ctrl);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ sctrl = ctrl;
+ break;
+ }
+ mutex_unlock(&nvme_subsystems_lock);
+ return sctrl;
+}
+
+static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
+{
+ atomic_inc(&sctrl->ccr_limit);
+ blk_put_queue(sctrl->admin_q);
+ nvme_put_ctrl(sctrl);
+}
+
+static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
+{
+ struct nvme_ccr_entry ccr = { };
+ union nvme_result res = { 0 };
+ struct nvme_command c = { };
+ unsigned long flags, tmo;
+ bool completed = false;
+ int ret = 0;
+ u32 result;
+
+ init_completion(&ccr.complete);
+ ccr.ictrl = ictrl;
+
+ spin_lock_irqsave(&sctrl->lock, flags);
+ list_add_tail(&ccr.list, &sctrl->ccr_list);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+
+ c.ccr.opcode = nvme_admin_cross_ctrl_reset;
+ c.ccr.ciu = ictrl->ciu;
+ c.ccr.icid = cpu_to_le16(ictrl->cntlid);
+ c.ccr.cirn = cpu_to_le64(ictrl->cirn);
+ ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
+ NULL, 0, NVME_QID_ANY, 0);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ result = le32_to_cpu(res.u32);
+ if (result & 0x01) /* Immediate Reset Successful */
+ goto out;
+
+ tmo = secs_to_jiffies(ictrl->kato);
+ if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
+ ret = -ETIMEDOUT;
+ goto out;
+ }
+
+ completed = true;
+
+out:
+ spin_lock_irqsave(&sctrl->lock, flags);
+ list_del(&ccr.list);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+ if (completed) {
+ if (ccr.ccrs == NVME_CCR_STATUS_SUCCESS)
+ return 0;
+ return -EREMOTEIO;
+ }
+ return ret;
+}
+
+unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
+{
+ unsigned long deadline, now, timeout;
+ struct nvme_ctrl *sctrl;
+ u32 min_cntlid = 0;
+ int ret;
+
+ timeout = nvme_fence_timeout_ms(ictrl);
+ dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
+
+ now = jiffies;
+ deadline = now + msecs_to_jiffies(timeout);
+ while (time_before(now, deadline)) {
+ sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
+ if (!sctrl) {
+ /* CCR failed, switch to time-based recovery */
+ return deadline - now;
+ }
+
+ ret = nvme_issue_wait_ccr(sctrl, ictrl);
+ if (!ret) {
+ dev_info(ictrl->device, "CCR succeeded using %s\n",
+ dev_name(sctrl->device));
+ nvme_put_ctrl_ccr(sctrl);
+ return 0;
+ }
+
+ min_cntlid = sctrl->cntlid + 1;
+ nvme_put_ctrl_ccr(sctrl);
+ now = jiffies;
+
+ if (ret == -EIO) /* CCR command failed */
+ continue;
+
+ /* CCR operation failed or timed out */
+ return time_before(now, deadline) ? deadline - now : 0;
+ }
+
+ dev_info(ictrl->device, "CCR reached timeout, call it done\n");
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_fence_ctrl);
+
bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state)
{
@@ -5121,6 +5261,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
mutex_init(&ctrl->scan_lock);
INIT_LIST_HEAD(&ctrl->namespaces);
+ INIT_LIST_HEAD(&ctrl->ccr_list);
xa_init(&ctrl->cels);
ctrl->dev = dev;
ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index b1c37eb3379e..f3ab9411cac5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -279,6 +279,13 @@ enum nvme_ctrl_flags {
NVME_CTRL_FROZEN = 6,
};
+struct nvme_ccr_entry {
+ struct list_head list;
+ struct completion complete;
+ struct nvme_ctrl *ictrl;
+ u8 ccrs;
+};
+
struct nvme_ctrl {
bool comp_seen;
bool identified;
@@ -296,6 +303,7 @@ struct nvme_ctrl {
struct blk_mq_tag_set *tagset;
struct blk_mq_tag_set *admin_tagset;
struct list_head namespaces;
+ struct list_head ccr_list;
struct mutex namespaces_lock;
struct srcu_struct srcu;
struct device ctrl_device;
@@ -813,6 +821,7 @@ blk_status_t nvme_host_path_error(struct request *req);
bool nvme_cancel_request(struct request *req, void *data);
void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
+unsigned long nvme_fence_ctrl(struct nvme_ctrl *ctrl);
bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state);
int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (7 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 08/21] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:43 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
` (12 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
The nvme host that issued a CCR command through a source controller
expects to receive NVME_AER_NOTICE_CCR_COMPLETED on that controller
when the pending CCR succeeds or fails. Add sctrl->ccr_work to read the
NVME_LOG_CCR log page and wake up any thread waiting on CCR completion.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 765b1524b3ed..a9fcde1b411b 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1916,7 +1916,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
#define NVME_AEN_SUPPORTED \
(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
- NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
+ NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
+ NVME_AEN_CFG_DISC_CHANGE)
static void nvme_enable_aen(struct nvme_ctrl *ctrl)
{
@@ -4880,6 +4881,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
kfree(log);
}
+static void nvme_ccr_work(struct work_struct *work)
+{
+ struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
+ struct nvme_ccr_entry *ccr;
+ struct nvme_ccr_log_entry *entry;
+ struct nvme_ccr_log *log;
+ unsigned long flags;
+ int ret, i;
+
+ log = kmalloc(sizeof(*log), GFP_KERNEL);
+ if (!log)
+ return;
+
+ ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
+ 0x00, log, sizeof(*log), 0);
+ if (ret)
+ goto out;
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ for (i = 0; i < le16_to_cpu(log->ne); i++) {
+ entry = &log->entries[i];
+ if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
+ continue;
+
+ list_for_each_entry(ccr, &ctrl->ccr_list, list) {
+ struct nvme_ctrl *ictrl = ccr->ictrl;
+
+ if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
+ ictrl->ciu != entry->ciu)
+ continue;
+
+ /* Complete matching entry */
+ ccr->ccrs = entry->ccrs;
+ complete(&ccr->complete);
+ }
+ }
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+out:
+ kfree(log);
+}
+
static void nvme_fw_act_work(struct work_struct *work)
{
struct nvme_ctrl *ctrl = container_of(work,
@@ -4956,6 +4998,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
case NVME_AER_NOTICE_DISC_CHANGED:
ctrl->aen_result = result;
break;
+ case NVME_AER_NOTICE_CCR_COMPLETED:
+ queue_work(nvme_wq, &ctrl->ccr_work);
+ break;
default:
dev_warn(ctrl->device, "async event result %08x\n", result);
}
@@ -5145,6 +5190,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
nvme_stop_failfast_work(ctrl);
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fw_act_work);
+ cancel_work_sync(&ctrl->ccr_work);
if (ctrl->ops->stop_ctrl)
ctrl->ops->stop_ctrl(ctrl);
}
@@ -5268,6 +5314,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
ctrl->quirks = quirks;
ctrl->numa_node = NUMA_NO_NODE;
INIT_WORK(&ctrl->scan_work, nvme_scan_work);
+ INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f3ab9411cac5..af6a4e83053e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -365,6 +365,7 @@ struct nvme_ctrl {
struct nvme_effects_log *effects;
struct xarray cels;
struct work_struct scan_work;
+ struct work_struct ccr_work;
struct work_struct async_event_work;
struct delayed_work ka_work;
struct delayed_work failfast_work;
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (8 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 09/21] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:47 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 11/21] nvme-rdma: " Mohamed Khalfella
` (11 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error now moves to the FENCING
state instead of the RESETTING state. ctrl->fencing_work attempts CCR
to terminate inflight IOs. Regardless of whether the CCR operation
succeeds or fails, the controller is then transitioned to the RESETTING
state to continue the error recovery process.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/tcp.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69cb04406b47..229cfdffd848 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -193,6 +193,7 @@ struct nvme_tcp_ctrl {
struct sockaddr_storage src_addr;
struct nvme_ctrl ctrl;
+ struct work_struct fencing_work;
struct work_struct err_work;
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
@@ -611,6 +612,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
+ dev_warn(ctrl->device, "starting controller fencing\n");
+ queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
+ return;
+ }
+
if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
return;
@@ -2470,12 +2477,31 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
nvme_tcp_reconnect_or_remove(ctrl, ret);
}
+static void nvme_tcp_fencing_work(struct work_struct *work)
+{
+ struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
+ struct nvme_tcp_ctrl, fencing_work);
+ struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+ unsigned long rem;
+
+ rem = nvme_fence_ctrl(ctrl);
+ if (rem) {
+ dev_info(ctrl->device,
+ "CCR failed, skipping time-based recovery\n");
+ }
+
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
+}
+
static void nvme_tcp_error_recovery_work(struct work_struct *work)
{
struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
struct nvme_tcp_ctrl, err_work);
struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+ flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
if (nvme_tcp_key_revoke_needed(ctrl))
nvme_auth_revoke_tls_key(ctrl);
nvme_stop_keep_alive(ctrl);
@@ -2518,6 +2544,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
container_of(work, struct nvme_ctrl, reset_work);
int ret;
+ flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
if (nvme_tcp_key_revoke_needed(ctrl))
nvme_auth_revoke_tls_key(ctrl);
nvme_stop_ctrl(ctrl);
@@ -2643,13 +2670,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
struct nvme_command *cmd = &pdu->cmd;
int qid = nvme_tcp_queue_id(req->queue);
+ enum nvme_ctrl_state state;
dev_warn(ctrl->device,
"I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
nvme_fabrics_opcode_str(qid, cmd), qid);
- if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
+ state = nvme_ctrl_state(ctrl);
+ if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
/*
* If we are resetting, connecting or deleting we should
* complete immediately because we may block controller
@@ -2904,6 +2933,7 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->connect_work,
nvme_tcp_reconnect_ctrl_work);
+ INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 11/21] nvme-rdma: Use CCR to recover controller that hits an error
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (9 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:47 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
` (10 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error now moves to the FENCING
state instead of the RESETTING state. ctrl->fencing_work attempts CCR
to terminate inflight IOs. Regardless of whether the CCR operation
succeeds or fails, the controller is then transitioned to the RESETTING
state to continue the error recovery process.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/rdma.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 35c0822edb2d..2fb47f41215f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -106,6 +106,7 @@ struct nvme_rdma_ctrl {
/* other member variables */
struct blk_mq_tag_set tag_set;
+ struct work_struct fencing_work;
struct work_struct err_work;
struct nvme_rdma_qe async_event_sqe;
@@ -1120,11 +1121,30 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
nvme_rdma_reconnect_or_remove(ctrl, ret);
}
+static void nvme_rdma_fencing_work(struct work_struct *work)
+{
+ struct nvme_rdma_ctrl *rdma_ctrl = container_of(work,
+ struct nvme_rdma_ctrl, fencing_work);
+ struct nvme_ctrl *ctrl = &rdma_ctrl->ctrl;
+ unsigned long rem;
+
+ rem = nvme_fence_ctrl(ctrl);
+ if (rem) {
+ dev_info(ctrl->device,
+ "CCR failed, skipping time-based recovery\n");
+ }
+
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
+}
+
static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, err_work);
+ flush_work(&ctrl->fencing_work);
nvme_stop_keep_alive(&ctrl->ctrl);
flush_work(&ctrl->ctrl.async_event_work);
nvme_rdma_teardown_io_queues(ctrl, false);
@@ -1147,6 +1167,12 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
{
+ if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
+ dev_warn(ctrl->ctrl.device, "starting controller fencing\n");
+ queue_work(nvme_wq, &ctrl->fencing_work);
+ return;
+ }
+
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
return;
@@ -1957,13 +1983,15 @@ static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq)
struct nvme_rdma_ctrl *ctrl = queue->ctrl;
struct nvme_command *cmd = req->req.cmd;
int qid = nvme_rdma_queue_idx(queue);
+ enum nvme_ctrl_state state;
dev_warn(ctrl->ctrl.device,
"I/O tag %d (%04x) opcode %#x (%s) QID %d timeout\n",
rq->tag, nvme_cid(rq), cmd->common.opcode,
nvme_fabrics_opcode_str(qid, cmd), qid);
- if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_LIVE) {
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
/*
* If we are resetting, connecting or deleting we should
* complete immediately because we may block controller
@@ -2169,6 +2197,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
int ret;
+ flush_work(&ctrl->fencing_work);
nvme_stop_ctrl(&ctrl->ctrl);
nvme_rdma_shutdown_ctrl(ctrl, false);
@@ -2281,6 +2310,7 @@ static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->reconnect_work,
nvme_rdma_reconnect_ctrl_work);
+ INIT_WORK(&ctrl->fencing_work, nvme_rdma_fencing_work);
INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (10 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 11/21] nvme-rdma: " Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-28 0:12 ` James Smart
2026-02-14 4:25 ` [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
` (9 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
nvme_fc_error_recovery() called from nvme_fc_timeout() while the
controller is in CONNECTING state results in the deadlock reported in
the link below. Update nvme_fc_timeout() to schedule error recovery
instead, avoiding the deadlock.
Prior to this change, error recovery reset the controller if it was
LIVE, which does not match nvme-tcp and nvme-rdma. Decouple error
recovery from controller reset to match the other fabric transports.
Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 120 +++++++++++++++++++++++------------------
1 file changed, 67 insertions(+), 53 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 6948de3f438a..e6ffaa19aba4 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -227,6 +227,10 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
static struct device *fc_udev_device;
static void nvme_fc_complete_rq(struct request *rq);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+ char *errmsg);
+static void __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl,
+ bool start_queues);
/* *********************** FC-NVME Port Management ************************ */
@@ -788,7 +792,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
"Reconnect", ctrl->cnum);
set_bit(ASSOC_FAILED, &ctrl->flags);
- nvme_reset_ctrl(&ctrl->ctrl);
+ nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
}
/**
@@ -985,7 +989,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
-static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
+static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
static void
__nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
@@ -1567,9 +1571,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
* for the association have been ABTS'd by
* nvme_fc_delete_association().
*/
-
- /* fail the association */
- nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
+ nvme_fc_start_ioerr_recovery(ctrl,
+ "Disconnect Association LS received");
/* release the reference taken by nvme_fc_match_disconn_ls() */
nvme_fc_ctrl_put(ctrl);
@@ -1871,7 +1874,22 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
struct nvme_fc_ctrl *ctrl =
container_of(work, struct nvme_fc_ctrl, ioerr_work);
- nvme_fc_error_recovery(ctrl, "transport detected io error");
+ /*
+ * if an error (io timeout, etc) while (re)connecting, the remote
+ * port requested terminating of the association (disconnect_ls)
+ * or an error (timeout or abort) occurred on an io while creating
+ * the controller. Abort any ios on the association and let the
+ * create_association error path resolve things.
+ */
+ if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_CONNECTING) {
+ __nvme_fc_abort_outstanding_ios(ctrl, true);
+ dev_warn(ctrl->ctrl.device,
+ "NVME-FC{%d}: transport error during (re)connect\n",
+ ctrl->cnum);
+ return;
+ }
+
+ nvme_fc_error_recovery(ctrl);
}
/*
@@ -1892,6 +1910,25 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
}
EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+ char *errmsg)
+{
+ enum nvme_ctrl_state state;
+
+ if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
+ dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
+ ctrl->cnum, errmsg);
+ queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+ return;
+ }
+
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ if (state == NVME_CTRL_CONNECTING || state == NVME_CTRL_DELETING ||
+ state == NVME_CTRL_DELETING_NOIO) {
+ queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+ }
+}
+
static void
nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
{
@@ -2049,9 +2086,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
nvme_fc_complete_rq(rq);
check_error:
- if (terminate_assoc &&
- nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
- queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+ if (terminate_assoc)
+ nvme_fc_start_ioerr_recovery(ctrl, "io error");
}
static int
@@ -2495,39 +2531,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
nvme_unquiesce_admin_queue(&ctrl->ctrl);
}
-static void
-nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
-{
- enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
-
- /*
- * if an error (io timeout, etc) while (re)connecting, the remote
- * port requested terminating of the association (disconnect_ls)
- * or an error (timeout or abort) occurred on an io while creating
- * the controller. Abort any ios on the association and let the
- * create_association error path resolve things.
- */
- if (state == NVME_CTRL_CONNECTING) {
- __nvme_fc_abort_outstanding_ios(ctrl, true);
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: transport error during (re)connect\n",
- ctrl->cnum);
- return;
- }
-
- /* Otherwise, only proceed if in LIVE state - e.g. on first error */
- if (state != NVME_CTRL_LIVE)
- return;
-
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: transport association event: %s\n",
- ctrl->cnum, errmsg);
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: resetting controller\n", ctrl->cnum);
-
- nvme_reset_ctrl(&ctrl->ctrl);
-}
-
static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
{
struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
@@ -2536,24 +2539,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
struct nvme_command *sqe = &cmdiu->sqe;
- /*
- * Attempt to abort the offending command. Command completion
- * will detect the aborted io and will fail the connection.
- */
dev_info(ctrl->ctrl.device,
"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
"x%08x/x%08x\n",
ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
nvme_fabrics_opcode_str(qnum, sqe),
sqe->common.cdw10, sqe->common.cdw11);
- if (__nvme_fc_abort_op(ctrl, op))
- nvme_fc_error_recovery(ctrl, "io timeout abort failed");
- /*
- * the io abort has been initiated. Have the reset timer
- * restarted and the abort completion will complete the io
- * shortly. Avoids a synchronous wait while the abort finishes.
- */
+ nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
return BLK_EH_RESET_TIMER;
}
@@ -3352,6 +3345,27 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
}
}
+static void
+nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
+{
+ nvme_stop_keep_alive(&ctrl->ctrl);
+ nvme_stop_ctrl(&ctrl->ctrl);
+ flush_work(&ctrl->ctrl.async_event_work);
+
+ /* will block while waiting for io to terminate */
+ nvme_fc_delete_association(ctrl);
+
+ /* Do not reconnect if controller is being deleted */
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
+ return;
+
+ if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
+ queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
+ return;
+ }
+
+ nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
+}
static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.name = "fc",
--
2.52.0
* [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (11 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-28 1:03 ` James Smart
2026-02-14 4:25 ` [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
` (8 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error now moves to FENCING state
instead of RESETTING state. ctrl->fencing_work attempts CCR to
terminate inflight IOs. Regardless of whether the CCR operation
succeeds or fails, the controller is then transitioned to RESETTING
state to continue the error recovery process.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e6ffaa19aba4..6ebabfb7e76d 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -166,6 +166,7 @@ struct nvme_fc_ctrl {
struct blk_mq_tag_set admin_tag_set;
struct blk_mq_tag_set tag_set;
+ struct work_struct fencing_work;
struct work_struct ioerr_work;
struct delayed_work connect_work;
@@ -1868,6 +1869,24 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
}
}
+static void nvme_fc_fencing_work(struct work_struct *work)
+{
+ struct nvme_fc_ctrl *fc_ctrl =
+ container_of(work, struct nvme_fc_ctrl, fencing_work);
+ struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
+ unsigned long rem;
+
+ rem = nvme_fence_ctrl(ctrl);
+ if (rem) {
+ dev_info(ctrl->device,
+ "CCR failed, skipping time-based recovery\n");
+ }
+
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
+}
+
static void
nvme_fc_ctrl_ioerr_work(struct work_struct *work)
{
@@ -1889,6 +1908,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
return;
}
+ flush_work(&ctrl->fencing_work);
nvme_fc_error_recovery(ctrl);
}
@@ -1915,6 +1935,14 @@ static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
{
enum nvme_ctrl_state state;
+ if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
+ dev_warn(ctrl->ctrl.device,
+ "NVME-FC{%d}: starting controller fencing %s\n",
+ ctrl->cnum, errmsg);
+ queue_work(nvme_wq, &ctrl->fencing_work);
+ return;
+ }
+
if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
ctrl->cnum, errmsg);
@@ -3322,6 +3350,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
struct nvme_fc_ctrl *ctrl =
container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
+ flush_work(&ctrl->fencing_work);
nvme_stop_ctrl(&ctrl->ctrl);
/* will block while waiting for io to terminate */
@@ -3497,6 +3526,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
+ INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);
--
2.52.0
* [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (12 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-27 2:49 ` Randy Jennings
2026-02-28 1:10 ` James Smart
2026-02-14 4:25 ` [PATCH v3 15/21] nvme-fc: Do not cancel requests in io tagset before it is initialized Mohamed Khalfella
` (7 subsequent siblings)
21 siblings, 2 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
While in FENCING state, aborted inflight IOs should be held until fencing
is done. Update nvme_fc_fcpio_done() to not complete aborted requests or
requests with transport errors. These held requests will be canceled in
nvme_fc_delete_association() after fencing is done. nvme_fc_fcpio_done()
avoids racing with canceling aborted requests by making sure we complete
successful requests before waking up the waiting thread.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 61 +++++++++++++++++++++++++++++++++++-------
1 file changed, 51 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 6ebabfb7e76d..e605dd3f4a40 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -172,7 +172,7 @@ struct nvme_fc_ctrl {
struct kref ref;
unsigned long flags;
- u32 iocnt;
+ atomic_t iocnt;
wait_queue_head_t ioabort_wait;
struct nvme_fc_fcp_op aen_ops[NVME_NR_AEN_COMMANDS];
@@ -1823,7 +1823,7 @@ __nvme_fc_abort_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_fcp_op *op)
atomic_set(&op->state, opstate);
else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
op->flags |= FCOP_FLAGS_TERMIO;
- ctrl->iocnt++;
+ atomic_inc(&ctrl->iocnt);
}
spin_unlock_irqrestore(&ctrl->lock, flags);
@@ -1853,20 +1853,29 @@ nvme_fc_abort_aen_ops(struct nvme_fc_ctrl *ctrl)
}
static inline void
+__nvme_fc_fcpop_count_one_down(struct nvme_fc_ctrl *ctrl)
+{
+ if (atomic_dec_return(&ctrl->iocnt) == 0)
+ wake_up(&ctrl->ioabort_wait);
+}
+
+static inline bool
__nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
struct nvme_fc_fcp_op *op, int opstate)
{
unsigned long flags;
+ bool ret = false;
if (opstate == FCPOP_STATE_ABORTED) {
spin_lock_irqsave(&ctrl->lock, flags);
if (test_bit(FCCTRL_TERMIO, &ctrl->flags) &&
op->flags & FCOP_FLAGS_TERMIO) {
- if (!--ctrl->iocnt)
- wake_up(&ctrl->ioabort_wait);
+ ret = true;
}
spin_unlock_irqrestore(&ctrl->lock, flags);
}
+
+ return ret;
}
static void nvme_fc_fencing_work(struct work_struct *work)
@@ -1969,7 +1978,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_command *sqe = &op->cmd_iu.sqe;
__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
union nvme_result result;
- bool terminate_assoc = true;
+ bool op_term, terminate_assoc = true;
+ enum nvme_ctrl_state state;
int opstate;
/*
@@ -2102,16 +2112,38 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
done:
if (op->flags & FCOP_FLAGS_AEN) {
nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+ __nvme_fc_fcpop_count_one_down(ctrl);
atomic_set(&op->state, FCPOP_STATE_IDLE);
op->flags = FCOP_FLAGS_AEN; /* clear other flags */
nvme_fc_ctrl_put(ctrl);
goto check_error;
}
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ /*
+ * We can not access op after the request is completed because it can
+ * be reused immediately. At the same time we want to wakeup the thread
+ * waiting for ongoing IOs _after_ requests are completed. This is
+ * necessary because that thread will start canceling inflight IOs
+ * and we want to avoid request completion racing with cancellation.
+ */
+ op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+
+ /*
+ * If we are going to terminate associations and the controller is
+ * LIVE or FENCING, then do not complete this request now. Let error
+ * recovery cancel this request when it is safe to do so.
+ */
+ state = nvme_ctrl_state(&ctrl->ctrl);
+ if (terminate_assoc &&
+ (state == NVME_CTRL_LIVE || state == NVME_CTRL_FENCING))
+ goto check_op_term;
+
if (!nvme_try_complete_req(rq, status, result))
nvme_fc_complete_rq(rq);
+check_op_term:
+ if (op_term)
+ __nvme_fc_fcpop_count_one_down(ctrl);
check_error:
if (terminate_assoc)
@@ -2750,7 +2782,8 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
* cmd with the csn was supposed to arrive.
*/
opstate = atomic_xchg(&op->state, FCPOP_STATE_COMPLETE);
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+ __nvme_fc_fcpop_count_one_down(ctrl);
if (!(op->flags & FCOP_FLAGS_AEN)) {
nvme_fc_unmap_data(ctrl, op->rq, op);
@@ -3219,7 +3252,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
spin_lock_irqsave(&ctrl->lock, flags);
set_bit(FCCTRL_TERMIO, &ctrl->flags);
- ctrl->iocnt = 0;
+ atomic_set(&ctrl->iocnt, 0);
spin_unlock_irqrestore(&ctrl->lock, flags);
__nvme_fc_abort_outstanding_ios(ctrl, false);
@@ -3228,11 +3261,19 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
nvme_fc_abort_aen_ops(ctrl);
/* wait for all io that had to be aborted */
+ wait_event(ctrl->ioabort_wait, atomic_read(&ctrl->iocnt) == 0);
spin_lock_irq(&ctrl->lock);
- wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
clear_bit(FCCTRL_TERMIO, &ctrl->flags);
spin_unlock_irq(&ctrl->lock);
+ /*
+ * At this point all inflight requests have been successfully
+ * aborted. Now it is safe to cancel all requests we decided
+ * not to complete in nvme_fc_fcpio_done().
+ */
+ nvme_cancel_tagset(&ctrl->ctrl);
+ nvme_cancel_admin_tagset(&ctrl->ctrl);
+
nvme_fc_term_aen_ops(ctrl);
/*
--
2.52.0
* [PATCH v3 15/21] nvme-fc: Do not cancel requests in io tagset before it is initialized
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (13 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-28 1:12 ` James Smart
2026-02-14 4:25 ` [PATCH v3 16/21] nvmet: Add support for CQT to nvme target Mohamed Khalfella
` (6 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
A new nvme-fc controller in CONNECTING state that sees an admin request
time out schedules ctrl->ioerr_work to abort inflight requests. This
ends up calling __nvme_fc_abort_outstanding_ios(), which aborts
requests in both the admin and io tagsets. If fc_ctrl->tag_set was not
yet initialized we see the warning below, because ctrl.queue_count is
initialized early in nvme_fc_alloc_ctrl().
nvme nvme0: NVME-FC{0}: starting error recovery Connectivity Loss
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
lpfc 0000:ab:00.0: queue 0 connect admin queue failed (-6).
you didn't initialize this object before use?
turning off the locking correctness validator.
Workqueue: nvme-reset-wq nvme_fc_ctrl_ioerr_work [nvme_fc]
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x80
register_lock_class+0x567/0x580
__lock_acquire+0x330/0xb90
lock_acquire.part.0+0xad/0x210
blk_mq_tagset_busy_iter+0xf9/0xc00
__nvme_fc_abort_outstanding_ios+0x23f/0x320 [nvme_fc]
nvme_fc_ctrl_ioerr_work+0x172/0x210 [nvme_fc]
process_one_work+0x82c/0x1450
worker_thread+0x5ee/0xfd0
kthread+0x3a0/0x750
ret_from_fork+0x439/0x670
ret_from_fork_asm+0x1a/0x30
</TASK>
Update the check in __nvme_fc_abort_outstanding_ios() to confirm that
the io tagset was created before iterating over busy requests. Also
make sure to cancel ctrl->ioerr_work before removing the io tagset.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index e605dd3f4a40..eac3a7ccaa5c 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2557,7 +2557,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
* io requests back to the block layer as part of normal completions
* (but with error status).
*/
- if (ctrl->ctrl.queue_count > 1) {
+ if (ctrl->ctrl.queue_count > 1 && ctrl->ctrl.tagset) {
nvme_quiesce_io_queues(&ctrl->ctrl);
nvme_sync_io_queues(&ctrl->ctrl);
blk_mq_tagset_busy_iter(&ctrl->tag_set,
@@ -2954,6 +2954,11 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
out_delete_hw_queues:
nvme_fc_delete_hw_io_queues(ctrl);
out_cleanup_tagset:
+ /*
+ * In CONNECTING state ctrl->ioerr_work will abort both admin
+ * and io tagsets. Cancel it first before removing io tagset.
+ */
+ cancel_work_sync(&ctrl->ioerr_work);
nvme_remove_io_tag_set(&ctrl->ctrl);
nvme_fc_free_io_queues(ctrl);
--
2.52.0
* [PATCH v3 16/21] nvmet: Add support for CQT to nvme target
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (14 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 15/21] nvme-fc: Do not cancel requests in io tagset before it is initialized Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 17/21] nvme: Add support for CQT to nvme host Mohamed Khalfella
` (5 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
Time) which is used along with KATO (Keep Alive Timeout) to set an upper
time limit for attempting Cross-Controller Recovery. CQT is added as a
subsystem attribute that defaults to 0 to maintain the current behavior.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 1 +
drivers/nvme/target/configfs.c | 31 +++++++++++++++++++++++++++++++
drivers/nvme/target/core.c | 3 +++
drivers/nvme/target/nvmet.h | 2 ++
include/linux/nvme.h | 5 ++++-
5 files changed, 41 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 925a81979278..5077a9ddba44 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -743,6 +743,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->cntlid = cpu_to_le16(ctrl->cntlid);
id->ver = cpu_to_le32(ctrl->subsys->ver);
if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ id->cqt = cpu_to_le16(ctrl->cqt);
id->ciu = ctrl->ciu;
id->cirn = cpu_to_le64(ctrl->cirn);
id->ccrl = NVMF_CCR_LIMIT;
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 127dae51fec1..c9b7a2eeeee5 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -1637,6 +1637,36 @@ static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
#endif
+static ssize_t nvmet_subsys_attr_cqt_show(struct config_item *item,
+ char *page)
+{
+ return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->cqt);
+}
+
+static ssize_t nvmet_subsys_attr_cqt_store(struct config_item *item,
+ const char *page, size_t cnt)
+{
+ struct nvmet_subsys *subsys = to_subsys(item);
+ struct nvmet_ctrl *ctrl;
+ u16 cqt;
+
+ if (sscanf(page, "%hu\n", &cqt) != 1)
+ return -EINVAL;
+
+ down_write(&nvmet_config_sem);
+ if (subsys->cqt == cqt)
+ goto out;
+
+ subsys->cqt = cqt;
+ /* Force reconnect */
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+ ctrl->ops->delete_ctrl(ctrl);
+out:
+ up_write(&nvmet_config_sem);
+ return cnt;
+}
+CONFIGFS_ATTR(nvmet_subsys_, attr_cqt);
+
static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
char *page)
{
@@ -1677,6 +1707,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = {
&nvmet_subsys_attr_attr_vendor_id,
&nvmet_subsys_attr_attr_subsys_vendor_id,
&nvmet_subsys_attr_attr_model,
+ &nvmet_subsys_attr_attr_cqt,
&nvmet_subsys_attr_attr_qid_max,
&nvmet_subsys_attr_attr_ieee_oui,
&nvmet_subsys_attr_attr_firmware,
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index a9f8a2242703..886083bb7a83 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1718,6 +1718,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
ctrl->cntlid = ret;
if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ ctrl->cqt = subsys->cqt;
ctrl->ciu = get_random_u8() ? : 1;
ctrl->cirn = get_random_u64();
}
@@ -1958,10 +1959,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn,
switch (type) {
case NVME_NQN_NVME:
+ subsys->cqt = NVMF_CQT_MS;
subsys->max_qid = NVMET_NR_QUEUES;
break;
case NVME_NQN_DISC:
case NVME_NQN_CURR:
+ subsys->cqt = 0;
subsys->max_qid = 0;
break;
default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 0ed41a3d0562..00528feeb3cd 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -265,6 +265,7 @@ struct nvmet_ctrl {
uuid_t hostid;
u16 cntlid;
+ u16 cqt;
u8 ciu;
u32 kato;
u64 cirn;
@@ -342,6 +343,7 @@ struct nvmet_subsys {
#ifdef CONFIG_NVME_TARGET_DEBUGFS
struct dentry *debugfs_dir;
#endif
+ u16 cqt;
u16 max_qid;
u64 ver;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index fc33ae48d149..f6d66dadc5b1 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -21,6 +21,7 @@
#define NVMF_TRADDR_SIZE 256
#define NVMF_TSAS_SIZE 256
+#define NVMF_CQT_MS 0
#define NVMF_CCR_LIMIT 4
#define NVMF_CCR_PER_PAGE 511
@@ -368,7 +369,9 @@ struct nvme_id_ctrl {
__u8 anacap;
__le32 anagrpmax;
__le32 nanagrpid;
- __u8 rsvd352[160];
+ __u8 rsvd352[34];
+ __le16 cqt;
+ __u8 rsvd388[124];
__u8 sqes;
__u8 cqes;
__le16 maxcmd;
--
2.52.0
* [PATCH v3 17/21] nvme: Add support for CQT to nvme host
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (15 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 16/21] nvmet: Add support for CQT to nvme target Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT Mohamed Khalfella
` (4 subsequent siblings)
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
Time), which is used along with KATO (Keep Alive Timeout) to set an
upper time limit for attempting Cross-Controller Recovery. Add
ctrl->cqt and read its value from the controller identify response.
Update the fence timeout to account for ctrl->cqt.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/nvme.h | 5 +++--
drivers/nvme/host/sysfs.c | 2 ++
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a9fcde1b411b..0680d05900c1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3739,6 +3739,7 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->ciu = id->ciu;
ctrl->cirn = le64_to_cpu(id->cirn);
atomic_set(&ctrl->ccr_limit, id->ccrl);
+ ctrl->cqt = le16_to_cpu(id->cqt);
ctrl->oacs = le16_to_cpu(id->oacs);
ctrl->oncs = le16_to_cpu(id->oncs);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index af6a4e83053e..a7f382e35821 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -336,6 +336,7 @@ struct nvme_ctrl {
u32 max_zone_append;
#endif
u16 crdt[3];
+ u16 cqt;
u16 oncs;
u8 dmrl;
u8 ciu;
@@ -1245,8 +1246,8 @@ static inline bool nvme_multi_css(struct nvme_ctrl *ctrl)
static inline unsigned long nvme_fence_timeout_ms(struct nvme_ctrl *ctrl)
{
if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
- return 3 * ctrl->kato * 1000;
- return 2 * ctrl->kato * 1000;
+ return 3 * ctrl->kato * 1000 + ctrl->cqt;
+ return 2 * ctrl->kato * 1000 + ctrl->cqt;
}
#endif /* _NVME_H */
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 1e4261144933..0234e11730bb 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -387,6 +387,7 @@ nvme_show_int_function(numa_node);
nvme_show_int_function(queue_count);
nvme_show_int_function(sqsize);
nvme_show_int_function(kato);
+nvme_show_int_function(cqt);
static ssize_t nvme_sysfs_ciu_show(struct device *dev,
struct device_attribute *attr,
@@ -759,6 +760,7 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_sqsize.attr,
&dev_attr_ciu.attr,
&dev_attr_cirn.attr,
+ &dev_attr_cqt.attr,
&dev_attr_hostnqn.attr,
&dev_attr_hostid.attr,
&dev_attr_ctrl_loss_tmo.attr,
--
2.52.0
* [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (16 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 17/21] nvme: Add support for CQT to nvme host Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:54 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure Mohamed Khalfella
` (3 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
TP8028 Rapid Path Failure Recovery does not define how much time the
host should wait for a CCR operation to complete. It is reasonable to
assume that a CCR operation can take up to ctrl->cqt. Update the wait
time for a CCR operation to max(ctrl->cqt, ctrl->kato * 1000)
milliseconds, since CQT is in milliseconds and KATO in seconds.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0680d05900c1..ff479c0263ab 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
if (result & 0x01) /* Immediate Reset Successful */
goto out;
- tmo = secs_to_jiffies(ictrl->kato);
+ tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
ret = -ETIMEDOUT;
goto out;
--
2.52.0
* [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (17 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-16 12:56 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 20/21] nvme-rdma: " Mohamed Khalfella
` (2 subsequent siblings)
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
If a CCR operation fails and CQT is supported, we must defer the retry
of inflight requests per TP4129. Update ctrl->fencing_work to schedule
ctrl->fenced_work, effectively extending the FENCING state. This delay
ensures that inflight requests are held until it is safe for them to be
retried.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 229cfdffd848..054e8a350d75 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -194,6 +194,7 @@ struct nvme_tcp_ctrl {
struct nvme_ctrl ctrl;
struct work_struct fencing_work;
+ struct delayed_work fenced_work;
struct work_struct err_work;
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
@@ -2477,6 +2478,18 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
nvme_tcp_reconnect_or_remove(ctrl, ret);
}
+static void nvme_tcp_fenced_work(struct work_struct *work)
+{
+ struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
+ struct nvme_tcp_ctrl, fenced_work);
+ struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+
+ dev_info(ctrl->device, "Time-based recovery finished\n");
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
+}
+
static void nvme_tcp_fencing_work(struct work_struct *work)
{
struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
@@ -2485,23 +2498,40 @@ static void nvme_tcp_fencing_work(struct work_struct *work)
unsigned long rem;
rem = nvme_fence_ctrl(ctrl);
- if (rem) {
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
dev_info(ctrl->device,
- "CCR failed, skipping time-based recovery\n");
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
}
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
+ return;
+
+done:
nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
}
+static void nvme_tcp_flush_fencing_works(struct nvme_ctrl *ctrl)
+{
+ flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
+ flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
+}
+
static void nvme_tcp_error_recovery_work(struct work_struct *work)
{
struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
struct nvme_tcp_ctrl, err_work);
struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
- flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
+ nvme_tcp_flush_fencing_works(ctrl);
if (nvme_tcp_key_revoke_needed(ctrl))
nvme_auth_revoke_tls_key(ctrl);
nvme_stop_keep_alive(ctrl);
@@ -2544,7 +2574,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
container_of(work, struct nvme_ctrl, reset_work);
int ret;
- flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
+ nvme_tcp_flush_fencing_works(ctrl);
if (nvme_tcp_key_revoke_needed(ctrl))
nvme_auth_revoke_tls_key(ctrl);
nvme_stop_ctrl(ctrl);
@@ -2934,6 +2964,7 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->connect_work,
nvme_tcp_reconnect_ctrl_work);
INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
+ INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 20/21] nvme-rdma: Extend FENCING state per TP4129 on CCR failure
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (18 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 21/21] nvme-fc: " Mohamed Khalfella
2026-04-01 13:33 ` [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Achkinazi, Igor
21 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
If a CCR operation fails and CQT is supported, we must defer the retry
of inflight requests per TP4129. Update ctrl->fencing_work to schedule
ctrl->fenced_work, effectively extending the FENCING state. This delay
ensures that inflight requests are held until it is safe for them to be
retried.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/rdma.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 2fb47f41215f..4f48780c3b19 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -107,6 +107,7 @@ struct nvme_rdma_ctrl {
/* other member variables */
struct blk_mq_tag_set tag_set;
struct work_struct fencing_work;
+ struct delayed_work fenced_work;
struct work_struct err_work;
struct nvme_rdma_qe async_event_sqe;
@@ -1121,6 +1122,18 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
nvme_rdma_reconnect_or_remove(ctrl, ret);
}
+static void nvme_rdma_fenced_work(struct work_struct *work)
+{
+ struct nvme_rdma_ctrl *rdma_ctrl = container_of(to_delayed_work(work),
+ struct nvme_rdma_ctrl, fenced_work);
+ struct nvme_ctrl *ctrl = &rdma_ctrl->ctrl;
+
+ dev_info(ctrl->device, "Time-based recovery finished\n");
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
+}
+
static void nvme_rdma_fencing_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *rdma_ctrl = container_of(work,
@@ -1129,22 +1142,39 @@ static void nvme_rdma_fencing_work(struct work_struct *work)
unsigned long rem;
rem = nvme_fence_ctrl(ctrl);
- if (rem) {
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
dev_info(ctrl->device,
- "CCR failed, skipping time-based recovery\n");
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
}
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ queue_delayed_work(nvme_wq, &rdma_ctrl->fenced_work, rem);
+ return;
+
+done:
nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
}
+static void nvme_rdma_flush_fencing_works(struct nvme_rdma_ctrl *ctrl)
+{
+ flush_work(&ctrl->fencing_work);
+ flush_delayed_work(&ctrl->fenced_work);
+}
+
static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, err_work);
- flush_work(&ctrl->fencing_work);
+ nvme_rdma_flush_fencing_works(ctrl);
nvme_stop_keep_alive(&ctrl->ctrl);
flush_work(&ctrl->ctrl.async_event_work);
nvme_rdma_teardown_io_queues(ctrl, false);
@@ -2197,7 +2227,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
int ret;
- flush_work(&ctrl->fencing_work);
+ nvme_rdma_flush_fencing_works(ctrl);
nvme_stop_ctrl(&ctrl->ctrl);
nvme_rdma_shutdown_ctrl(ctrl, false);
@@ -2311,6 +2341,7 @@ static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->reconnect_work,
nvme_rdma_reconnect_ctrl_work);
INIT_WORK(&ctrl->fencing_work, nvme_rdma_fencing_work);
+ INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_rdma_fenced_work);
INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* [PATCH v3 21/21] nvme-fc: Extend FENCING state per TP4129 on CCR failure
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (19 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 20/21] nvme-rdma: " Mohamed Khalfella
@ 2026-02-14 4:25 ` Mohamed Khalfella
2026-02-28 1:20 ` James Smart
2026-04-01 13:33 ` [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Achkinazi, Igor
21 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-14 4:25 UTC (permalink / raw)
To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, Mohamed Khalfella
If a CCR operation fails and CQT is supported, we must defer the retry
of inflight requests per TP4129. Update ctrl->fencing_work to schedule
ctrl->fenced_work, effectively extending the FENCING state. This delay
ensures that inflight requests are held until it is safe for them to be
retried.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index eac3a7ccaa5c..81088a4ce298 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -167,6 +167,7 @@ struct nvme_fc_ctrl {
struct blk_mq_tag_set tag_set;
struct work_struct fencing_work;
+ struct delayed_work fenced_work;
struct work_struct ioerr_work;
struct delayed_work connect_work;
@@ -1878,6 +1879,18 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
return ret;
}
+static void nvme_fc_fenced_work(struct work_struct *work)
+{
+ struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
+ struct nvme_fc_ctrl, fenced_work);
+ struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
+
+ dev_info(ctrl->device, "Time-based recovery finished\n");
+ nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+ if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
+}
+
static void nvme_fc_fencing_work(struct work_struct *work)
{
struct nvme_fc_ctrl *fc_ctrl =
@@ -1886,16 +1899,33 @@ static void nvme_fc_fencing_work(struct work_struct *work)
unsigned long rem;
rem = nvme_fence_ctrl(ctrl);
- if (rem) {
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
dev_info(ctrl->device,
- "CCR failed, skipping time-based recovery\n");
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
}
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ queue_delayed_work(nvme_wq, &fc_ctrl->fenced_work, rem);
+ return;
+
+done:
nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
}
+static void nvme_fc_flush_fencing_works(struct nvme_fc_ctrl *ctrl)
+{
+ flush_work(&ctrl->fencing_work);
+ flush_delayed_work(&ctrl->fenced_work);
+}
+
static void
nvme_fc_ctrl_ioerr_work(struct work_struct *work)
{
@@ -1917,7 +1947,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
return;
}
- flush_work(&ctrl->fencing_work);
+ nvme_fc_flush_fencing_works(ctrl);
nvme_fc_error_recovery(ctrl);
}
@@ -3396,7 +3426,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
struct nvme_fc_ctrl *ctrl =
container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
- flush_work(&ctrl->fencing_work);
+ nvme_fc_flush_fencing_works(ctrl);
nvme_stop_ctrl(&ctrl->ctrl);
/* will block while waiting for io to terminate */
@@ -3573,6 +3603,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
+ INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_fc_fenced_work);
INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);
--
2.52.0
^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states
2026-02-14 4:25 ` [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
@ 2026-02-16 12:33 ` Hannes Reinecke
0 siblings, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:33 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> FENCING is a new controller state that a LIVE controller enters when an
> error is encountered. While in the FENCING state, inflight IOs that time
> out are not canceled because they should be held until either CCR
> succeeds or time-based recovery completes. While the queues remain
> alive, requests are not allowed to be sent in this state and the
> controller cannot be reset or deleted. This is intentional because
> resetting or deleting the controller results in canceling inflight IOs.
>
> FENCED is a short-term state the controller enters before it is reset.
> It exists only to prevent manual resets from happening while controller
> is in FENCING state.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/core.c | 27 +++++++++++++++++++++++++--
> drivers/nvme/host/nvme.h | 4 ++++
> drivers/nvme/host/sysfs.c | 2 ++
> 3 files changed, 31 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 08/21] nvme: Implement cross-controller reset recovery
2026-02-14 4:25 ` [PATCH v3 08/21] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2026-02-16 12:41 ` Hannes Reinecke
2026-02-17 18:35 ` Mohamed Khalfella
2026-02-26 2:37 ` Randy Jennings
1 sibling, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:41 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> A host that has more than one path connecting to an nvme subsystem
> typically has an nvme controller associated with every path. This is
> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> path should not be retried immediately on another path because this
> could lead to data corruption as described in TP4129. TP8028 defines a
> cross-controller reset mechanism that the host can use to terminate
> IOs on the failed path using one of the remaining healthy paths. Only
> after IOs are terminated, or enough time has passed as defined by
> TP4129, should inflight IOs be retried on another path. Implement core
> cross-controller reset shared logic to be used by the transports.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/constants.c | 1 +
> drivers/nvme/host/core.c | 141 ++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 9 +++
> 3 files changed, 151 insertions(+)
>
> diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> index dc90df9e13a2..f679efd5110e 100644
> --- a/drivers/nvme/host/constants.c
> +++ b/drivers/nvme/host/constants.c
> @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> [nvme_admin_virtual_mgmt] = "Virtual Management",
> [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> [nvme_admin_format_nvm] = "Format NVM",
> [nvme_admin_security_send] = "Security Send",
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 231d402e9bfb..765b1524b3ed 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -554,6 +554,146 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> }
> EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
>
> +static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
> + u32 min_cntlid)
> +{
> + struct nvme_subsystem *subsys = ictrl->subsys;
> + struct nvme_ctrl *ctrl, *sctrl = NULL;
> + unsigned long flags;
> +
> + mutex_lock(&nvme_subsystems_lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + if (ctrl->cntlid < min_cntlid)
> + continue;
> +
> + if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
> + continue;
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + if (ctrl->state != NVME_CTRL_LIVE) {
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> + atomic_inc(&ctrl->ccr_limit);
> + continue;
> + }
> +
> + /*
> + * We got a good candidate source controller that is locked and
> + * LIVE. However, no guarantee ctrl will not be deleted after
> + * ctrl->lock is released. Get a ref of both ctrl and admin_q
> + * so they do not disappear until we are done with them.
> + */
> + WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
> + nvme_get_ctrl(ctrl);
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> + sctrl = ctrl;
> + break;
> + }
> + mutex_unlock(&nvme_subsystems_lock);
> + return sctrl;
> +}
> +
> +static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
> +{
> + atomic_inc(&sctrl->ccr_limit);
> + blk_put_queue(sctrl->admin_q);
> + nvme_put_ctrl(sctrl);
> +}
> +
> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> +{
> + struct nvme_ccr_entry ccr = { };
> + union nvme_result res = { 0 };
> + struct nvme_command c = { };
> + unsigned long flags, tmo;
> + bool completed = false;
> + int ret = 0;
> + u32 result;
> +
> + init_completion(&ccr.complete);
> + ccr.ictrl = ictrl;
> +
> + spin_lock_irqsave(&sctrl->lock, flags);
> + list_add_tail(&ccr.list, &sctrl->ccr_list);
> + spin_unlock_irqrestore(&sctrl->lock, flags);
> +
> + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> + c.ccr.ciu = ictrl->ciu;
> + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> + NULL, 0, NVME_QID_ANY, 0);
> + if (ret) {
> + ret = -EIO;
> + goto out;
> + }
> +
> + result = le32_to_cpu(res.u32);
> + if (result & 0x01) /* Immediate Reset Successful */
> + goto out;
> +
> + tmo = secs_to_jiffies(ictrl->kato);
> + if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> + ret = -ETIMEDOUT;
> + goto out;
> + }
> +
That will be tricky. The 'ccr' command will be sent with the default
command queue timeout, which is decoupled from KATO.
So you really should set the command timeout for the 'ccr' command
to ctrl->kato to ensure it'll be terminated correctly.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-14 4:25 ` [PATCH v3 09/21] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2026-02-16 12:43 ` Hannes Reinecke
2026-02-17 18:25 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:43 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> An nvme source controller that issues a CCR command expects to receive
> an NVME_AER_NOTICE_CCR_COMPLETED when a pending CCR succeeds or fails.
> Add sctrl->ccr_work to read the NVME_LOG_CCR logpage and wake up any
> thread waiting on CCR completion.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/host/nvme.h | 1 +
> 2 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 765b1524b3ed..a9fcde1b411b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1916,7 +1916,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
>
> #define NVME_AEN_SUPPORTED \
> (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
> - NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
> + NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
> + NVME_AEN_CFG_DISC_CHANGE)
>
> static void nvme_enable_aen(struct nvme_ctrl *ctrl)
> {
> @@ -4880,6 +4881,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
> kfree(log);
> }
>
> +static void nvme_ccr_work(struct work_struct *work)
> +{
> + struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
> + struct nvme_ccr_entry *ccr;
> + struct nvme_ccr_log_entry *entry;
> + struct nvme_ccr_log *log;
> + unsigned long flags;
> + int ret, i;
> +
> + log = kmalloc(sizeof(*log), GFP_KERNEL);
> + if (!log)
> + return;
> +
> + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> + 0x00, log, sizeof(*log), 0);
> + if (ret)
> + goto out;
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + for (i = 0; i < le16_to_cpu(log->ne); i++) {
> + entry = &log->entries[i];
> + if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
> + continue;
> +
> + list_for_each_entry(ccr, &ctrl->ccr_list, list) {
> + struct nvme_ctrl *ictrl = ccr->ictrl;
> +
> + if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
> + ictrl->ciu != entry->ciu)
> + continue;
> +
> + /* Complete matching entry */
> + ccr->ccrs = entry->ccrs;
> + complete(&ccr->complete);
> + }
> + }
> + spin_unlock_irqrestore(&ctrl->lock, flags);
> +out:
> + kfree(log);
> +}
> +
> static void nvme_fw_act_work(struct work_struct *work)
> {
> struct nvme_ctrl *ctrl = container_of(work,
> @@ -4956,6 +4998,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
> case NVME_AER_NOTICE_DISC_CHANGED:
> ctrl->aen_result = result;
> break;
> + case NVME_AER_NOTICE_CCR_COMPLETED:
> + queue_work(nvme_wq, &ctrl->ccr_work);
> + break;
> default:
> dev_warn(ctrl->device, "async event result %08x\n", result);
> }
> @@ -5145,6 +5190,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> nvme_stop_failfast_work(ctrl);
> flush_work(&ctrl->async_event_work);
> cancel_work_sync(&ctrl->fw_act_work);
> + cancel_work_sync(&ctrl->ccr_work);
> if (ctrl->ops->stop_ctrl)
> ctrl->ops->stop_ctrl(ctrl);
> }
> @@ -5268,6 +5314,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> ctrl->quirks = quirks;
> ctrl->numa_node = NUMA_NO_NODE;
> INIT_WORK(&ctrl->scan_work, nvme_scan_work);
> + INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
> INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
> INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
> INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index f3ab9411cac5..af6a4e83053e 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -365,6 +365,7 @@ struct nvme_ctrl {
> struct nvme_effects_log *effects;
> struct xarray cels;
> struct work_struct scan_work;
> + struct work_struct ccr_work;
> struct work_struct async_event_work;
> struct delayed_work ka_work;
> struct delayed_work failfast_work;
We really would need some indicator of whether 'ccr' is supported at all.
Using the number of available CCR commands would be an option, though
that would require us to keep two counters (one for the number of
possible outstanding CCRs, and one for the number of actually
outstanding CCRs).
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error
2026-02-14 4:25 ` [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-16 12:47 ` Hannes Reinecke
0 siblings, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:47 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now moves to the FENCING
> state instead of the RESETTING state. ctrl->fencing_work attempts CCR
> to terminate inflight IOs. Regardless of the success or failure of the
> CCR operation, the controller is transitioned to the RESETTING state to
> continue the error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/tcp.c | 32 +++++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 69cb04406b47..229cfdffd848 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -193,6 +193,7 @@ struct nvme_tcp_ctrl {
> struct sockaddr_storage src_addr;
> struct nvme_ctrl ctrl;
>
> + struct work_struct fencing_work;
> struct work_struct err_work;
> struct delayed_work connect_work;
> struct nvme_tcp_request async_req;
> @@ -611,6 +612,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> + dev_warn(ctrl->device, "starting controller fencing\n");
> + queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> + return;
> + }
> +
> if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> return;
>
> @@ -2470,12 +2477,31 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> nvme_tcp_reconnect_or_remove(ctrl, ret);
> }
>
> +static void nvme_tcp_fencing_work(struct work_struct *work)
> +{
> + struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> + struct nvme_tcp_ctrl, fencing_work);
> + struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> + unsigned long rem;
> +
> + rem = nvme_fence_ctrl(ctrl);
> + if (rem) {
> + dev_info(ctrl->device,
> + "CCR failed, skipping time-based recovery\n");
> + }
> +
> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}
> +
> static void nvme_tcp_error_recovery_work(struct work_struct *work)
> {
> struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> if (nvme_tcp_key_revoke_needed(ctrl))
> nvme_auth_revoke_tls_key(ctrl);
> nvme_stop_keep_alive(ctrl);
> @@ -2518,6 +2544,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> container_of(work, struct nvme_ctrl, reset_work);
> int ret;
>
> + flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> if (nvme_tcp_key_revoke_needed(ctrl))
> nvme_auth_revoke_tls_key(ctrl);
> nvme_stop_ctrl(ctrl);
> @@ -2643,13 +2670,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
> struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
> struct nvme_command *cmd = &pdu->cmd;
> int qid = nvme_tcp_queue_id(req->queue);
> + enum nvme_ctrl_state state;
>
> dev_warn(ctrl->device,
> "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
> rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
> nvme_fabrics_opcode_str(qid, cmd), qid);
>
> - if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> + state = nvme_ctrl_state(ctrl);
> + if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
> /*
> * If we are resetting, connecting or deleting we should
> * complete immediately because we may block controller
> @@ -2904,6 +2933,7 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
>
> INIT_DELAYED_WORK(&ctrl->connect_work,
> nvme_tcp_reconnect_ctrl_work);
> + INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
> INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
> INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
>
I still would love to have the 'FENCING/FENCED' state handled in the
generic code, but that would require quite some twiddling with the
transport-specific error handling. So probably not for this round.
Other than that:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 11/21] nvme-rdma: Use CCR to recover controller that hits an error
2026-02-14 4:25 ` [PATCH v3 11/21] nvme-rdma: " Mohamed Khalfella
@ 2026-02-16 12:47 ` Hannes Reinecke
0 siblings, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:47 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now moves to the FENCING
> state instead of the RESETTING state. ctrl->fencing_work attempts CCR
> to terminate inflight IOs. Regardless of the success or failure of the
> CCR operation, the controller is transitioned to the RESETTING state to
> continue the error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/rdma.c | 32 +++++++++++++++++++++++++++++++-
> 1 file changed, 31 insertions(+), 1 deletion(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-14 4:25 ` [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT Mohamed Khalfella
@ 2026-02-16 12:54 ` Hannes Reinecke
2026-02-16 18:45 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:54 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> TP8028 Rapid Path Failure Recovery does not define how much time the
> host should wait for CCR operation to complete. It is reasonable to
> assume that CCR operation can take up to ctrl->cqt. Update wait time for
> CCR operation to be max(ctrl->cqt, ctrl->kato).
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0680d05900c1..ff479c0263ab 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> if (result & 0x01) /* Immediate Reset Successful */
> goto out;
>
> - tmo = secs_to_jiffies(ictrl->kato);
> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> ret = -ETIMEDOUT;
> goto out;
That is not my understanding. I was under the impression that CQT is the
_additional_ time a controller requires to clear out outstanding
commands once it detected a loss of communication (ie _after_ KATO).
Which would mean we have to wait for up to
(ctrl->kato * 1000) + ctrl->cqt.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure
2026-02-14 4:25 ` [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure Mohamed Khalfella
@ 2026-02-16 12:56 ` Hannes Reinecke
2026-02-17 17:58 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-16 12:56 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/14/26 05:25, Mohamed Khalfella wrote:
> If CCR operations fail and CQT is supported, we must defer the retry of
> inflight requests per TP4129. Update ctrl->fencing_work to schedule
> ctrl->fenced_work, effectively extending the FENCING state. This delay
> ensures that inflight requests are held until it is safe for them to be
> retried.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++----
> 1 file changed, 35 insertions(+), 4 deletions(-)
>
Can't you merge / integrate this into the nvme_fence_ctrl() routine?
The previous patch already extended the timeout to cover for CQT, so
we can just wait for the timeout if CCR failed, no?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-16 12:54 ` Hannes Reinecke
@ 2026-02-16 18:45 ` Mohamed Khalfella
2026-02-17 7:09 ` Hannes Reinecke
0 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-16 18:45 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > TP8028 Rapid Path Failure Recovery does not define how much time the
> > host should wait for CCR operation to complete. It is reasonable to
> > assume that CCR operation can take up to ctrl->cqt. Update wait time for
> > CCR operation to be max(ctrl->cqt, ctrl->kato).
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/core.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 0680d05900c1..ff479c0263ab 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > if (result & 0x01) /* Immediate Reset Successful */
> > goto out;
> >
> > - tmo = secs_to_jiffies(ictrl->kato);
> > + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> > if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > ret = -ETIMEDOUT;
> > goto out;
>
> That is not my understanding. I was under the impression that CQT is the
> _additional_ time a controller requires to clear out outstanding
> commands once it detected a loss of communication (ie _after_ KATO).
> Which would mean we have to wait for up to
> (ctrl->kato * 1000) + ctrl->cqt.
At this point the source controller knows about the communication loss,
so we do not need the KATO wait. In theory we should just wait for CQT.
max(cqt, kato) is a conservative guess I made.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare@suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-16 18:45 ` Mohamed Khalfella
@ 2026-02-17 7:09 ` Hannes Reinecke
2026-02-17 15:35 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-17 7:09 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On 2/16/26 19:45, Mohamed Khalfella wrote:
> On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
>> On 2/14/26 05:25, Mohamed Khalfella wrote:
>>> TP8028 Rapid Path Failure Recovery does not define how much time the
>>> host should wait for CCR operation to complete. It is reasonable to
>>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
>>> CCR operation to be max(ctrl->cqt, ctrl->kato).
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>> drivers/nvme/host/core.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 0680d05900c1..ff479c0263ab 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
>>> if (result & 0x01) /* Immediate Reset Successful */
>>> goto out;
>>>
>>> - tmo = secs_to_jiffies(ictrl->kato);
>>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
>>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
>>> ret = -ETIMEDOUT;
>>> goto out;
>>
>> That is not my understanding. I was under the impression that CQT is the
>> _additional_ time a controller requires to clear out outstanding
>> commands once it detected a loss of communication (ie _after_ KATO).
>> Which would mean we have to wait for up to
>> (ctrl->kato * 1000) + ctrl->cqt.
>
> At this point the source controller knows about communication loss. We
> do not need kato wait. In theory we should just wait for CQT.
> max(cqt, kato) is a conservative guess I made.
>
Not quite. The source controller (on the host!) knows about the
communication loss. But the target might not, as the keep-alive
command might have arrived at the target _just_ before KATO
triggered on the host. So the target is still good, and will
be waiting for _another_ KATO interval before declaring
a loss of communication.
And only then will the CQT period start at the target.
Randy, please correct me if I'm wrong ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-17 7:09 ` Hannes Reinecke
@ 2026-02-17 15:35 ` Mohamed Khalfella
2026-02-20 1:22 ` James Smart
2026-02-20 2:01 ` Randy Jennings
0 siblings, 2 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-17 15:35 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
> On 2/16/26 19:45, Mohamed Khalfella wrote:
> > On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
> >> On 2/14/26 05:25, Mohamed Khalfella wrote:
> >>> TP8028 Rapid Path Failure Recovery does not define how much time the
> >>> host should wait for CCR operation to complete. It is reasonable to
> >>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
> >>> CCR operation to be max(ctrl->cqt, ctrl->kato).
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>> drivers/nvme/host/core.c | 2 +-
> >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>> index 0680d05900c1..ff479c0263ab 100644
> >>> --- a/drivers/nvme/host/core.c
> >>> +++ b/drivers/nvme/host/core.c
> >>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> >>> if (result & 0x01) /* Immediate Reset Successful */
> >>> goto out;
> >>>
> >>> - tmo = secs_to_jiffies(ictrl->kato);
> >>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> >>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> >>> ret = -ETIMEDOUT;
> >>> goto out;
> >>
> >> That is not my understanding. I was under the impression that CQT is the
> >> _additional_ time a controller requires to clear out outstanding
> >> commands once it detected a loss of communication (ie _after_ KATO).
> >> Which would mean we have to wait for up to
> >> (ctrl->kato * 1000) + ctrl->cqt.
> >
> > At this point the source controller knows about communication loss. We
> > do not need kato wait. In theory we should just wait for CQT.
> > max(cqt, kato) is a conservative guess I made.
> >
> Not quite. The source controller (on the host!) knows about the
> communication loss. But the target might not, as the keep-alive
> command might have arrived at the target _just_ before KATO
> triggered on the host. So the target is still good, and will
> be waiting for _another_ KATO interval before declaring
> a loss of communication.
> And only then will the CQT period start at the target.
>
> Randy, please correct me if I'm wrong ...
>
wait_for_completion_timeout(&ccr.complete, tmo) waits for the CCR
operation to complete. The wait starts after the CCR command has
completed successfully. IOW, it starts after the host received a CQE
from the source controller on the target telling us all is good. If the
source controller on the target already knows about the loss of
communication then there is no need to wait for KATO. We just need to
wait for the CCR operation to finish because we know it has been
started successfully.
The spec does not tell us how much time to wait for the CCR operation
to complete. max(cqt, kato) is an estimate I think is reasonable to
make.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure
2026-02-16 12:56 ` Hannes Reinecke
@ 2026-02-17 17:58 ` Mohamed Khalfella
2026-02-18 8:26 ` Hannes Reinecke
0 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-17 17:58 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Mon 2026-02-16 13:56:10 +0100, Hannes Reinecke wrote:
> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > If CCR operations fail and CQT is supported, we must defer the retry of
> > inflight requests per TP4129. Update ctrl->fencing_work to schedule
> > ctrl->fenced_work, effectively extending the FENCING state. This delay
> > ensures that inflight requests are held until it is safe for them to be
> > retried.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++----
> > 1 file changed, 35 insertions(+), 4 deletions(-)
> >
> Can't you merge / integrate this into the nvme_fence_ctrl() routine?
ctrl->fencing_work and ctrl->fenced_work live in the transport-specific
controller structure, struct nvme_tcp_ctrl in this case. There is no
easy way to access these members from nvme_fence_ctrl(). One option to
work around that is to move them into struct nvme_ctrl. But we call
error recovery after a controller is fenced, and error recovery is
implemented in a transport-specific way. That is why the delay is
implemented/repeated for every transport.
> The previous patch already extended the timeout to cover for CQT, so
> we can just wait for the timeout if CCR failed, no?
Following on the point above, one change that could be made is to reset
the controller after fencing finishes instead of using error recovery.
This way everything lives in core.c. But I have not tested that.
Do you think this is better than what has been implemented now?
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-16 12:43 ` Hannes Reinecke
@ 2026-02-17 18:25 ` Mohamed Khalfella
2026-02-18 7:51 ` Hannes Reinecke
0 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-17 18:25 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:
> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > An nvme source controller that issues CCR command expects to receive an
> > NVME_AER_NOTICE_CCR_COMPLETED when pending CCR succeeds or fails. Add
> > sctrl->ccr_work to read NVME_LOG_CCR logpage and wakeup any thread
> > waiting on CCR completion.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
> > drivers/nvme/host/nvme.h | 1 +
> > 2 files changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 765b1524b3ed..a9fcde1b411b 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1916,7 +1916,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
> >
> > #define NVME_AEN_SUPPORTED \
> > (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
> > - NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
> > + NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
> > + NVME_AEN_CFG_DISC_CHANGE)
> >
> > static void nvme_enable_aen(struct nvme_ctrl *ctrl)
> > {
> > @@ -4880,6 +4881,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
> > kfree(log);
> > }
> >
> > +static void nvme_ccr_work(struct work_struct *work)
> > +{
> > + struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
> > + struct nvme_ccr_entry *ccr;
> > + struct nvme_ccr_log_entry *entry;
> > + struct nvme_ccr_log *log;
> > + unsigned long flags;
> > + int ret, i;
> > +
> > + log = kmalloc(sizeof(*log), GFP_KERNEL);
> > + if (!log)
> > + return;
> > +
> > + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> > + 0x00, log, sizeof(*log), 0);
> > + if (ret)
> > + goto out;
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + for (i = 0; i < le16_to_cpu(log->ne); i++) {
> > + entry = &log->entries[i];
> > + if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
> > + continue;
> > +
> > + list_for_each_entry(ccr, &ctrl->ccr_list, list) {
> > + struct nvme_ctrl *ictrl = ccr->ictrl;
> > +
> > + if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
> > + ictrl->ciu != entry->ciu)
> > + continue;
> > +
> > + /* Complete matching entry */
> > + ccr->ccrs = entry->ccrs;
> > + complete(&ccr->complete);
> > + }
> > + }
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > +out:
> > + kfree(log);
> > +}
> > +
> > static void nvme_fw_act_work(struct work_struct *work)
> > {
> > struct nvme_ctrl *ctrl = container_of(work,
> > @@ -4956,6 +4998,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
> > case NVME_AER_NOTICE_DISC_CHANGED:
> > ctrl->aen_result = result;
> > break;
> > + case NVME_AER_NOTICE_CCR_COMPLETED:
> > + queue_work(nvme_wq, &ctrl->ccr_work);
> > + break;
> > default:
> > dev_warn(ctrl->device, "async event result %08x\n", result);
> > }
> > @@ -5145,6 +5190,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> > nvme_stop_failfast_work(ctrl);
> > flush_work(&ctrl->async_event_work);
> > cancel_work_sync(&ctrl->fw_act_work);
> > + cancel_work_sync(&ctrl->ccr_work);
> > if (ctrl->ops->stop_ctrl)
> > ctrl->ops->stop_ctrl(ctrl);
> > }
> > @@ -5268,6 +5314,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> > ctrl->quirks = quirks;
> > ctrl->numa_node = NUMA_NO_NODE;
> > INIT_WORK(&ctrl->scan_work, nvme_scan_work);
> > + INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
> > INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
> > INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
> > INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index f3ab9411cac5..af6a4e83053e 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -365,6 +365,7 @@ struct nvme_ctrl {
> > struct nvme_effects_log *effects;
> > struct xarray cels;
> > struct work_struct scan_work;
> > + struct work_struct ccr_work;
> > struct work_struct async_event_work;
> > struct delayed_work ka_work;
> > struct delayed_work failfast_work;
>
> We really would need some indicator whether 'ccr' is supported at all.
Why do we need this indicator, other than exporting it via sysfs?
> Using the number of available CCR commands would be an option, if though
> that would require us to keep two counters (one for the number of
> possible outstanding CCRs, and one for the number of actual outstanding
> CCRs.).
As mentioned above, ctrl->ccr_limit gives us the number of CCRs
available now. It is not a 100% reliable indicator of whether CCR is
supported, but it is enough to implement CCR. A second counter could
help us skip trying CCR if we know the impacted controller does not
support it.
Do you think it is worth it?
Iterating over controllers in the subsystem is not that bad IMO. This is
similar to the point raised by James Smart [1].
1- https://lore.kernel.org/all/05875e07-b908-425a-ba6f-5e060e03241e@gmail.com/
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 08/21] nvme: Implement cross-controller reset recovery
2026-02-16 12:41 ` Hannes Reinecke
@ 2026-02-17 18:35 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-17 18:35 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Mon 2026-02-16 13:41:39 +0100, Hannes Reinecke wrote:
> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > A host that has more than one path connecting to an nvme subsystem
> > typically has an nvme controller associated with every path. This is
> > mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> > path should not be retried immediately on another path because this
> > could lead to data corruption as described in TP4129. TP8028 defines
> > cross-controller reset mechanism that can be used by host to terminate
> > IOs on the failed path using one of the remaining healthy paths. Only
> > after IOs are terminated, or long enough time passes as defined by
> > TP4129, inflight IOs should be retried on another path. Implement core
> > cross-controller reset shared logic to be used by the transports.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/constants.c | 1 +
> > drivers/nvme/host/core.c | 141 ++++++++++++++++++++++++++++++++++
> > drivers/nvme/host/nvme.h | 9 +++
> > 3 files changed, 151 insertions(+)
> >
> > diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> > index dc90df9e13a2..f679efd5110e 100644
> > --- a/drivers/nvme/host/constants.c
> > +++ b/drivers/nvme/host/constants.c
> > @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> > [nvme_admin_virtual_mgmt] = "Virtual Management",
> > [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> > [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> > + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> > [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> > [nvme_admin_format_nvm] = "Format NVM",
> > [nvme_admin_security_send] = "Security Send",
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 231d402e9bfb..765b1524b3ed 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -554,6 +554,146 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> > }
> > EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
> >
> > +static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
> > + u32 min_cntlid)
> > +{
> > + struct nvme_subsystem *subsys = ictrl->subsys;
> > + struct nvme_ctrl *ctrl, *sctrl = NULL;
> > + unsigned long flags;
> > +
> > + mutex_lock(&nvme_subsystems_lock);
> > + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > + if (ctrl->cntlid < min_cntlid)
> > + continue;
> > +
> > + if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
> > + continue;
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + if (ctrl->state != NVME_CTRL_LIVE) {
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > + atomic_inc(&ctrl->ccr_limit);
> > + continue;
> > + }
> > +
> > + /*
> > + * We got a good candidate source controller that is locked and
> > + * LIVE. However, no guarantee ctrl will not be deleted after
> > + * ctrl->lock is released. Get a ref of both ctrl and admin_q
> > + * so they do not disappear until we are done with them.
> > + */
> > + WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
> > + nvme_get_ctrl(ctrl);
> > + spin_unlock_irqrestore(&ctrl->lock, flags);
> > + sctrl = ctrl;
> > + break;
> > + }
> > + mutex_unlock(&nvme_subsystems_lock);
> > + return sctrl;
> > +}
> > +
> > +static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
> > +{
> > + atomic_inc(&sctrl->ccr_limit);
> > + blk_put_queue(sctrl->admin_q);
> > + nvme_put_ctrl(sctrl);
> > +}
> > +
> > +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > +{
> > + struct nvme_ccr_entry ccr = { };
> > + union nvme_result res = { 0 };
> > + struct nvme_command c = { };
> > + unsigned long flags, tmo;
> > + bool completed = false;
> > + int ret = 0;
> > + u32 result;
> > +
> > + init_completion(&ccr.complete);
> > + ccr.ictrl = ictrl;
> > +
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + list_add_tail(&ccr.list, &sctrl->ccr_list);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > +
> > + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> > + c.ccr.ciu = ictrl->ciu;
> > + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> > + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> > + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> > + NULL, 0, NVME_QID_ANY, 0);
> > + if (ret) {
> > + ret = -EIO;
> > + goto out;
> > + }
> > +
> > + result = le32_to_cpu(res.u32);
> > + if (result & 0x01) /* Immediate Reset Successful */
> > + goto out;
> > +
> > + tmo = secs_to_jiffies(ictrl->kato);
> > + if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > + ret = -ETIMEDOUT;
> > + goto out;
> > + }
> > +
> That will be tricky. The 'ccr' comand will be sent with the default
> command queue timeout which is decoupled from KATO.
> So you really should set the command timeout for the 'ccr' command
> to ctrl->kato to ensure it'll be terminated correctly.
>
Agreed. The timeout for the CCR request should be ctrl->kato, just like
what we do for the keep-alive request. The easiest way IMO to do this
is to extend __nvme_submit_sync_cmd() to take a request timeout. I do
not want to make this change in this patchset.
Is it okay if I make this change after this patchset gets merged?
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-17 18:25 ` Mohamed Khalfella
@ 2026-02-18 7:51 ` Hannes Reinecke
2026-02-18 12:47 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-18 7:51 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On 2/17/26 19:25, Mohamed Khalfella wrote:
> On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:
[ .. ]
>>
>> We really would need some indicator whether 'ccr' is supported at all.
>
> Why do we need this indicator, other than exporting it via sysfs?
>
To avoid false positives.
>> Using the number of available CCR commands would be an option, if though
>> that would require us to keep two counters (one for the number of
>> possible outstanding CCRs, and one for the number of actual outstanding
>> CCRs.).
>
> Like mentioned above ctrl->ccr_limit gives us the number of ccrs
> available now. It is not 100% indicator if CCR is supported or not, but
> it is enough to implement CCR. A second counter can help us skip trying
> CCR if we know impacted controller does not support it.
>
> Do you think it is worth it?
>
Yes. The problem is that we want to get towards TP8028 compliance, which
forces us to wait for 2*KATO + CQT before requests on the failed path
can be retried. That will cause a _noticeable_ stall on the application
side. And the only way to shorten that is CCR; once we get confirmation
from CCR we can start retrying immediately.
At the same time the current implementation only waits for 1*KATO before
retrying, so there will be a regression if we switch to TP8028-compliant
KATO handling for systems not supporting CCR.
So we can (and should) use CCR as the determining factor for whether we
want to switch to TP8028-compliant behaviour or stick with the original
implementation.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure
2026-02-17 17:58 ` Mohamed Khalfella
@ 2026-02-18 8:26 ` Hannes Reinecke
0 siblings, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-18 8:26 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On 2/17/26 18:58, Mohamed Khalfella wrote:
> On Mon 2026-02-16 13:56:10 +0100, Hannes Reinecke wrote:
>> On 2/14/26 05:25, Mohamed Khalfella wrote:
>>> If CCR operations fail and CQT is supported, we must defer the retry of
>>> inflight requests per TP4129. Update ctrl->fencing_work to schedule
>>> ctrl->fenced_work, effectively extending the FENCING state. This delay
>>> ensures that inflight requests are held until it is safe for them to be
>>> retried.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>> drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++----
>>> 1 file changed, 35 insertions(+), 4 deletions(-)
>>>
>> Can't you merge / integrate this into the nvme_fence_ctrl() routine?
>
> ctrl->fencing_work and ctrl->fenced_work are in transport specific
> controller, struct nvme_tcp_ctrl in this case. There is no easy way to
> access these members from nvme_fence_ctrl(). One option to go around
> that is to move them into struct nvme_ctrl. But we call error recovery
> after a controller is fenced, and error recovery is implemented in
> transport specific way. That is why the delay is implemented/repeated
> for every transport.
>
>> The previous patch already extended the timeout to cover for CQT, so
>> we can just wait for the timeout if CCR failed, no?
>
> Following on the point above. One change can be done is to reset the
> controller after fencing finishes instead of using error recovery.
> This way everything lives in core.c. But I have not tested that.
>
> Do you think this is better than what has been implemented now?
>
Yeah, the eternal problem.
At one point someone will have to explain to me why 'reset' and
'error handling' are two _distinct_ code paths in nvme-tcp.
I really don't get that. I _guess_ it's trying to hold requests
when doing a reset, and aborting requests if it's an error.
But why one needs to make that distinction is a mystery to
me; FC combines both paths and seems to work quite happily.
Thing is, that will get in the way when trying to move fencing
into the generic layer; you only can call 'nvme_reset_ctrl()',
and hope that this one will abort commands.
I'll check.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-18 7:51 ` Hannes Reinecke
@ 2026-02-18 12:47 ` Mohamed Khalfella
2026-02-20 3:34 ` Randy Jennings
0 siblings, 1 reply; 61+ messages in thread
From: Mohamed Khalfella @ 2026-02-18 12:47 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Wed 2026-02-18 08:51:31 +0100, Hannes Reinecke wrote:
> On 2/17/26 19:25, Mohamed Khalfella wrote:
> > On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:
> [ .. ]
> >>
> >> We really would need some indicator whether 'ccr' is supported at all.
> >
> > Why do we need this indicator, other than exporting it via sysfs?
> >
> To avoid false positives.
We will never try CCR on a controller that does not support it. A false
positive of what?
>
> >> Using the number of available CCR commands would be an option, if though
> >> that would require us to keep two counters (one for the number of
> >> possible outstanding CCRs, and one for the number of actual outstanding
> >> CCRs.).
> >
> > Like mentioned above ctrl->ccr_limit gives us the number of ccrs
> > available now. It is not 100% indicator if CCR is supported or not, but
> > it is enough to implement CCR. A second counter can help us skip trying
> > CCR if we know impacted controller does not support it.
> >
> > Do you think it is worth it?
> >
> Yes. The problem is that we want to get towards TP8028 compliance, which
> forces us to wait for 2*KATO + CQT before requests on the failed patch
> can be retried. That will cause a _noticeable_ stall on the application
> side. And the only way to shorten that is CCR; once we get confirmation
> from CCR we can start retrying immediately.
> At the same time the current implementation only waits for 1*KATO before
> retrying, so there will be regression if we switch to TP8028-compliant
> KATO handling for systems not supporting CCR.
The statement above is not correct. Careful consideration and testing
went into not introducing such a regression. If CCR is not supported
then nvme_find_ctrl_ccr() will return NULL and nvme_fence_ctrl() will
return immediately. No CCR command will be sent and there will be no
wait for the AEN.
What happens next depends on whether ictrl->cqt is supported. If it is
not supported, which will be the case for systems in the field today,
then requests will be retried immediately. Requests will not be held in
this case and no delay will be seen in the failover case.
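The decision flow described above can be sketched roughly as follows. This is an illustrative stand-in, not the patchset code: the struct, its fields, and fence_hold_ms() are hypothetical simplifications of what nvme_find_ctrl_ccr()/nvme_fence_ctrl() do.

```c
#include <stddef.h>

/* Simplified stand-ins for the controllers involved. */
struct ctrl {
	unsigned int ccr_limit;	/* CCRs currently available; 0 = unsupported */
	unsigned int cqt;	/* Command Quiesce Time in ms; 0 = unsupported */
};

/* Hold time in ms before inflight requests may be retried. */
static unsigned int fence_hold_ms(const struct ctrl *sctrl,
				  const struct ctrl *ictrl)
{
	/*
	 * No source controller with CCR available: return immediately.
	 * No CCR command is sent and there is no wait for the AEN.
	 */
	if (!sctrl || !sctrl->ccr_limit) {
		/*
		 * Without CQT (systems in the field today) this is 0,
		 * i.e. retry immediately, so failover sees no new delay.
		 */
		return ictrl->cqt;
	}
	/*
	 * CCR path: the real code issues the CCR and waits for the
	 * completion/AEN here; elided in this sketch.
	 */
	return 0;
}
```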
>
> So we can (and should) use CCR as the determining factor whether we
> want to switch to TP8028-compliant behaviour or stick with the original
> implementation.
We do check CCR support and availability in nvme_find_ctrl_ccr(). Adding
a second counter will spare us the loop in nvme_find_ctrl_ccr(), which
is not worth it IMO.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare@suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-17 15:35 ` Mohamed Khalfella
@ 2026-02-20 1:22 ` James Smart
2026-02-20 2:11 ` Randy Jennings
2026-02-20 7:23 ` Hannes Reinecke
2026-02-20 2:01 ` Randy Jennings
1 sibling, 2 replies; 61+ messages in thread
From: James Smart @ 2026-02-20 1:22 UTC (permalink / raw)
To: Mohamed Khalfella, Hannes Reinecke
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, jsmart833426
On 2/17/2026 7:35 AM, Mohamed Khalfella wrote:
> On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
>> On 2/16/26 19:45, Mohamed Khalfella wrote:
>>> On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
>>>> On 2/14/26 05:25, Mohamed Khalfella wrote:
>>>>> TP8028 Rapid Path Failure Recovery does not define how much time the
>>>>> host should wait for CCR operation to complete. It is reasonable to
>>>>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
>>>>> CCR operation to be max(ctrl->cqt, ctrl->kato).
>>>>>
>>>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>>>> ---
>>>>> drivers/nvme/host/core.c | 2 +-
>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 0680d05900c1..ff479c0263ab 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
>>>>> if (result & 0x01) /* Immediate Reset Successful */
>>>>> goto out;
>>>>>
>>>>> - tmo = secs_to_jiffies(ictrl->kato);
>>>>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
>>>>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
>>>>> ret = -ETIMEDOUT;
>>>>> goto out;
>>>>
>>>> That is not my understanding. I was under the impression that CQT is the
>>>> _additional_ time a controller requires to clear out outstanding
>>>> commands once it detected a loss of communication (ie _after_ KATO).
>>>> Which would mean we have to wait for up to
>>>> (ctrl->kato * 1000) + ctrl->cqt.
>>>
>>> At this point the source controller knows about communication loss. We
>>> do not need kato wait. In theory we should just wait for CQT.
>>> max(cqt, kato) is a conservative guess I made.
>>>
>> Not quite. The source controller (on the host!) knows about the
>> communication loss. But the target might not, as the keep-alive
>> command might have arrived at the target _just_ before KATO
>> triggered on the host. So the target is still good, and will
>> be waiting for _another_ KATO interval before declaring
>> a loss of communication.
>> And only then will the CQT period start at the target.
>>
>> Randy, please correct me if I'm wrong ...
>>
>
> wait_for_completion_timeout(&ccr.complete, tmo)) waits for CCR operation
> to complete. The wait starts after CCR command completed successfully.
> IOW, it starts after the host received a CQE from source controller on
> the target telling us all is good. If the source controller on the target
> already know about loss of communication then there is no need to wait
> for KATO. We just need to wait for CCR operation to finish because we
> know it has been started successfully.
>
> The specs does not tell us how much time to wait for CCR operation to
> complete. max(cqt, kato) is an estimate I think reasonable to make.
So, we've sent CCR, received a CQE for the CCR within KATO (timeout in
nvme_issue_wait_ccr()), then are waiting another max(KATO, CQT) for the
io to die.
As CQT is the time to wait once the ctrl is killing the io, and as the
response indicated it certainly passed that point, a minimum of CQT
should be all that is needed. Why are we bringing KATO into the picture?
-- this takes me over to patch 8 and the timeout on the CCR response being KATO:
Why is KATO being used? Nothing about getting the response says it is
related to the keep-alive. Keepalive can move along happily while CCR
hangs out, and it really has nothing to do with KATO.
If using the rationale of keepalive cmd processing (round-trip time plus
minimal, prioritized processing): as CCR needs to do more, and as the
spec allows holding the command to always return 1, it should be
KATO+<something>, where <something> is no more than CQT.
But given that KATO can be really long, as it's trying to catch
communication failures, and as our CCR controller should not have comm
issues, it should be fairly quick. So rather than a 2min KATO, why not
10-15s? This gets a little crazy, as it takes me down paths of why not
fire off multiple CCRs (via different ctrls) to the subsystem at short
intervals (the timeout) to finally find one that completes quickly and
then start CQT. And if nothing completes quickly, bound the whole thing
to fencing start+KATO+CQT?
-- james
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-17 15:35 ` Mohamed Khalfella
2026-02-20 1:22 ` James Smart
@ 2026-02-20 2:01 ` Randy Jennings
2026-02-20 7:25 ` Hannes Reinecke
1 sibling, 1 reply; 61+ messages in thread
From: Randy Jennings @ 2026-02-20 2:01 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
Hannes,
> (ctrl->kato * 1000) + ctrl->cqt
As Mohamed pointed out, we have already received a response from a CCR
command. The CCR, once accepted, communicates the death of the
connection to the impacted controller and starts the cleanup tracked
by CQT. So, no need to wait for the impacted controller to figure out
the connection is down.
The max(cqt, kato) was just to give some wait time that should allow
issuing a CCR again from a different controller (in case of losing
communication with this one). It certainly does not need to be longer
than cqt (and it should be no longer than the remaining duration of
time-based retry; that should get addressed at some point). I cannot
remember why kato (if larger; I expect it would be smaller) made sense
at the time.
Sincerely,
Randy Jennings
On Tue, Feb 17, 2026 at 7:35 AM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
> > On 2/16/26 19:45, Mohamed Khalfella wrote:
> > > On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
> > >> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > >>> TP8028 Rapid Path Failure Recovery does not define how much time the
> > >>> host should wait for CCR operation to complete. It is reasonable to
> > >>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
> > >>> CCR operation to be max(ctrl->cqt, ctrl->kato).
> > >>>
> > >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > >>> ---
> > >>> drivers/nvme/host/core.c | 2 +-
> > >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > >>> index 0680d05900c1..ff479c0263ab 100644
> > >>> --- a/drivers/nvme/host/core.c
> > >>> +++ b/drivers/nvme/host/core.c
> > >>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > >>> if (result & 0x01) /* Immediate Reset Successful */
> > >>> goto out;
> > >>>
> > >>> - tmo = secs_to_jiffies(ictrl->kato);
> > >>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> > >>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > >>> ret = -ETIMEDOUT;
> > >>> goto out;
> > >>
> > >> That is not my understanding. I was under the impression that CQT is the
> > >> _additional_ time a controller requires to clear out outstanding
> > >> commands once it detected a loss of communication (ie _after_ KATO).
> > >> Which would mean we have to wait for up to
> > >> (ctrl->kato * 1000) + ctrl->cqt.
> > >
> > > At this point the source controller knows about communication loss. We
> > > do not need kato wait. In theory we should just wait for CQT.
> > > max(cqt, kato) is a conservative guess I made.
> > >
> > Not quite. The source controller (on the host!) knows about the
> > communication loss. But the target might not, as the keep-alive
> > command might have arrived at the target _just_ before KATO
> > triggered on the host. So the target is still good, and will
> > be waiting for _another_ KATO interval before declaring
> > a loss of communication.
> > And only then will the CQT period start at the target.
> >
> > Randy, please correct me if I'm wrong ...
> >
>
> wait_for_completion_timeout(&ccr.complete, tmo)) waits for CCR operation
> to complete. The wait starts after CCR command completed successfully.
> IOW, it starts after the host received a CQE from source controller on
> the target telling us all is good. If the source controller on the target
> already know about loss of communication then there is no need to wait
> for KATO. We just need to wait for CCR operation to finish because we
> know it has been started successfully.
>
> The specs does not tell us how much time to wait for CCR operation to
> complete. max(cqt, kato) is an estimate I think reasonable to make.
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-20 1:22 ` James Smart
@ 2026-02-20 2:11 ` Randy Jennings
2026-02-20 7:23 ` Hannes Reinecke
1 sibling, 0 replies; 61+ messages in thread
From: Randy Jennings @ 2026-02-20 2:11 UTC (permalink / raw)
To: James Smart
Cc: Mohamed Khalfella, Hannes Reinecke, Justin Tee,
Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Aaron Dailey, Dhaval Giani, linux-nvme, linux-kernel
On Thu, Feb 19, 2026 at 5:22 PM James Smart <jsmart833426@gmail.com> wrote:
>
> On 2/17/2026 7:35 AM, Mohamed Khalfella wrote:
> > On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
> >> On 2/16/26 19:45, Mohamed Khalfella wrote:
> >>> On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
> >>>> On 2/14/26 05:25, Mohamed Khalfella wrote:
> >>>>> TP8028 Rapid Path Failure Recovery does not define how much time the
> >>>>> host should wait for CCR operation to complete. It is reasonable to
> >>>>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
> >>>>> CCR operation to be max(ctrl->cqt, ctrl->kato).
> >>>>>
> >>>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>>>> ---
> >>>>> drivers/nvme/host/core.c | 2 +-
> >>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>>>> index 0680d05900c1..ff479c0263ab 100644
> >>>>> --- a/drivers/nvme/host/core.c
> >>>>> +++ b/drivers/nvme/host/core.c
> >>>>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> >>>>> if (result & 0x01) /* Immediate Reset Successful */
> >>>>> goto out;
> >>>>>
> >>>>> - tmo = secs_to_jiffies(ictrl->kato);
> >>>>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> >>>>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> >>>>> ret = -ETIMEDOUT;
> >>>>> goto out;
> >>>>
> >>>> That is not my understanding. I was under the impression that CQT is the
> >>>> _additional_ time a controller requires to clear out outstanding
> >>>> commands once it detected a loss of communication (ie _after_ KATO).
> >>>> Which would mean we have to wait for up to
> >>>> (ctrl->kato * 1000) + ctrl->cqt.
> >>>
> >>> At this point the source controller knows about communication loss. We
> >>> do not need kato wait. In theory we should just wait for CQT.
> >>> max(cqt, kato) is a conservative guess I made.
> >>>
> >> Not quite. The source controller (on the host!) knows about the
> >> communication loss. But the target might not, as the keep-alive
> >> command might have arrived at the target _just_ before KATO
> >> triggered on the host. So the target is still good, and will
> >> be waiting for _another_ KATO interval before declaring
> >> a loss of communication.
> >> And only then will the CQT period start at the target.
> >>
> >> Randy, please correct me if I'm wrong ...
> >>
> >
> > wait_for_completion_timeout(&ccr.complete, tmo)) waits for CCR operation
> > to complete. The wait starts after CCR command completed successfully.
> > IOW, it starts after the host received a CQE from source controller on
> > the target telling us all is good. If the source controller on the target
> > already know about loss of communication then there is no need to wait
> > for KATO. We just need to wait for CCR operation to finish because we
> > know it has been started successfully.
> >
> > The specs does not tell us how much time to wait for CCR operation to
> > complete. max(cqt, kato) is an estimate I think reasonable to make.
>
> So, we've sent CCR, received a CQE for the CCR within KATO (timeout in
> nvme_issue_wait_ccr()), then are waiting another max(KATO, CQT) for the
> io to die.
>
> As CQT is the time to wait once the ctrl is killing the io, and as the
> response indicated it certainly passed that point, a minimum of CQT
> should be all that is needed. Why are we bringing KATO into the picture?
Good point.
>
> -- this takes me over to patch 8 and the timeout on CCR response being KATO:
> Why is KATO being used ? nothing about getting the response says it is
> related to the keep alive. Keepalive can move along happily while CCR
> hangs out and really has nothing to do with KATO.
>
> If using the rationale of a keepalive cmd processing - has roundtrip
> time and minimal and prioritized processing, as CCR needs to do more and
> as the spec allows holding on to always return 1, it should be
> KATO+<something>, where <something> is no more than CQT.
Well, CCR was supposed to decide to fail at some time less than CQT
on the controller. But I see your reasoning. Using the normal admin
timeout would probably also work.
> But given that KATO can be really long as its trying to catch
> communication failures, and as our ccr controller should not have comm
> issues, it should be fairly quick. So rather than a 2min KATO, why not
> 10-15s ?
Ugh. 2 minute KATO? Have you seen that in the field? I've
seen 5-30 seconds.
> This gets a little crazy as it takes me down paths of why not
> fire off multiple CCRs (via different ctlrs) to the subsystem at short
> intervals (the timeout) to finally find one that completes quickly and
> then start CQT.
This is an interesting idea. That said, there was concern in the
group that controllers would have a low CCRL (like, 4). And I would
expect some paths down to be correlated (when connected to
an HA pair subsystem).
I was not sure why the expected limit would be low; the
implementation I am considering should have a rather large limit,
so I like your idea.
> And if nothing completes quickly bound the whole thing
> to fencing start+KATO+CQT ?
Well, 2x or 3x KATO.
Sincerely,
Randy Jennings
* Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion
2026-02-18 12:47 ` Mohamed Khalfella
@ 2026-02-20 3:34 ` Randy Jennings
0 siblings, 0 replies; 61+ messages in thread
From: Randy Jennings @ 2026-02-20 3:34 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On Wed, Feb 18, 2026 at 4:47 AM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Wed 2026-02-18 08:51:31 +0100, Hannes Reinecke wrote:
> > On 2/17/26 19:25, Mohamed Khalfella wrote:
> At the same time the current implementation only waits for 1*KATO before
> retrying, so there will be regression if we switch to TP8028-compliant
> KATO handling for systems not supporting CCR.
Hannes, as I read the code (this is patch 19), if CQT is not set,
there is no delay. I was expecting that to continue forward (I would
be happy to exclude '1' also). I agree that we would not want to use
CQT where subsystems have not requested that time to quiesce.
Am I reading this wrong, and you are worried that committed code
currently waits for 1*KATO, and this patch set shortens that? I do not
see a delay of 1*KATO in committed code. What am I missing?
> > > On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:
> > So we can (and should) use CCR as the determining factor whether we
> > want to switch to TP8028-compliant behaviour or stick with the original
> > implementation.
>
> We do check CCR support and availability in nvme_find_ctrl_ccr(). Adding
> a second counter will spare us the loop in nvme_find_ctrl_ccr(), which
> is not worth it IMO.
Another option is the Commands Supported log page. CCR is a command,
so support for it should show up there. The data structure is not the
simplest to reference; it might end up more complicated than having a
separate flag (why use another counter?).
RE:
> > want to switch to TP8028-compliant behaviour or stick with the original
> > implementation.
Hannes, do you mean TP8028 or TP4129? Yes, if we do not support CCRs
we should not send them or expect to receive a successful response.
I would be careful of stating this in terms of TP-compliant behavior. I
care about fixing a data corruption. TP4129 worked out what that
required and provided a channel to communicate how long the
subsystem took to clean up, but I really do not care much about
compliance outside of compatibility and predictability. As long as the
data corruption is handled conclusively and in a feasible manner
(IOW, no, the subsystem cannot clean up instantaneously, and we
do have to deal with possible communication delays while
coordinating between the host and subsystem), I can be happy
with the solution.
Sincerely,
Randy Jennings
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-20 1:22 ` James Smart
2026-02-20 2:11 ` Randy Jennings
@ 2026-02-20 7:23 ` Hannes Reinecke
1 sibling, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-20 7:23 UTC (permalink / raw)
To: James Smart, Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/20/26 02:22, James Smart wrote:
> On 2/17/2026 7:35 AM, Mohamed Khalfella wrote:
>> On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
>>> On 2/16/26 19:45, Mohamed Khalfella wrote:
>>>> On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
>>>>> On 2/14/26 05:25, Mohamed Khalfella wrote:
>>>>>> TP8028 Rapid Path Failure Recovery does not define how much time the
>>>>>> host should wait for CCR operation to complete. It is reasonable to
>>>>>> assume that CCR operation can take up to ctrl->cqt. Update wait
>>>>>> time for
>>>>>> CCR operation to be max(ctrl->cqt, ctrl->kato).
>>>>>>
>>>>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>>>>> ---
>>>>>> drivers/nvme/host/core.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>>> index 0680d05900c1..ff479c0263ab 100644
>>>>>> --- a/drivers/nvme/host/core.c
>>>>>> +++ b/drivers/nvme/host/core.c
>>>>>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct
>>>>>> nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
>>>>>> if (result & 0x01) /* Immediate Reset Successful */
>>>>>> goto out;
>>>>>> - tmo = secs_to_jiffies(ictrl->kato);
>>>>>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
>>>>>> if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
>>>>>> ret = -ETIMEDOUT;
>>>>>> goto out;
>>>>>
>>>>> That is not my understanding. I was under the impression that CQT
>>>>> is the
>>>>> _additional_ time a controller requires to clear out outstanding
>>>>> commands once it detected a loss of communication (ie _after_ KATO).
>>>>> Which would mean we have to wait for up to
>>>>> (ctrl->kato * 1000) + ctrl->cqt.
>>>>
>>>> At this point the source controller knows about communication loss. We
>>>> do not need kato wait. In theory we should just wait for CQT.
>>>> max(cqt, kato) is a conservative guess I made.
>>>>
>>> Not quite. The source controller (on the host!) knows about the
>>> communication loss. But the target might not, as the keep-alive
>>> command might have arrived at the target _just_ before KATO
>>> triggered on the host. So the target is still good, and will
>>> be waiting for _another_ KATO interval before declaring
>>> a loss of communication.
>>> And only then will the CQT period start at the target.
>>>
>>> Randy, please correct me if I'm wrong ...
>>>
>>
>> wait_for_completion_timeout(&ccr.complete, tmo)) waits for CCR operation
>> to complete. The wait starts after CCR command completed successfully.
>> IOW, it starts after the host received a CQE from source controller on
>> the target telling us all is good. If the source controller on the target
>> already know about loss of communication then there is no need to wait
>> for KATO. We just need to wait for CCR operation to finish because we
>> know it has been started successfully.
>>
>> The specs does not tell us how much time to wait for CCR operation to
>> complete. max(cqt, kato) is an estimate I think reasonable to make.
>
> So, we've sent CCR, received a CQE for the CCR within KATO (timeout in
> nvme_issue_wait_ccr()), then are waiting another max(KATO, CQT) for the
> io to die.
>
> As CQT is the time to wait once the ctrl is killing the io, and as the
> response indicated it certainly passed that point, a minimum of CQT
> should be all that is needed. Why are we bringing KATO into the picture?
>
Well, a successful CCR completion (without the IRS bit) just indicates
that the controller has started aborting commands.
The host still has to wait for that to finish.
The controller signals command abort completion via an AEN and a
corresponding logpage, for which we have to wait up to CQT.
But as commands are involved (we have to wait for the AEN) the
actual waiting time is max(KATO,CQT).
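The two waits under discussion can be written out as a small sketch (illustrative only; the helper names are hypothetical, with kato in seconds and cqt in milliseconds as in the patch):

```c
/* Wait for the CCR CQE itself: governed by KATO (patch 8 behaviour). */
static unsigned long ccr_response_wait_ms(unsigned int kato_s)
{
	return (unsigned long)kato_s * 1000;
}

/*
 * Wait for the abort-complete AEN after a successful (non-IRS) CCR
 * completion: the controller may take up to CQT to quiesce, but the
 * AEN delivery is itself bounded by KATO, hence max(CQT, KATO).
 */
static unsigned long ccr_aen_wait_ms(unsigned int cqt_ms, unsigned int kato_s)
{
	unsigned long kato_ms = (unsigned long)kato_s * 1000;

	return cqt_ms > kato_ms ? cqt_ms : kato_ms;
}
```

James's position in this subthread corresponds to dropping the KATO term from ccr_aen_wait_ms() and waiting CQT alone.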
> -- this takes me over to patch 8 and the timeout on CCR response being
> KATO:
> Why is KATO being used ? nothing about getting the response says it is
> related to the keep alive. Keepalive can move along happily while CCR
> hangs out and really has nothing to do with KATO.
>
The keepalive timeout is a measure of connectivity loss.
Or, more generally, the minimal time each side is required to wait
before declaring any command as 'lost' (a bit like R_A_TOV ...).
So sending the CCR command (and waiting for the response) is governed
by KATO.
> If using the rationale of a keepalive cmd processing - has roundtrip
> time and minimal and prioritized processing, as CCR needs to do more and
> as the spec allows holding on to always return 1, it should be
> KATO+<something>, where <something> is no more than CQT.
>
Again, this is not so much about the keepalive command but rather about
the _time_ each side is required to wait for the keepalive response.
Technically you are correct, though, and CCR should be treated just
like any other command. But the problem currently is that the nvme
timeout handler triggers on _command timeout_, not on KATO timeout.
We're trying to change that, but it takes time ...
> But given that KATO can be really long as its trying to catch
> communication failures, and as our ccr controller should not have comm
> issues, it should be fairly quick. So rather than a 2min KATO, why not
> 10-15s ? This gets a little crazy as it takes me down paths of why not
> fire off multiple CCRs (via different ctlrs) to the subsystem at short
> intervals (the timeout) to finally find one that completes quickly and
> then start CQT. And if nothing completes quickly bound the whole thing
> to fencing start+KATO+CQT ?
>
As it currently stands, CCR is only useful if the entire execution time
is significantly shorter than KATO.
In the current model error handling starts once KATO timeout triggers;
then CCR is sent and we're waiting for the AEN for max(CQT,KATO) before
retrying commands.
(I _think_ to be absolutely correct we would have to wait for CQT +
KATO, but that's beside the point).
So the main difference from the current error handling is the
additional waiting time for CCR, or max(CQT,KATO) if CCR fails.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-20 2:01 ` Randy Jennings
@ 2026-02-20 7:25 ` Hannes Reinecke
2026-02-27 3:05 ` Randy Jennings
0 siblings, 1 reply; 61+ messages in thread
From: Hannes Reinecke @ 2026-02-20 7:25 UTC (permalink / raw)
To: Randy Jennings, Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Aaron Dailey, Dhaval Giani, linux-nvme, linux-kernel
On 2/20/26 03:01, Randy Jennings wrote:
> Hannes,
>
>> (ctrl->kato * 1000) + ctrl->cqt
> As Mohamed pointed out, we have already received a response from a CCR
> command. The CCR, once accepted, communicates the death of the
> connection to the impacted controller and starts the cleanup tracked
> by CQT. So, no need to wait for the impacted controller to figure out
> the connection is down.
>
> The max(cqt, kato) was just to give some wait time that should allow
> issuing a CCR again from a different controller (in case of losing
> communication with this one). It certainly does not need to be longer
> than cqt (and it should be no longer than the remaining duration of
> time-based retry; that should get addressed at some point). I cannot
> remember why kato (if larger; I expect it would be smaller) made sense
> at the time.
>
Because we have to wait for the AEN, at which point KATO comes into
play yet again.
So max(CQT, KATO) is the appropriate waiting time for that.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v3 08/21] nvme: Implement cross-controller reset recovery
2026-02-14 4:25 ` [PATCH v3 08/21] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2026-02-16 12:41 ` Hannes Reinecke
@ 2026-02-26 2:37 ` Randy Jennings
2026-03-27 18:33 ` Mohamed Khalfella
1 sibling, 1 reply; 61+ messages in thread
From: Randy Jennings @ 2026-02-26 2:37 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On Fri, Feb 13, 2026 at 8:28 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> A host that has more than one path connecting to an nvme subsystem
> typically has an nvme controller associated with every path. This is
> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> path should not be retried immediately on another path because this
> could lead to data corruption as described in TP4129. TP8028 defines
> cross-controller reset mechanism that can be used by host to terminate
> IOs on the failed path using one of the remaining healthy paths. Only
> after IOs are terminated, or long enough time passes as defined by
> TP4129, inflight IOs should be retried on another path. Implement core
> cross-controller reset shared logic to be used by the transports.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> + if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> + ret = -ETIMEDOUT;
> + goto out;
> + }
The more I look at this, the less I can ignore that this tmo should be
capped by deadline - now.
> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> + deadline = now + msecs_to_jiffies(timeout);
> + while (time_before(now, deadline)) {
...
> + ret = nvme_issue_wait_ccr(sctrl, ictrl);
...
> + }
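The suggested cap could look like the sketch below, with plain unsigned
longs standing in for jiffies and a hypothetical helper name:

```c
/*
 * Cap the per-iteration CCR wait by the time remaining until the
 * overall fencing deadline, so the loop in nvme_fence_ctrl() can
 * never overshoot it.
 */
static unsigned long cap_ccr_tmo(unsigned long tmo, unsigned long now,
				 unsigned long deadline)
{
	/*
	 * The caller loops only while time_before(now, deadline), so
	 * the subtraction cannot underflow here.
	 */
	unsigned long remaining = deadline - now;

	return tmo < remaining ? tmo : remaining;
}
```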
Sincerely,
Randy Jennings
* Re: [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state
2026-02-14 4:25 ` [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
@ 2026-02-27 2:49 ` Randy Jennings
2026-02-28 1:10 ` James Smart
1 sibling, 0 replies; 61+ messages in thread
From: Randy Jennings @ 2026-02-27 2:49 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On Fri, Feb 13, 2026 at 8:28 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> While in FENCING state, aborted inflight IOs should be held until fencing
> is done. Update nvme_fc_fcpio_done() to not complete aborted requests or
> requests with transport errors. These held requests will be canceled in
> nvme_fc_delete_association() after fencing is done. nvme_fc_fcpio_done()
> avoids racing with canceling aborted requests by making sure we complete
> successful requests before waking up the waiting thread.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> static void nvme_fc_fencing_work(struct work_struct *work)
> @@ -1969,7 +1978,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> struct nvme_command *sqe = &op->cmd_iu.sqe;
> __le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
> union nvme_result result;
> - bool terminate_assoc = true;
> + bool op_term, terminate_assoc = true;
> + enum nvme_ctrl_state state;
> int opstate;
>
> /*
> @@ -2102,16 +2112,38 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> done:
> if (op->flags & FCOP_FLAGS_AEN) {
> nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);
> - __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> + if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
> + __nvme_fc_fcpop_count_one_down(ctrl);
> atomic_set(&op->state, FCPOP_STATE_IDLE);
> op->flags = FCOP_FLAGS_AEN; /* clear other flags */
> nvme_fc_ctrl_put(ctrl);
> goto check_error;
> }
>
> - __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> + /*
> + * We can not access op after the request is completed because it can
> + * be reused immediately. At the same time we want to wakeup the thread
> + * waiting for ongoing IOs _after_ requests are completed. This is
> + * necessary because that thread will start canceling inflight IOs
> + * and we want to avoid request completion racing with cancellation.
> + */
> + op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> +
> + /*
> + * If we are going to terminate associations and the controller is
> + * LIVE or FENCING, then do not complete this request now. Let error
> + * recovery cancel this request when it is safe to do so.
> + */
> + state = nvme_ctrl_state(&ctrl->ctrl);
> + if (terminate_assoc &&
> + (state == NVME_CTRL_LIVE || state == NVME_CTRL_FENCING))
> + goto check_op_term;
> +
> if (!nvme_try_complete_req(rq, status, result))
> nvme_fc_complete_rq(rq);
> +check_op_term:
> + if (op_term)
> + __nvme_fc_fcpop_count_one_down(ctrl);
Although it is a more complicated boolean expression, I think it is
easier to grok:
> + if (!(terminate_assoc &&
> + (state == NVME_CTRL_LIVE || state == NVME_CTRL_FENCING)) &&
> + !nvme_try_complete_req(rq, status, result))
> nvme_fc_complete_rq(rq);
resulting in one less goto.
Sincerely,
Randy Jennings
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-20 7:25 ` Hannes Reinecke
@ 2026-02-27 3:05 ` Randy Jennings
2026-03-02 7:32 ` Hannes Reinecke
0 siblings, 1 reply; 61+ messages in thread
From: Randy Jennings @ 2026-02-27 3:05 UTC (permalink / raw)
To: Hannes Reinecke
Cc: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On Thu, Feb 19, 2026 at 11:25 PM Hannes Reinecke <hare@suse.de> wrote:
>
> On 2/20/26 03:01, Randy Jennings wrote:
> > Hannes,
> >
> >> (ctrl->kato * 1000) + ctrl->cqt
> > As Mohamed pointed out, we have already received a response from a CCR
> > command. The CCR, once accepted, communicates the death of the
> > connection to the impacted controller and starts the cleanup tracked
> > by CQT. So, no need to wait for the impacted controller to figure out
> > the connection is down.
> >
> > The max(cqt, kato) was just to give some wait time that should allow
> > issuing a CCR again from a different controller (in case of losing
> > communication with this one). It certainly does not need to be longer
> > than cqt (and it should be no longer than the remaining duration of
> > time-based retry; that should get addressed at some point). I cannot
> > remember why kato (if larger; I expect it would be smaller) made sense
> > at the time.
> >
> Because we have to wait for the AEN, at which point KATO comes into
> play yet again.
> So max(CQT, KATO) is the appropriate waiting time for that.
I see your point. It could take ~KATO time for the AEN to show up after
the CCR operation finishes. Technically true. However, if responses
are taking KATO time to get back to the host, I think I would rather
retry on a healthier link.
Sincerely,
Randy Jennings
* Re: [PATCH v3 03/21] nvmet: Implement CCR nvme command
2026-02-14 4:25 ` [PATCH v3 03/21] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2026-02-27 16:30 ` Maurizio Lombardi
2026-03-25 18:52 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: Maurizio Lombardi @ 2026-02-27 16:30 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On Sat Feb 14, 2026 at 5:25 AM CET, Mohamed Khalfella wrote:
> Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> Reset) command is an nvme command issued to source controller by
> initiator to reset impacted controller. Implement CCR command for linux
> nvme target.
>
> +
> + new_ccr = kmalloc(sizeof(*new_ccr), GFP_KERNEL);
> + if (!new_ccr) {
> + status = NVME_SC_INTERNAL;
> + goto out_unlock;
> + }
Nit: kmalloc_obj() is now the preferred helper for this kind of memory
allocation; see commits 69050f8d6d075dc01a and 189f164e573e18d.
scripts/checkpatch.pl is supposed to print a warning here, but there
must be a problem with the regex since it doesn't catch it.
Maurizio
> +
> + new_ccr->ciu = cmd->ccr.ciu;
> + new_ccr->icid = cntlid;
> + new_ccr->ctrl = ictrl;
> + list_add_tail(&new_ccr->entry, &sctrl->ccr_list);
> +
> +out_unlock:
> + mutex_unlock(&sctrl->lock);
> + if (status == NVME_SC_SUCCESS)
> + nvmet_ctrl_fatal_error(ictrl);
> + nvmet_ctrl_put(ictrl);
> +out:
> + nvmet_req_complete(req, status);
> +}
> +
> u32 nvmet_admin_cmd_data_len(struct nvmet_req *req)
> {
> struct nvme_command *cmd = req->cmd;
> @@ -1691,6 +1762,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
> case nvme_admin_keep_alive:
> req->execute = nvmet_execute_keep_alive;
> return 0;
> + case nvme_admin_cross_ctrl_reset:
> + req->execute = nvmet_execute_cross_ctrl_reset;
> + return 0;
> default:
> return nvmet_report_invalid_opcode(req);
> }
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index e5f413405604..38f71e1a1b8e 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -115,6 +115,20 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
> return 0;
> }
>
> +void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all)
> +{
> + struct nvmet_ccr *ccr, *tmp;
> +
> + lockdep_assert_held(&ctrl->lock);
> +
> + list_for_each_entry_safe(ccr, tmp, &ctrl->ccr_list, entry) {
> + if (all || ccr->ctrl == NULL) {
> + list_del(&ccr->entry);
> + kfree(ccr);
> + }
> + }
> +}
> +
> static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
> {
> struct nvmet_ns *cur;
> @@ -1397,6 +1411,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
> if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
> ctrl->cirn = get_random_u64();
> + nvmet_ctrl_cleanup_ccrs(ctrl, false);
> }
> ctrl->csts = NVME_CSTS_RDY;
>
> @@ -1502,6 +1517,35 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> return ctrl;
> }
>
> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> + const char *hostnqn, u8 ciu,
> + u16 cntlid, u64 cirn)
> +{
> + struct nvmet_ctrl *ctrl, *ictrl = NULL;
> + bool found = false;
> +
> + mutex_lock(&subsys->lock);
> + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> + if (ctrl->cntlid != cntlid)
> + continue;
> +
> + /* Avoid racing with a controller that is becoming ready */
> + mutex_lock(&ctrl->lock);
> + if (ctrl->ciu == ciu && ctrl->cirn == cirn)
> + found = true;
> + mutex_unlock(&ctrl->lock);
> +
> + if (found) {
> + if (kref_get_unless_zero(&ctrl->ref))
> + ictrl = ctrl;
> + break;
> + }
> + };
> + mutex_unlock(&subsys->lock);
> +
> + return ictrl;
> +}
> +
> u16 nvmet_check_ctrl_status(struct nvmet_req *req)
> {
> if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) {
> @@ -1627,6 +1671,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> subsys->clear_ids = 1;
> #endif
>
> + INIT_LIST_HEAD(&ctrl->ccr_list);
> INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
> INIT_LIST_HEAD(&ctrl->async_events);
> INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
> @@ -1740,12 +1785,43 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> }
> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>
> +static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> +{
> + struct nvmet_subsys *subsys = ctrl->subsys;
> + struct nvmet_ctrl *sctrl;
> + struct nvmet_ccr *ccr;
> +
> + lockdep_assert_held(&subsys->lock);
> +
> + /* Cleanup all CCRs issued by ctrl as source controller */
> + mutex_lock(&ctrl->lock);
> + nvmet_ctrl_cleanup_ccrs(ctrl, true);
> + mutex_unlock(&ctrl->lock);
> +
> + /*
> + * Find all CCRs targeting ctrl as impacted controller and
> + * set ccr->ctrl to NULL. This tells the source controller
> + * that CCR completed successfully.
> + */
> + list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> + mutex_lock(&sctrl->lock);
> + list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
> + if (ccr->ctrl == ctrl) {
> + ccr->ctrl = NULL;
> + break;
> + }
> + }
> + mutex_unlock(&sctrl->lock);
> + }
> +}
> +
> static void nvmet_ctrl_free(struct kref *ref)
> {
> struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
> struct nvmet_subsys *subsys = ctrl->subsys;
>
> mutex_lock(&subsys->lock);
> + nvmet_ctrl_complete_pending_ccr(ctrl);
> nvmet_ctrl_destroy_pr(ctrl);
> nvmet_release_p2p_ns_map(ctrl);
> list_del(&ctrl->subsys_entry);
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index a36daa5d3a57..b06d905c08c8 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -268,6 +268,7 @@ struct nvmet_ctrl {
> u32 kato;
> u64 cirn;
>
> + struct list_head ccr_list;
> struct nvmet_port *port;
>
> u32 aen_enabled;
> @@ -314,6 +315,13 @@ struct nvmet_ctrl {
> struct nvmet_pr_log_mgr pr_log_mgr;
> };
>
> +struct nvmet_ccr {
> + struct nvmet_ctrl *ctrl;
> + struct list_head entry;
> + u16 icid;
> + u8 ciu;
> +};
> +
> struct nvmet_subsys {
> enum nvme_subsys_type type;
>
> @@ -576,6 +584,7 @@ void nvmet_req_free_sgls(struct nvmet_req *req);
> void nvmet_execute_set_features(struct nvmet_req *req);
> void nvmet_execute_get_features(struct nvmet_req *req);
> void nvmet_execute_keep_alive(struct nvmet_req *req);
> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req);
>
> u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
> u16 nvmet_check_io_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
> @@ -618,6 +627,10 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args);
> struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> const char *hostnqn, u16 cntlid,
> struct nvmet_req *req);
> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> + const char *hostnqn, u8 ciu,
> + u16 cntlid, u64 cirn);
> +void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all);
> void nvmet_ctrl_put(struct nvmet_ctrl *ctrl);
> u16 nvmet_check_ctrl_status(struct nvmet_req *req);
> ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl,
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 7746b6d30349..d9b421dc1ef3 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -22,6 +22,7 @@
> #define NVMF_TSAS_SIZE 256
>
> #define NVMF_CCR_LIMIT 4
> +#define NVMF_CCR_PER_PAGE 511
>
> #define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"
>
> @@ -1222,6 +1223,22 @@ struct nvme_zone_mgmt_recv_cmd {
> __le32 cdw14[2];
> };
>
> +struct nvme_cross_ctrl_reset_cmd {
> + __u8 opcode;
> + __u8 flags;
> + __u16 command_id;
> + __le32 nsid;
> + __le64 rsvd2[2];
> + union nvme_data_ptr dptr;
> + __u8 rsvd10;
> + __u8 ciu;
> + __le16 icid;
> + __le32 cdw11;
> + __le64 cirn;
> + __le32 cdw14;
> + __le32 cdw15;
> +};
> +
> struct nvme_io_mgmt_recv_cmd {
> __u8 opcode;
> __u8 flags;
> @@ -1320,6 +1337,7 @@ enum nvme_admin_opcode {
> nvme_admin_virtual_mgmt = 0x1c,
> nvme_admin_nvme_mi_send = 0x1d,
> nvme_admin_nvme_mi_recv = 0x1e,
> + nvme_admin_cross_ctrl_reset = 0x38,
> nvme_admin_dbbuf = 0x7C,
> nvme_admin_format_nvm = 0x80,
> nvme_admin_security_send = 0x81,
> @@ -1353,6 +1371,7 @@ enum nvme_admin_opcode {
> nvme_admin_opcode_name(nvme_admin_virtual_mgmt), \
> nvme_admin_opcode_name(nvme_admin_nvme_mi_send), \
> nvme_admin_opcode_name(nvme_admin_nvme_mi_recv), \
> + nvme_admin_opcode_name(nvme_admin_cross_ctrl_reset), \
> nvme_admin_opcode_name(nvme_admin_dbbuf), \
> nvme_admin_opcode_name(nvme_admin_format_nvm), \
> nvme_admin_opcode_name(nvme_admin_security_send), \
> @@ -2006,6 +2025,7 @@ struct nvme_command {
> struct nvme_dbbuf dbbuf;
> struct nvme_directive_cmd directive;
> struct nvme_io_mgmt_recv_cmd imr;
> + struct nvme_cross_ctrl_reset_cmd ccr;
> };
> };
>
> @@ -2170,6 +2190,9 @@ enum {
> NVME_SC_PMR_SAN_PROHIBITED = 0x123,
> NVME_SC_ANA_GROUP_ID_INVALID = 0x124,
> NVME_SC_ANA_ATTACH_FAILED = 0x125,
> + NVME_SC_CCR_IN_PROGRESS = 0x13f,
> + NVME_SC_CCR_LOGPAGE_FULL = 0x140,
> + NVME_SC_CCR_LIMIT_EXCEEDED = 0x141,
>
> /*
> * I/O Command Set Specific - NVM commands:
* Re: [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset
2026-02-14 4:25 ` [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2026-02-28 0:12 ` James Smart
2026-03-26 2:37 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: James Smart @ 2026-02-28 0:12 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel, jsmart833426
On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> in CONNECTING state results in deadlock reported in link below. Update
> nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
This seems misleading on what is changing...
How about:
Add a new nvme_fc_start_ioerr_recovery() routine which effectively
"resets" the controller.
Refactor error points that invoked routines that reset the controller
to now call nvme_fc_start_ioerr_recovery().
Eliminate the io abort on io error, as we will be resetting the controller.
>
> Previous to this change if controller was LIVE error recovery resets
> the controller and this does not match nvme-tcp and nvme-rdma. Decouple
> error recovery from controller reset to match other fabric transports.
Please delete. It's irrelevant to the patch.
...
> @@ -1871,7 +1874,22 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> struct nvme_fc_ctrl *ctrl =
> container_of(work, struct nvme_fc_ctrl, ioerr_work);
>
> - nvme_fc_error_recovery(ctrl, "transport detected io error");
> + /*
> + * if an error (io timeout, etc) while (re)connecting, the remote
> + * port requested terminating of the association (disconnect_ls)
> + * or an error (timeout or abort) occurred on an io while creating
> + * the controller. Abort any ios on the association and let the
> + * create_association error path resolve things.
> + */
> + if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_CONNECTING) {
> + __nvme_fc_abort_outstanding_ios(ctrl, true);
> + dev_warn(ctrl->ctrl.device,
> + "NVME-FC{%d}: transport error during (re)connect\n",
> + ctrl->cnum);
> + return;
> + }
> +
> + nvme_fc_error_recovery(ctrl);
> }
ok - but see below...
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> + char *errmsg)
> +{
> + enum nvme_ctrl_state state;
> +
> + if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
> + dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> + ctrl->cnum, errmsg);
> + queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> + return;
> + }
> +
> + state = nvme_ctrl_state(&ctrl->ctrl);
> + if (state == NVME_CTRL_CONNECTING || state == NVME_CTRL_DELETING ||
> + state == NVME_CTRL_DELETING_NOIO) {
> + queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> + }
> +}
What bothers me about this (true of the tcp and rdma transports too) is
that there is little difference between this and using the core
nvme_reset_ctrl(), except that even when the state change fails, the
code continues to schedule the work element that does the reset.
And the latter odd snippet to reset anyway exists only so the CONNECTING
code path, which failed the RESETTING transition, still gets performed.
I'd prefer the CONNECTING snippet be at the top of start_ioerr_recovery(),
before any state change attempt, so it's in the same place as before.
...
> static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> {
> struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> @@ -2536,24 +2539,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> struct nvme_command *sqe = &cmdiu->sqe;
>
> - /*
> - * Attempt to abort the offending command. Command completion
> - * will detect the aborted io and will fail the connection.
> - */
> dev_info(ctrl->ctrl.device,
> "NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> "x%08x/x%08x\n",
> ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> nvme_fabrics_opcode_str(qnum, sqe),
> sqe->common.cdw10, sqe->common.cdw11);
> - if (__nvme_fc_abort_op(ctrl, op))
> - nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>
> - /*
> - * the io abort has been initiated. Have the reset timer
> - * restarted and the abort completion will complete the io
> - * shortly. Avoids a synchronous wait while the abort finishes.
> - */
> + nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> return BLK_EH_RESET_TIMER;
> }
I eventually gave in on not aborting the io, as start_ioerr_recovery()
will be resetting the controller.
>
> @@ -3352,6 +3345,27 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> }
> }
>
> +static void
> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> +{
> + nvme_stop_keep_alive(&ctrl->ctrl);
> + nvme_stop_ctrl(&ctrl->ctrl);
> + flush_work(&ctrl->ctrl.async_event_work);
> +
> + /* will block while waiting for io to terminate */
> + nvme_fc_delete_association(ctrl);
> +
> + /* Do not reconnect if controller is being deleted */
> + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> + return;
> +
> + if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> + queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> + return;
> + }
> +
> + nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> +}
>
> static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> .name = "fc",
There is no reason to duplicate the code that is already in ioerr_work.
I prototyped a simple service routine. The net result showed how little
reason there is to have both an ioerr_work and a reset_work, as they are
effectively the same. So I then eliminated ioerr_work and used reset_work
plus the service routine (keeping the nvme_fc_error_recovery() name).
Here's a revised diff for this patch... I have compiled but not tested it.
--- fc.c.START 2026-02-27 14:10:07.631705123 -0800
+++ fc.c 2026-02-27 15:41:09.777836476 -0800
@@ -166,7 +166,6 @@ struct nvme_fc_ctrl {
struct blk_mq_tag_set admin_tag_set;
struct blk_mq_tag_set tag_set;
- struct work_struct ioerr_work;
struct delayed_work connect_work;
struct kref ref;
@@ -227,6 +226,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
static struct device *fc_udev_device;
static void nvme_fc_complete_rq(struct request *rq);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+ char *errmsg);
/* *********************** FC-NVME Port Management
************************ */
@@ -788,7 +789,7 @@ nvme_fc_ctrl_connectivity_loss(struct nv
"Reconnect", ctrl->cnum);
set_bit(ASSOC_FAILED, &ctrl->flags);
- nvme_reset_ctrl(&ctrl->ctrl);
+ nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
}
/**
@@ -985,8 +986,6 @@ fc_dma_unmap_sg(struct device *dev, stru
static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
-static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char
*errmsg);
-
static void
__nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
{
@@ -1569,7 +1568,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmef
*/
/* fail the association */
- nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
+ nvme_fc_start_ioerr_recovery(ctrl,
+ "Disconnect Association LS received");
/* release the reference taken by nvme_fc_match_disconn_ls() */
nvme_fc_ctrl_put(ctrl);
@@ -1865,15 +1865,6 @@ __nvme_fc_fcpop_chk_teardowns(struct nvm
}
}
-static void
-nvme_fc_ctrl_ioerr_work(struct work_struct *work)
-{
- struct nvme_fc_ctrl *ctrl =
- container_of(work, struct nvme_fc_ctrl, ioerr_work);
-
- nvme_fc_error_recovery(ctrl, "transport detected io error");
-}
-
/*
* nvme_fc_io_getuuid - Routine called to get the appid field
* associated with request by the lldd
@@ -2049,9 +2040,8 @@ done:
nvme_fc_complete_rq(rq);
check_error:
- if (terminate_assoc &&
- nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
- queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+ if (terminate_assoc)
+ nvme_fc_start_ioerr_recovery(ctrl, "io error");
}
static int
@@ -2496,7 +2486,7 @@ __nvme_fc_abort_outstanding_ios(struct n
}
static void
-nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
+nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
{
enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
@@ -2515,17 +2505,15 @@ nvme_fc_error_recovery(struct nvme_fc_ct
return;
}
- /* Otherwise, only proceed if in LIVE state - e.g. on first error */
- if (state != NVME_CTRL_LIVE)
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
return;
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport association event: %s\n",
ctrl->cnum, errmsg);
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: resetting controller\n", ctrl->cnum);
-
- nvme_reset_ctrl(&ctrl->ctrl);
+ dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
+ ctrl->cnum, errmsg);
+ queue_work(nvme_reset_wq, &ctrl->ctrl.reset_work);
}
static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
@@ -2536,24 +2524,14 @@ static enum blk_eh_timer_return nvme_fc_
struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
struct nvme_command *sqe = &cmdiu->sqe;
- /*
- * Attempt to abort the offending command. Command completion
- * will detect the aborted io and will fail the connection.
- */
dev_info(ctrl->ctrl.device,
"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
"x%08x/x%08x\n",
ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
nvme_fabrics_opcode_str(qnum, sqe),
sqe->common.cdw10, sqe->common.cdw11);
- if (__nvme_fc_abort_op(ctrl, op))
- nvme_fc_error_recovery(ctrl, "io timeout abort failed");
- /*
- * the io abort has been initiated. Have the reset timer
- * restarted and the abort completion will complete the io
- * shortly. Avoids a synchronous wait while the abort finishes.
- */
+ nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
return BLK_EH_RESET_TIMER;
}
@@ -3264,7 +3242,7 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nc
* waiting for io to terminate
*/
nvme_fc_delete_association(ctrl);
- cancel_work_sync(&ctrl->ioerr_work);
+ cancel_work_sync(&ctrl->ctrl.reset_work);
if (ctrl->ctrl.tagset)
nvme_remove_io_tag_set(&ctrl->ctrl);
@@ -3324,20 +3302,27 @@ nvme_fc_reconnect_or_delete(struct nvme_
}
static void
-nvme_fc_reset_ctrl_work(struct work_struct *work)
+nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
{
- struct nvme_fc_ctrl *ctrl =
- container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
-
+ nvme_stop_keep_alive(&ctrl->ctrl);
+ flush_work(&ctrl->ctrl.async_event_work);
nvme_stop_ctrl(&ctrl->ctrl);
/* will block will waiting for io to terminate */
nvme_fc_delete_association(ctrl);
- if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+ enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
+
+ /* state change failure is ok if we started ctrl delete */
+ if (state == NVME_CTRL_DELETING ||
+ state == NVME_CTRL_DELETING_NOIO)
+ return;
+
dev_err(ctrl->ctrl.device,
- "NVME-FC{%d}: error_recovery: Couldn't change state "
- "to CONNECTING\n", ctrl->cnum);
+ "NVME-FC{%d}: error_recovery: Couldn't change "
+ "state to CONNECTING (%d)\n", ctrl->cnum, state);
+ }
if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
if (!queue_delayed_work(nvme_wq, &ctrl->connect_work, 0)) {
@@ -3352,6 +3337,15 @@ nvme_fc_reset_ctrl_work(struct work_stru
}
}
+static void
+nvme_fc_reset_ctrl_work(struct work_struct *work)
+{
+ struct nvme_fc_ctrl *ctrl =
+ container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
+
+ nvme_fc_error_recovery(ctrl);
+}
+
static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.name = "fc",
@@ -3483,7 +3477,6 @@ nvme_fc_alloc_ctrl(struct device *dev, s
INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
- INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);
/* io queue count */
@@ -3581,7 +3574,6 @@ nvme_fc_init_ctrl(struct device *dev, st
fail_ctrl:
nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
- cancel_work_sync(&ctrl->ioerr_work);
cancel_work_sync(&ctrl->ctrl.reset_work);
cancel_delayed_work_sync(&ctrl->connect_work);
* Re: [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error
2026-02-14 4:25 ` [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-28 1:03 ` James Smart
2026-03-26 17:40 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: James Smart @ 2026-02-28 1:03 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Hannes Reinecke, jsmart833426
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. Regardless of the success or failure of CCR
> operation the controller is transitioned to RESETTING state to continue
> error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/fc.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index e6ffaa19aba4..6ebabfb7e76d 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -166,6 +166,7 @@ struct nvme_fc_ctrl {
> struct blk_mq_tag_set admin_tag_set;
> struct blk_mq_tag_set tag_set;
>
> + struct work_struct fencing_work;
> struct work_struct ioerr_work;
> struct delayed_work connect_work;
>
> @@ -1868,6 +1869,24 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> }
> }
>
> +static void nvme_fc_fencing_work(struct work_struct *work)
> +{
> + struct nvme_fc_ctrl *fc_ctrl =
> + container_of(work, struct nvme_fc_ctrl, fencing_work);
> + struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> + unsigned long rem;
> +
> + rem = nvme_fence_ctrl(ctrl);
> + if (rem) {
> + dev_info(ctrl->device,
> + "CCR failed, skipping time-based recovery\n");
> + }
> +
> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
Catch the rework from the prior patch.
> +}
> +
> static void
> nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> {
> @@ -1889,6 +1908,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> return;
> }
>
> + flush_work(&ctrl->fencing_work);
> nvme_fc_error_recovery(ctrl);
> }
>
> @@ -1915,6 +1935,14 @@ static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> {
> enum nvme_ctrl_state state;
>
From the prior patch: the CONNECTING logic should be here...
> + if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
> + dev_warn(ctrl->ctrl.device,
> + "NVME-FC{%d}: starting controller fencing %s\n",
> + ctrl->cnum, errmsg);
> + queue_work(nvme_wq, &ctrl->fencing_work);
> + return;
> + }
> +
> if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
> dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> ctrl->cnum, errmsg);
> @@ -3322,6 +3350,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> struct nvme_fc_ctrl *ctrl =
> container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
>
> + flush_work(&ctrl->fencing_work);
> nvme_stop_ctrl(&ctrl->ctrl);
>
> /* will block will waiting for io to terminate */
> @@ -3497,6 +3526,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
>
> INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
> INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> + INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
> INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
> spin_lock_init(&ctrl->lock);
>
There is a little work to bring this in sync with my comment on the
prior patch, but otherwise what is here is fine.
What bothers me in this process is that there are certainly conditions
without connectivity loss where FC can send things such as an ABTS or a
Disconnect LS that inform the controller to start terminating. It's odd
that we skip this step and go directly to the CCR reset to terminate the
controller. We should have been able to continue sending the things that
directly tear down the controller, which could happen in parallel with
the CCR.
-- james
* Re: [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state
2026-02-14 4:25 ` [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
2026-02-27 2:49 ` Randy Jennings
@ 2026-02-28 1:10 ` James Smart
1 sibling, 0 replies; 61+ messages in thread
From: James Smart @ 2026-02-28 1:10 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Hannes Reinecke, jsmart833426
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> While in FENCING state, aborted inflight IOs should be held until fencing
> is done. Update nvme_fc_fcpio_done() to not complete aborted requests or
> requests with transport errors. These held requests will be canceled in
> nvme_fc_delete_association() after fencing is done. nvme_fc_fcpio_done()
> avoids racing with canceling aborted requests by making sure we complete
> successful requests before waking up the waiting thread.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/fc.c | 61 +++++++++++++++++++++++++++++++++++-------
> 1 file changed, 51 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 6ebabfb7e76d..e605dd3f4a40 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -172,7 +172,7 @@ struct nvme_fc_ctrl {
>
> struct kref ref;
> unsigned long flags;
> - u32 iocnt;
> + atomic_t iocnt;
> wait_queue_head_t ioabort_wait;
>
> struct nvme_fc_fcp_op aen_ops[NVME_NR_AEN_COMMANDS];
> @@ -1823,7 +1823,7 @@ __nvme_fc_abort_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_fcp_op *op)
> atomic_set(&op->state, opstate);
> else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
> op->flags |= FCOP_FLAGS_TERMIO;
> - ctrl->iocnt++;
> + atomic_inc(&ctrl->iocnt);
The atomic change is probably what corrects the deadlocks you saw.
> }
> spin_unlock_irqrestore(&ctrl->lock, flags);
>
> @@ -1853,20 +1853,29 @@ nvme_fc_abort_aen_ops(struct nvme_fc_ctrl *ctrl)
> }
>
> static inline void
> +__nvme_fc_fcpop_count_one_down(struct nvme_fc_ctrl *ctrl)
> +{
> + if (atomic_dec_return(&ctrl->iocnt) == 0)
> + wake_up(&ctrl->ioabort_wait);
> +}
> +
> +static inline bool
> __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> struct nvme_fc_fcp_op *op, int opstate)
> {
> unsigned long flags;
> + bool ret = false;
>
> if (opstate == FCPOP_STATE_ABORTED) {
> spin_lock_irqsave(&ctrl->lock, flags);
> if (test_bit(FCCTRL_TERMIO, &ctrl->flags) &&
> op->flags & FCOP_FLAGS_TERMIO) {
> - if (!--ctrl->iocnt)
> - wake_up(&ctrl->ioabort_wait);
> + ret = true;
> }
> spin_unlock_irqrestore(&ctrl->lock, flags);
> }
> +
> + return ret;
> }
>
> static void nvme_fc_fencing_work(struct work_struct *work)
> @@ -1969,7 +1978,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> struct nvme_command *sqe = &op->cmd_iu.sqe;
> __le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
> union nvme_result result;
> - bool terminate_assoc = true;
> + bool op_term, terminate_assoc = true;
> + enum nvme_ctrl_state state;
> int opstate;
>
> /*
> @@ -2102,16 +2112,38 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> done:
> if (op->flags & FCOP_FLAGS_AEN) {
> nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);
> - __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> + if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
> + __nvme_fc_fcpop_count_one_down(ctrl);
> atomic_set(&op->state, FCPOP_STATE_IDLE);
> op->flags = FCOP_FLAGS_AEN; /* clear other flags */
> nvme_fc_ctrl_put(ctrl);
> goto check_error;
> }
>
> - __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> + /*
> + * We can not access op after the request is completed because it can
> + * be reused immediately. At the same time we want to wakeup the thread
> + * waiting for ongoing IOs _after_ requests are completed. This is
> + * necessary because that thread will start canceling inflight IOs
> + * and we want to avoid request completion racing with cancellation.
> + */
> + op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> +
> + /*
> + * If we are going to terminate associations and the controller is
> + * LIVE or FENCING, then do not complete this request now. Let error
> + * recovery cancel this request when it is safe to do so.
> + */
> + state = nvme_ctrl_state(&ctrl->ctrl);
> + if (terminate_assoc &&
> + (state == NVME_CTRL_LIVE || state == NVME_CTRL_FENCING))
> + goto check_op_term;
> +
> if (!nvme_try_complete_req(rq, status, result))
> nvme_fc_complete_rq(rq);
> +check_op_term:
> + if (op_term)
> + __nvme_fc_fcpop_count_one_down(ctrl);
>
> check_error:
> if (terminate_assoc)
> @@ -2750,7 +2782,8 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
> * cmd with the csn was supposed to arrive.
> */
> opstate = atomic_xchg(&op->state, FCPOP_STATE_COMPLETE);
> - __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> + if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
> + __nvme_fc_fcpop_count_one_down(ctrl);
>
> if (!(op->flags & FCOP_FLAGS_AEN)) {
> nvme_fc_unmap_data(ctrl, op->rq, op);
> @@ -3219,7 +3252,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
>
> spin_lock_irqsave(&ctrl->lock, flags);
> set_bit(FCCTRL_TERMIO, &ctrl->flags);
> - ctrl->iocnt = 0;
> + atomic_set(&ctrl->iocnt, 0);
> spin_unlock_irqrestore(&ctrl->lock, flags);
>
> __nvme_fc_abort_outstanding_ios(ctrl, false);
> @@ -3228,11 +3261,19 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
> nvme_fc_abort_aen_ops(ctrl);
>
> /* wait for all io that had to be aborted */
> + wait_event(ctrl->ioabort_wait, atomic_read(&ctrl->iocnt) == 0);
> spin_lock_irq(&ctrl->lock);
> - wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
> clear_bit(FCCTRL_TERMIO, &ctrl->flags);
> spin_unlock_irq(&ctrl->lock);
>
> + /*
> + * At this point all inflight requests have been successfully
> + * aborted. Now it is safe to cancel all requests we decided
> + * not to complete in nvme_fc_fcpio_done().
> + */
> + nvme_cancel_tagset(&ctrl->ctrl);
> + nvme_cancel_admin_tagset(&ctrl->ctrl);
> +
> nvme_fc_term_aen_ops(ctrl);
>
> /*
This looks good
Signed-off-by: James Smart <jsmart833426@gmail.com>
-- james
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 15/21] nvme-fc: Do not cancel requests in io target before it is initialized
2026-02-14 4:25 ` [PATCH v3 15/21] nvme-fc: Do not cancel requests in io target before it is initialized Mohamed Khalfella
@ 2026-02-28 1:12 ` James Smart
0 siblings, 0 replies; 61+ messages in thread
From: James Smart @ 2026-02-28 1:12 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Hannes Reinecke, jsmart833426
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> A new nvme-fc controller in CONNECTING state that sees an admin request
> timeout schedules ctrl->ioerr_work to abort inflight requests. This
> ends up calling __nvme_fc_abort_outstanding_ios(), which aborts
> requests in both the admin and io tagsets. If fc_ctrl->tag_set was not
> yet initialized we see the warning below, because ctrl.queue_count is
> initialized early in nvme_fc_alloc_ctrl().
>
> nvme nvme0: NVME-FC{0}: starting error recovery Connectivity Loss
> lpfc 0000:ab:00.0: queue 0 connect admin queue failed (-6).
> INFO: trying to register non-static key.
> The code is fine but needs lockdep annotation, or maybe
> you didn't initialize this object before use?
> turning off the locking correctness validator.
> Workqueue: nvme-reset-wq nvme_fc_ctrl_ioerr_work [nvme_fc]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x57/0x80
> register_lock_class+0x567/0x580
> __lock_acquire+0x330/0xb90
> lock_acquire.part.0+0xad/0x210
> blk_mq_tagset_busy_iter+0xf9/0xc00
> __nvme_fc_abort_outstanding_ios+0x23f/0x320 [nvme_fc]
> nvme_fc_ctrl_ioerr_work+0x172/0x210 [nvme_fc]
> process_one_work+0x82c/0x1450
> worker_thread+0x5ee/0xfd0
> kthread+0x3a0/0x750
> ret_from_fork+0x439/0x670
> ret_from_fork_asm+0x1a/0x30
> </TASK>
>
> Update the check in __nvme_fc_abort_outstanding_ios() to confirm that
> the io tagset was created before iterating over busy requests. Also
> make sure to cancel ctrl->ioerr_work before removing the io tagset.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/fc.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index e605dd3f4a40..eac3a7ccaa5c 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -2557,7 +2557,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> * io requests back to the block layer as part of normal completions
> * (but with error status).
> */
> - if (ctrl->ctrl.queue_count > 1) {
> + if (ctrl->ctrl.queue_count > 1 && ctrl->ctrl.tagset) {
> nvme_quiesce_io_queues(&ctrl->ctrl);
> nvme_sync_io_queues(&ctrl->ctrl);
> blk_mq_tagset_busy_iter(&ctrl->tag_set,
> @@ -2954,6 +2954,11 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
> out_delete_hw_queues:
> nvme_fc_delete_hw_io_queues(ctrl);
> out_cleanup_tagset:
> + /*
> + * In CONNECTING state ctrl->ioerr_work will abort both admin
> + * and io tagsets. Cancel it first before removing io tagset.
> + */
> + cancel_work_sync(&ctrl->ioerr_work);
> nvme_remove_io_tag_set(&ctrl->ctrl);
> nvme_fc_free_io_queues(ctrl);
>
looks good
Signed-off-by: James Smart <jsmart833426@gmail.com>
-- james
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 21/21] nvme-fc: Extend FENCING state per TP4129 on CCR failure
2026-02-14 4:25 ` [PATCH v3 21/21] nvme-fc: " Mohamed Khalfella
@ 2026-02-28 1:20 ` James Smart
2026-03-25 19:07 ` Mohamed Khalfella
0 siblings, 1 reply; 61+ messages in thread
From: James Smart @ 2026-02-28 1:20 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Hannes Reinecke, jsmart833426
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
linux-kernel
On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> If CCR operations fail and CQT is supported, we must defer the retry of
> inflight requests per TP4129. Update ctrl->fencing_work to schedule
> ctrl->fenced_work, effectively extending the FENCING state. This delay
> ensures that inflight requests are held until it is safe for them to be
> retried.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/fc.c | 39 +++++++++++++++++++++++++++++++++++----
> 1 file changed, 35 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index eac3a7ccaa5c..81088a4ce298 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -167,6 +167,7 @@ struct nvme_fc_ctrl {
> struct blk_mq_tag_set tag_set;
>
> struct work_struct fencing_work;
> + struct delayed_work fenced_work;
> struct work_struct ioerr_work;
> struct delayed_work connect_work;
>
> @@ -1878,6 +1879,18 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> return ret;
> }
>
> +static void nvme_fc_fenced_work(struct work_struct *work)
> +{
> + struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> + struct nvme_fc_ctrl, fenced_work);
> + struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> +
> + dev_info(ctrl->device, "Time-based recovery finished\n");
> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
sync with comments on patch 12
> +}
> +
> static void nvme_fc_fencing_work(struct work_struct *work)
> {
> struct nvme_fc_ctrl *fc_ctrl =
> @@ -1886,16 +1899,33 @@ static void nvme_fc_fencing_work(struct work_struct *work)
> unsigned long rem;
>
> rem = nvme_fence_ctrl(ctrl);
> - if (rem) {
> + if (!rem)
> + goto done;
> +
> + if (!ctrl->cqt) {
> dev_info(ctrl->device,
> - "CCR failed, skipping time-based recovery\n");
> + "CCR failed, CQT not supported, skip time-based recovery\n");
> + goto done;
> }
>
> + dev_info(ctrl->device,
> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> + jiffies_to_msecs(rem));
> + queue_delayed_work(nvme_wq, &fc_ctrl->fenced_work, rem);
> + return;
> +
> +done:
> nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> }
>
> +static void nvme_fc_flush_fencing_works(struct nvme_fc_ctrl *ctrl)
> +{
> + flush_work(&ctrl->fencing_work);
> + flush_delayed_work(&ctrl->fenced_work);
> +}
> +
> static void
> nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> {
> @@ -1917,7 +1947,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> return;
> }
>
> - flush_work(&ctrl->fencing_work);
> + nvme_fc_flush_fencing_works(ctrl);
> nvme_fc_error_recovery(ctrl);
> }
>
> @@ -3396,7 +3426,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> struct nvme_fc_ctrl *ctrl =
> container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
>
> - flush_work(&ctrl->fencing_work);
> + nvme_fc_flush_fencing_works(ctrl);
> nvme_stop_ctrl(&ctrl->ctrl);
>
> /* will block will waiting for io to terminate */
> @@ -3573,6 +3603,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
> INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
> INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
> + INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_fc_fenced_work);
> INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
> spin_lock_init(&ctrl->lock);
>
looks ok.
Signed-off-by: James Smart <jsmart833426@gmail.com>
-- james
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT
2026-02-27 3:05 ` Randy Jennings
@ 2026-03-02 7:32 ` Hannes Reinecke
0 siblings, 0 replies; 61+ messages in thread
From: Hannes Reinecke @ 2026-03-02 7:32 UTC (permalink / raw)
To: Randy Jennings
Cc: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On 2/27/26 04:05, Randy Jennings wrote:
> On Thu, Feb 19, 2026 at 11:25 PM Hannes Reinecke <hare@suse.de> wrote:
>>
>> On 2/20/26 03:01, Randy Jennings wrote:
>>> Hannes,
>>>
>>>> (ctrl->kato * 1000) + ctrl->cqt
>>> As Mohamed pointed out, we have already received a response from a CCR
>>> command. The CCR, once accepted, communicates the death of the
>>> connection to the impacted controller and starts the cleanup tracked
>>> by CQT. So, no need to wait for the impacted controller to figure out
>>> the connection is down.
>>>
>>> The max(cqt, kato) was just to give some wait time that should allow
>>> issuing a CCR again from a different controller (in case of losing
>>> communication with this one). It certainly does not need to be longer
>>> than cqt (and it should be no longer than the remaining duration of
>>> time-based retry; that should get addressed at some point). I cannot
>>> remember why kato (if larger; I expect it would be smaller) made sense
>>> at the time.
>>>
>> Because we have to wait for the AEN, at which point KATO comes into
>> play yet again.
>> So max(CQT, KATO) is the appropriate waiting time for that.
> I see your point. It could take ~KATO time for the AEN to show up after
> the CCR operation finishes. Technically true. However, if responses
> are taking KATO time to get back to the host, I think would rather retry
> on a more healthy link.
>
Sure. But currently we don't have a policy for this; for us the
AEN is just a normal completion, for which we have to wait until
the KATO interval is exhausted.
We really should have a session or BOF about CCR handling at LSF.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 03/21] nvmet: Implement CCR nvme command
2026-02-27 16:30 ` Maurizio Lombardi
@ 2026-03-25 18:52 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-03-25 18:52 UTC (permalink / raw)
To: Maurizio Lombardi
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke, Aaron Dailey, Randy Jennings,
Dhaval Giani, linux-nvme, linux-kernel
On Fri 2026-02-27 17:30:29 +0100, Maurizio Lombardi wrote:
> On Sat Feb 14, 2026 at 5:25 AM CET, Mohamed Khalfella wrote:
> > Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> > Reset) command is an nvme command issued to source controller by
> > initiator to reset impacted controller. Implement CCR command for linux
> > nvme target.
> >
> > +
> > + new_ccr = kmalloc(sizeof(*new_ccr), GFP_KERNEL);
> > + if (!new_ccr) {
> > + status = NVME_SC_INTERNAL;
> > + goto out_unlock;
> > + }
>
> Nit: kmalloc_obj is now the preferred function for this kind of memory
> allocations, see commit 69050f8d6d075dc01a and 189f164e573e18d
>
> scripts/checkpatch.pl is supposed to print a warning
> but there must be a problem with the regex and doesn't catch it
>
Got it. Switched the allocation to use kmalloc_obj().
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 21/21] nvme-fc: Extend FENCING state per TP4129 on CCR failure
2026-02-28 1:20 ` James Smart
@ 2026-03-25 19:07 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-03-25 19:07 UTC (permalink / raw)
To: James Smart
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Hannes Reinecke, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Fri 2026-02-27 17:20:45 -0800, James Smart wrote:
> On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> > If CCR operations fail and CQT is supported, we must defer the retry of
> > inflight requests per TP4129. Update ctrl->fencing_work to schedule
> > ctrl->fenced_work, effectively extending the FENCING state. This delay
> > ensures that inflight requests are held until it is safe for them to be
> > retried.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/fc.c | 39 +++++++++++++++++++++++++++++++++++----
> > 1 file changed, 35 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index eac3a7ccaa5c..81088a4ce298 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -167,6 +167,7 @@ struct nvme_fc_ctrl {
> > struct blk_mq_tag_set tag_set;
> >
> > struct work_struct fencing_work;
> > + struct delayed_work fenced_work;
> > struct work_struct ioerr_work;
> > struct delayed_work connect_work;
> >
> > @@ -1878,6 +1879,18 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> > return ret;
> > }
> >
> > +static void nvme_fc_fenced_work(struct work_struct *work)
> > +{
> > + struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> > + struct nvme_fc_ctrl, fenced_work);
> > + struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> > +
> > + dev_info(ctrl->device, "Time-based recovery finished\n");
> > + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > + queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
>
> sync with comments on patch 12
I will do that. It has been suggested to move CQT changes into a
separate patchset and focus on CCR changes for now. I will drop
patches [16 - 21] from this patchset to be re-introduced later.
>
> > +}
> > +
> > static void nvme_fc_fencing_work(struct work_struct *work)
> > {
> > struct nvme_fc_ctrl *fc_ctrl =
> > @@ -1886,16 +1899,33 @@ static void nvme_fc_fencing_work(struct work_struct *work)
> > unsigned long rem;
> >
> > rem = nvme_fence_ctrl(ctrl);
> > - if (rem) {
> > + if (!rem)
> > + goto done;
> > +
> > + if (!ctrl->cqt) {
> > dev_info(ctrl->device,
> > - "CCR failed, skipping time-based recovery\n");
> > + "CCR failed, CQT not supported, skip time-based recovery\n");
> > + goto done;
> > }
> >
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + queue_delayed_work(nvme_wq, &fc_ctrl->fenced_work, rem);
> > + return;
> > +
> > +done:
> > nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> > }
> >
> > +static void nvme_fc_flush_fencing_works(struct nvme_fc_ctrl *ctrl)
> > +{
> > + flush_work(&ctrl->fencing_work);
> > + flush_delayed_work(&ctrl->fenced_work);
> > +}
> > +
> > static void
> > nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> > {
> > @@ -1917,7 +1947,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> > return;
> > }
> >
> > - flush_work(&ctrl->fencing_work);
> > + nvme_fc_flush_fencing_works(ctrl);
> > nvme_fc_error_recovery(ctrl);
> > }
> >
> > @@ -3396,7 +3426,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> > struct nvme_fc_ctrl *ctrl =
> > container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
> >
> > - flush_work(&ctrl->fencing_work);
> > + nvme_fc_flush_fencing_works(ctrl);
> > nvme_stop_ctrl(&ctrl->ctrl);
> >
> > /* will block will waiting for io to terminate */
> > @@ -3573,6 +3603,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
> > INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
> > INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> > INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
> > + INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_fc_fenced_work);
> > INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
> > spin_lock_init(&ctrl->lock);
> >
>
> looks ok.
>
> Signed-off-by: James Smart <jsmart833426@gmail.com>
>
> -- james
>
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset
2026-02-28 0:12 ` James Smart
@ 2026-03-26 2:37 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-03-26 2:37 UTC (permalink / raw)
To: James Smart
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Hannes Reinecke, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Fri 2026-02-27 16:12:05 -0800, James Smart wrote:
> On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> > nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> > in CONNECTING state results in deadlock reported in link below. Update
> > nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
>
> This seems misleading on what is changing...
>
> How about:
> Add new nvme_fc_start_ioerr_recovery() routine which effectively
> "resets" a the controller.
> Refactor error points that invoked routines that reset the controller
> to now call nvme_fc_start_ioerr_recovery().
> Eliminated io abort on io error, as we will be resetting the controller.
>
nvme-fc: Refactor IO error recovery
Added new nvme_fc_start_ioerr_recovery() to trigger error recovery
instead of directly queueing ctrl->ioerr_work. nvme_fc_error_recovery(),
now called only from ctrl->ioerr_work, has been updated to not depend on
nvme_reset_ctrl() to handle error recovery. nvme_fc_error_recovery()
effectively resets the controller and attempts reconnection if needed.
This makes nvme-fc ioerr handling similar to other fabric transports.
Update nvme_fc_timeout() to not abort timed out IOs. IOs aborted from
nvme_fc_timeout() are not accounted for in ctrl->iocnt and this causes
nvme_fc_delete_association() not to wait for them. Instead of aborting
IOs nvme_fc_timeout() calls nvme_fc_start_ioerr_recovery() to start IO
error recovery. Since error recovery runs in ctrl->ioerr_work, this
change fixes the issue reported in the link below.
Above is the updated commit message. Let me know if there is any part
you want me to change before I submit v4.
>
> >
> > Previous to this change if controller was LIVE error recovery resets
> > the controller and this does not match nvme-tcp and nvme-rdma. Decouple
> > error recovery from controller reset to match other fabric transports.
>
> Please delete. It's irrelevant to the patch.
Deleted.
>
>
> ...
> > @@ -1871,7 +1874,22 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> > struct nvme_fc_ctrl *ctrl =
> > container_of(work, struct nvme_fc_ctrl, ioerr_work);
> >
> > - nvme_fc_error_recovery(ctrl, "transport detected io error");
> > + /*
> > + * if an error (io timeout, etc) while (re)connecting, the remote
> > + * port requested terminating of the association (disconnect_ls)
> > + * or an error (timeout or abort) occurred on an io while creating
> > + * the controller. Abort any ios on the association and let the
> > + * create_association error path resolve things.
> > + */
> > + if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_CONNECTING) {
> > + __nvme_fc_abort_outstanding_ios(ctrl, true);
> > + dev_warn(ctrl->ctrl.device,
> > + "NVME-FC{%d}: transport error during (re)connect\n",
> > + ctrl->cnum);
> > + return;
> > + }
> > +
> > + nvme_fc_error_recovery(ctrl);
> > }
>
> ok - but see below...
>
>
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > + char *errmsg)
> > +{
> > + enum nvme_ctrl_state state;
> > +
> > + if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
> > + dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > + ctrl->cnum, errmsg);
> > + queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > + return;
> > + }
> > +
> > + state = nvme_ctrl_state(&ctrl->ctrl);
> > + if (state == NVME_CTRL_CONNECTING || state == NVME_CTRL_DELETING ||
> > + state == NVME_CTRL_DELETING_NOIO) {
> > + queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > + }
> > +}
>
> What bothers me about this (true of the tcp and rmda transports) is
> there is little difference between this and using the core
> nvme_reset_ctrl(), excepting that even when the state change fails, the
> code continues to schedule the work element that does the reset.
It bothers me too. The existence of controller reset and error recovery
as two separate and very similar codepaths has been pointed out in
earlier emails in this patchset. I think at some point the two codepaths
should be refactored. Until that happens, the change above should be
easy to understand.
>
> And the latter odd snippet to reset anyway is only to get the CONNECTING
> code snippet, which failed the RESETTING transition, to be performed.
> I'd prefer the connecting snippet be at the top of start_ioerr_recovery
> before any state change attempt so it's in the same place as prior.
Updated nvme_fc_start_ioerr_recovery() to handle the case of CONNECTING,
DELETING, DELETING_NOIO first.
>
>
> ...
> > static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> > {
> > struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> > @@ -2536,24 +2539,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> > struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> > struct nvme_command *sqe = &cmdiu->sqe;
> >
> > - /*
> > - * Attempt to abort the offending command. Command completion
> > - * will detect the aborted io and will fail the connection.
> > - */
> > dev_info(ctrl->ctrl.device,
> > "NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> > "x%08x/x%08x\n",
> > ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> > nvme_fabrics_opcode_str(qnum, sqe),
> > sqe->common.cdw10, sqe->common.cdw11);
> > - if (__nvme_fc_abort_op(ctrl, op))
> > - nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >
> > - /*
> > - * the io abort has been initiated. Have the reset timer
> > - * restarted and the abort completion will complete the io
> > - * shortly. Avoids a synchronous wait while the abort finishes.
> > - */
> > + nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> > return BLK_EH_RESET_TIMER;
> > }
>
> I eventually gave in on not doing the abort of the io as the
> start_ioerr_recovery() will be resetting the controller.
>
>
> >
> > @@ -3352,6 +3345,27 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> > }
> > }
> >
> > +static void
> > +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> > +{
> > + nvme_stop_keep_alive(&ctrl->ctrl);
> > + nvme_stop_ctrl(&ctrl->ctrl);
> > + flush_work(&ctrl->ctrl.async_event_work);
> > +
> > + /* will block while waiting for io to terminate */
> > + nvme_fc_delete_association(ctrl);
> > +
> > + /* Do not reconnect if controller is being deleted */
> > + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> > + return;
> > +
> > + if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> > + queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> > + return;
> > + }
> > +
> > + nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> > +}
> >
> > static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> > .name = "fc",
>
> There is no reason to duplicate the code that is already in ioerr_work.
> I prototyped a simple service routine. The net/net showed what little
> reason there is to have an ioerr_work and a reset_work - as they are
> effectively the same. So I then eliminated ioerr_work and use reset_work
> and the service routine (kept the nvme_fc_error_recovery() name).
>
>
> Here's a revised diff for this patch... I have compiled but not tested.
>
>
> --- fc.c.START 2026-02-27 14:10:07.631705123 -0800
> +++ fc.c 2026-02-27 15:41:09.777836476 -0800
> @@ -166,7 +166,6 @@ struct nvme_fc_ctrl {
> struct blk_mq_tag_set admin_tag_set;
> struct blk_mq_tag_set tag_set;
>
> - struct work_struct ioerr_work;
> struct delayed_work connect_work;
>
> struct kref ref;
> @@ -227,6 +226,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
> static struct device *fc_udev_device;
>
> static void nvme_fc_complete_rq(struct request *rq);
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> + char *errmsg);
>
> /* *********************** FC-NVME Port Management
> ************************ */
>
> @@ -788,7 +789,7 @@ nvme_fc_ctrl_connectivity_loss(struct nv
> "Reconnect", ctrl->cnum);
>
> set_bit(ASSOC_FAILED, &ctrl->flags);
> - nvme_reset_ctrl(&ctrl->ctrl);
> + nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
> }
>
> /**
> @@ -985,8 +986,6 @@ fc_dma_unmap_sg(struct device *dev, stru
> static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
>
> -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char
> *errmsg);
> -
> static void
> __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> {
> @@ -1569,7 +1568,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmef
> */
>
> /* fail the association */
> - nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> + nvme_fc_start_ioerr_recovery(ctrl,
> + "Disconnect Association LS received");
>
> /* release the reference taken by nvme_fc_match_disconn_ls() */
> nvme_fc_ctrl_put(ctrl);
> @@ -1865,15 +1865,6 @@ __nvme_fc_fcpop_chk_teardowns(struct nvm
> }
> }
>
> -static void
> -nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> -{
> - struct nvme_fc_ctrl *ctrl =
> - container_of(work, struct nvme_fc_ctrl, ioerr_work);
> -
> - nvme_fc_error_recovery(ctrl, "transport detected io error");
> -}
> -
> /*
> * nvme_fc_io_getuuid - Routine called to get the appid field
> * associated with request by the lldd
> @@ -2049,9 +2040,8 @@ done:
> nvme_fc_complete_rq(rq);
>
> check_error:
> - if (terminate_assoc &&
> - nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> - queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> + if (terminate_assoc)
> + nvme_fc_start_ioerr_recovery(ctrl, "io error");
> }
>
> static int
> @@ -2496,7 +2486,7 @@ __nvme_fc_abort_outstanding_ios(struct n
> }
>
> static void
> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> +nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> {
> enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
>
> @@ -2515,17 +2505,15 @@ nvme_fc_error_recovery(struct nvme_fc_ct
> return;
> }
>
> - /* Otherwise, only proceed if in LIVE state - e.g. on first error */
> - if (state != NVME_CTRL_LIVE)
> + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> return;
>
> dev_warn(ctrl->ctrl.device,
> "NVME-FC{%d}: transport association event: %s\n",
> ctrl->cnum, errmsg);
> - dev_warn(ctrl->ctrl.device,
> - "NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> -
> - nvme_reset_ctrl(&ctrl->ctrl);
> + dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> + ctrl->cnum, errmsg);
> + queue_work(nvme_reset_wq, &ctrl->ctrl.reset_work);
> }
>
> static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> @@ -2536,24 +2524,14 @@ static enum blk_eh_timer_return nvme_fc_
> struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> struct nvme_command *sqe = &cmdiu->sqe;
>
> - /*
> - * Attempt to abort the offending command. Command completion
> - * will detect the aborted io and will fail the connection.
> - */
> dev_info(ctrl->ctrl.device,
> "NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> "x%08x/x%08x\n",
> ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> nvme_fabrics_opcode_str(qnum, sqe),
> sqe->common.cdw10, sqe->common.cdw11);
> - if (__nvme_fc_abort_op(ctrl, op))
> - nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>
> - /*
> - * the io abort has been initiated. Have the reset timer
> - * restarted and the abort completion will complete the io
> - * shortly. Avoids a synchronous wait while the abort finishes.
> - */
> + nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> return BLK_EH_RESET_TIMER;
> }
>
> @@ -3264,7 +3242,7 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nc
> * waiting for io to terminate
> */
> nvme_fc_delete_association(ctrl);
> - cancel_work_sync(&ctrl->ioerr_work);
> + cancel_work_sync(&ctrl->ctrl.reset_work);
>
> if (ctrl->ctrl.tagset)
> nvme_remove_io_tag_set(&ctrl->ctrl);
> @@ -3324,20 +3302,27 @@ nvme_fc_reconnect_or_delete(struct nvme_
> }
>
> static void
> -nvme_fc_reset_ctrl_work(struct work_struct *work)
> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> {
> - struct nvme_fc_ctrl *ctrl =
> - container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
> -
> + nvme_stop_keep_alive(&ctrl->ctrl);
> + flush_work(&ctrl->ctrl.async_event_work);
> nvme_stop_ctrl(&ctrl->ctrl);
>
> /* will block will waiting for io to terminate */
> nvme_fc_delete_association(ctrl);
>
> - if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> + if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
> + enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> +
> + /* state change failure is ok if we started ctrl delete */
> + if (state == NVME_CTRL_DELETING ||
> + state == NVME_CTRL_DELETING_NOIO)
> + return;
> +
> dev_err(ctrl->ctrl.device,
> - "NVME-FC{%d}: error_recovery: Couldn't change state "
> - "to CONNECTING\n", ctrl->cnum);
> + "NVME-FC{%d}: error_recovery: Couldn't change "
> + "state to CONNECTING (%d)\n", ctrl->cnum, state);
> + }
>
> if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> if (!queue_delayed_work(nvme_wq, &ctrl->connect_work, 0)) {
> @@ -3352,6 +3337,15 @@ nvme_fc_reset_ctrl_work(struct work_stru
> }
> }
>
> +static void
> +nvme_fc_reset_ctrl_work(struct work_struct *work)
> +{
> + struct nvme_fc_ctrl *ctrl =
> + container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
> +
> + nvme_fc_error_recovery(ctrl);
> +}
> +
>
> static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> .name = "fc",
> @@ -3483,7 +3477,6 @@ nvme_fc_alloc_ctrl(struct device *dev, s
>
> INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
> INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> - INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
> spin_lock_init(&ctrl->lock);
>
> /* io queue count */
> @@ -3581,7 +3574,6 @@ nvme_fc_init_ctrl(struct device *dev, st
>
> fail_ctrl:
> nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
> - cancel_work_sync(&ctrl->ioerr_work);
> cancel_work_sync(&ctrl->ctrl.reset_work);
> cancel_delayed_work_sync(&ctrl->connect_work);
>
nvme_fc_timeout() ->
  nvme_fc_start_ioerr_recovery() ->
    __nvme_fc_abort_outstanding_ios() ->
      blk_sync_queue();

The codepath in the patch above will cause a deadlock.

nvme_fc_unregister_remoteport() ->
  nvme_fc_ctrl_connectivity_loss() ->
    nvme_fc_start_ioerr_recovery()

nvme_fc_fcpio_done() ->
  nvme_fc_start_ioerr_recovery()

The above codepaths use LLDD threads to do recovery. I thought we
should not be doing that.
^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error
2026-02-28 1:03 ` James Smart
@ 2026-03-26 17:40 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-03-26 17:40 UTC (permalink / raw)
To: James Smart
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
Hannes Reinecke, Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme, linux-kernel
On Fri 2026-02-27 17:03:55 -0800, James Smart wrote:
> On 2/13/2026 8:25 PM, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to FENCING
> > state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> > terminate inflight IOs. Regardless of the success or failure of CCR
> > operation the controller is transitioned to RESETTING state to continue
> > error recovery process.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/fc.c | 30 ++++++++++++++++++++++++++++++
> > 1 file changed, 30 insertions(+)
> >
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index e6ffaa19aba4..6ebabfb7e76d 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -166,6 +166,7 @@ struct nvme_fc_ctrl {
> > struct blk_mq_tag_set admin_tag_set;
> > struct blk_mq_tag_set tag_set;
> >
> > + struct work_struct fencing_work;
> > struct work_struct ioerr_work;
> > struct delayed_work connect_work;
> >
> > @@ -1868,6 +1869,24 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> > }
> > }
> >
> > +static void nvme_fc_fencing_work(struct work_struct *work)
> > +{
> > + struct nvme_fc_ctrl *fc_ctrl =
> > + container_of(work, struct nvme_fc_ctrl, fencing_work);
> > + struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> > + unsigned long rem;
> > +
> > + rem = nvme_fence_ctrl(ctrl);
> > + if (rem) {
> > + dev_info(ctrl->device,
> > + "CCR failed, skipping time-based recovery\n");
> > + }
> > +
> > + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > + queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
>
> catch the rework of prior patch
I ended up not dropping ctrl->ioerr_work. There are situations where we
need the error recovery work to run on a separate thread.
>
> > +}
> > +
> > static void
> > nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> > {
> > @@ -1889,6 +1908,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> > return;
> > }
> >
> > + flush_work(&ctrl->fencing_work);
> > nvme_fc_error_recovery(ctrl);
> > }
> >
> > @@ -1915,6 +1935,14 @@ static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > {
> > enum nvme_ctrl_state state;
> >
> From prior patch - the CONNECTING logic should be here....
Yes, it is here. The check for CONNECTING state is at the top of
nvme_fc_start_ioerr_recovery().
>
> > + if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
> > + dev_warn(ctrl->ctrl.device,
> > + "NVME-FC{%d}: starting controller fencing %s\n",
> > + ctrl->cnum, errmsg);
> > + queue_work(nvme_wq, &ctrl->fencing_work);
> > + return;
> > + }
> > +
> > if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING)) {
> > dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > ctrl->cnum, errmsg);
> > @@ -3322,6 +3350,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> > struct nvme_fc_ctrl *ctrl =
> > container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
> >
> > + flush_work(&ctrl->fencing_work);
> > nvme_stop_ctrl(&ctrl->ctrl);
> >
> > /* will block will waiting for io to terminate */
> > @@ -3497,6 +3526,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
> >
> > INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
> > INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> > + INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
> > INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
> > spin_lock_init(&ctrl->lock);
> >
>
> there is a little to be in sync with my comment on the prior patch, but
> otherwise what is here is fine.
>
> What bothers me in this process is - there are certainly conditions
> where there is no connectivity loss, where FC can send things such as
> the ABTS or a Disconnect LS that can inform the controller to start
> terminating. It's odd that we skip this step and go directly to the CCR
> reset to terminate the controller. We should have been able to continue
> to send the things that start to directly tear down the controller, which
> can happen in parallel with the CCR.
Depending on how the target is implemented, ABTS or a Disconnect LS does
not guarantee that inflight IOs are terminated. The main point of CCR is
to terminate inflight IOs, making it safe to retry the failed IOs.
* Re: [PATCH v3 08/21] nvme: Implement cross-controller reset recovery
2026-02-26 2:37 ` Randy Jennings
@ 2026-03-27 18:33 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-03-27 18:33 UTC (permalink / raw)
To: Randy Jennings
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke, Aaron Dailey, Dhaval Giani,
linux-nvme, linux-kernel
On Wed 2026-02-25 18:37:44 -0800, Randy Jennings wrote:
> On Fri, Feb 13, 2026 at 8:28 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > A host that has more than one path connecting to an nvme subsystem
> > typically has an nvme controller associated with every path. This is
> > mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> > path should not be retried immediately on another path because this
> > could lead to data corruption as described in TP4129. TP8028 defines
> > cross-controller reset mechanism that can be used by host to terminate
> > IOs on the failed path using one of the remaining healthy paths. Only
> > after IOs are terminated, or long enough time passes as defined by
> > TP4129, inflight IOs should be retried on another path. Implement core
> > cross-controller reset shared logic to be used by the transports.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > + if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > + ret = -ETIMEDOUT;
> > + goto out;
> > + }
> The more I look at this, the less I can ignore that this tmo should be
> capped by deadline - now.
I updated nvme_issue_wait_ccr() to do that.
>
> > +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > + deadline = now + msecs_to_jiffies(timeout);
> > + while (time_before(now, deadline)) {
> ...
> > + ret = nvme_issue_wait_ccr(sctrl, ictrl);
> ...
> > + }
> Sincerely,
> Randy Jennings
* RE: [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (20 preceding siblings ...)
2026-02-14 4:25 ` [PATCH v3 21/21] nvme-fc: " Mohamed Khalfella
@ 2026-04-01 13:33 ` Achkinazi, Igor
2026-04-01 16:37 ` Mohamed Khalfella
21 siblings, 1 reply; 61+ messages in thread
From: Achkinazi, Igor @ 2026-04-01 13:33 UTC (permalink / raw)
To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, James Smart, Hannes Reinecke
Cc: Aaron Dailey, Randy Jennings, Dhaval Giani,
linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
Hi Mohamed,
We tested this patch v3 against Dell PowerFlex as the NVMe over TCP target.
Below is a summary of our test methodology and results.
Test Environment
----------------
- Target: Dell PowerFlex with NVMe over TCP with engineering code
- Host: Standard Linux host with the patch
- IO: vdbench with data integrity (DI) validation
Test Scenarios and Results
--------------------------
1) Target supports CQT + CCR -- without the patch (baseline)
vdbench DI validation FAILED. Data integrity errors observed.
2) Target supports CQT + CCR -- with the patch applied
Keep Alive timeout triggers controller fencing. CCR is issued
via the surviving controller and succeeds. Controller reconnects
and vdbench DI validation PASSES with no data integrity errors.
Kernel log:
nvme nvme4: I/O tag 1 (b001) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme4: starting controller fencing
nvme nvme4: attempting CCR, timeout 15000ms
nvme nvme4: CCR succeeded using nvme3
nvme nvme4: failed nvme_keep_alive_end_io error=10
nvme nvme4: Reconnecting in 10 seconds...
nvme nvme4: creating 1 I/O queues.
nvme nvme4: mapped 1/0/0 default/read/poll queues.
nvme nvme4: Successfully reconnected (attempt 1/60)
3) Target supports CQT only (no CCR) -- with the patch applied
CCR fails as expected, patch falls back to TP4129 time-based
recovery. Controller reconnects after the recovery timer expires
and vdbench DI validation PASSES with no data integrity errors.
Kernel log:
nvme nvme4: I/O tag 0 (9000) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme4: starting controller fencing
nvme nvme4: attempting CCR, timeout 15000ms
nvme nvme4: CCR failed, switch to time-based recovery, timeout = 15000ms
nvme nvme4: failed nvme_keep_alive_end_io error=5
nvme nvme4: Time-based recovery finished
nvme nvme4: Reconnecting in 10 seconds...
nvme nvme4: creating 1 I/O queues.
nvme nvme4: mapped 1/0/0 default/read/poll queues.
nvme nvme4: Successfully reconnected (attempt 1/60)
4) Target supports neither CQT nor CCR -- with the patch applied
vdbench DI validation FAILED. Data integrity errors observed.
This is expected because without CQT the host has no safe hold
period and inflight IO may be retried prematurely.
Additional Targeted Tests
-------------------------
All of the following passed on a PowerFlex target with CQT + CCR.
Simple CCR:
- 2 controllers, one times out. CCR issued via surviving controller.
CCR log page entry created and read by the host successfully.
Kernel log:
nvme nvme3: I/O tag 0 (2000) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme3: starting controller fencing
nvme nvme3: attempting CCR, timeout 15000ms
nvme nvme3: CCR succeeded using nvme4
Multi-Controller / Scale:
- 3+ controllers with multiple simultaneous CCRs. Controller A
resets B and C concurrently. Both entries tracked correctly,
completions trigger coalesced AEN.
- Cross-CCR: Controller A resets B while B resets A simultaneously.
Both operations proceed correctly.
- 4 controllers, CCR Limit set to 2, 3 controllers timed out.
2 CCRs issued, 3rd controller defaults to TP4129 time-based
recovery as expected.
AEN (Async Event Notification):
- AEN delivered on CCR completion with NVME_ASYNC_EVENT_CCR_CHANGED.
- AEN re-arm verified: after reading CCR log page (clearing
AEN_CCR_PENDING), another CCR triggers a new AEN.
Identify Controller:
- CIU is non-zero, CIRN is populated, CCRL = 4.
- CIU/CIRN values change after disconnect/reconnect (new instance).
CCR Log Page:
- After successful CCR, log page fields verified: ICID, CIU, ACID,
CCRS all populated correctly.
Summary
-------
The patch v3 works correctly in all tested scenarios. CCR recovery
functions as designed when the target supports it, and the TP4129
fallback path operates correctly when CCR is unavailable. Data
integrity is preserved in all supported configurations.
PowerFlex was running engineering code, not production code.
Tested-by: Igor Achkinazi <igor.achkinazi@dell.com>
* Re: [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery
2026-04-01 13:33 ` [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Achkinazi, Igor
@ 2026-04-01 16:37 ` Mohamed Khalfella
0 siblings, 0 replies; 61+ messages in thread
From: Mohamed Khalfella @ 2026-04-01 16:37 UTC (permalink / raw)
To: Achkinazi, Igor
Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
James Smart, Hannes Reinecke, Aaron Dailey, Randy Jennings,
Dhaval Giani, linux-nvme@lists.infradead.org,
linux-kernel@vger.kernel.org
On Wed 2026-04-01 13:33:50 +0000, Achkinazi, Igor wrote:
> Hi Mohamed,
>
> We tested this patch v3 against Dell PowerFlex as the NVMe over TCP target.
> Below is a summary of our test methodology and results.
>
Thank you for testing v3 of this patchset. Can you please test v4 [1]?
One thing to note on v4 is that a bug has been fixed in the CCR nvme
command layout, so you may need to make a similar change on the target
side.
1- https://lore.kernel.org/all/20260328004518.1729186-1-mkhalfella@purestorage.com/
end of thread, other threads:[~2026-04-01 16:37 UTC | newest]
Thread overview: 61+ messages
2026-02-14 4:25 [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 01/21] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 02/21] nvmet/debugfs: Export controller CIU and CIRN via debugfs Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 03/21] nvmet: Implement CCR nvme command Mohamed Khalfella
2026-02-27 16:30 ` Maurizio Lombardi
2026-03-25 18:52 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 04/21] nvmet: Implement CCR logpage Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 05/21] nvmet: Send an AEN on CCR completion Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 06/21] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 07/21] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
2026-02-16 12:33 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 08/21] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2026-02-16 12:41 ` Hannes Reinecke
2026-02-17 18:35 ` Mohamed Khalfella
2026-02-26 2:37 ` Randy Jennings
2026-03-27 18:33 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 09/21] nvme: Implement cross-controller reset completion Mohamed Khalfella
2026-02-16 12:43 ` Hannes Reinecke
2026-02-17 18:25 ` Mohamed Khalfella
2026-02-18 7:51 ` Hannes Reinecke
2026-02-18 12:47 ` Mohamed Khalfella
2026-02-20 3:34 ` Randy Jennings
2026-02-14 4:25 ` [PATCH v3 10/21] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
2026-02-16 12:47 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 11/21] nvme-rdma: " Mohamed Khalfella
2026-02-16 12:47 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 12/21] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
2026-02-28 0:12 ` James Smart
2026-03-26 2:37 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 13/21] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
2026-02-28 1:03 ` James Smart
2026-03-26 17:40 ` Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 14/21] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
2026-02-27 2:49 ` Randy Jennings
2026-02-28 1:10 ` James Smart
2026-02-14 4:25 ` [PATCH v3 15/21] nvme-fc: Do not cancel requests in io taget before it is initialized Mohamed Khalfella
2026-02-28 1:12 ` James Smart
2026-02-14 4:25 ` [PATCH v3 16/21] nvmet: Add support for CQT to nvme target Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 17/21] nvme: Add support for CQT to nvme host Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT Mohamed Khalfella
2026-02-16 12:54 ` Hannes Reinecke
2026-02-16 18:45 ` Mohamed Khalfella
2026-02-17 7:09 ` Hannes Reinecke
2026-02-17 15:35 ` Mohamed Khalfella
2026-02-20 1:22 ` James Smart
2026-02-20 2:11 ` Randy Jennings
2026-02-20 7:23 ` Hannes Reinecke
2026-02-20 2:01 ` Randy Jennings
2026-02-20 7:25 ` Hannes Reinecke
2026-02-27 3:05 ` Randy Jennings
2026-03-02 7:32 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure Mohamed Khalfella
2026-02-16 12:56 ` Hannes Reinecke
2026-02-17 17:58 ` Mohamed Khalfella
2026-02-18 8:26 ` Hannes Reinecke
2026-02-14 4:25 ` [PATCH v3 20/21] nvme-rdma: " Mohamed Khalfella
2026-02-14 4:25 ` [PATCH v3 21/21] nvme-fc: " Mohamed Khalfella
2026-02-28 1:20 ` James Smart
2026-03-25 19:07 ` Mohamed Khalfella
2026-04-01 13:33 ` [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery Achkinazi, Igor
2026-04-01 16:37 ` Mohamed Khalfella