* [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery
@ 2025-11-26 2:11 Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
` (13 more replies)
0 siblings, 14 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
This patchset adds support for TP8028 Rapid Path Failure Recovery for
both the nvme target and initiator. Rapid Path Failure Recovery brings
Cross-Controller Reset (CCR) functionality to nvme. It allows an nvme
host to send an nvme command to a source nvme controller to reset an
impacted nvme controller, provided that both the source and impacted
controllers are in the same nvme subsystem.
The main use of CCR is when one path to an nvme subsystem fails.
Inflight IOs on the impacted nvme controller need to be terminated
before they can be retried on another path; otherwise data corruption
may happen. CCR provides a quick way to terminate these IOs on the
unreachable nvme controller, allowing recovery to move quickly and
avoiding unnecessary delays. In case CCR is not possible, inflight
requests are held for the duration defined by TP4129 KATO Corrections
and Clarifications before they are allowed to be retried.
On the target side:
- New struct members have been added to support CCR. struct nvme_id_ctrl
has been updated with CIU (Controller Instance Uniquifier), CIRN
(Controller Instance Random Number), and CQT (Command Quiesce Time).
The combination of CIU, CNTLID, and CIRN is used to identify the
impacted controller in the CCR command.
- The CCR nvme command implemented on the target causes the impacted
controller to fail and drop its connections to the host.
- The CCR logpage contains the status of pending CCR requests. An entry
is added to the logpage after a CCR request is validated. Completed CCR
requests are removed from the logpage when the controller becomes ready
or when requested in the get logpage command.
- An AEN is sent when CCR completes to let the host know that it is safe
to retry inflight requests.
On the host side:
- CIU, CIRN, and CQT have been added to struct nvme_ctrl. CIU and CIRN
have been added to sysfs to make the values visible to the user. CIU
and CIRN can be used to construct and manually send admin-passthru CCR
commands.
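For reference, a manually constructed CCR command might look like the
sketch below. It assumes the dword layout of struct
nvme_cross_ctrl_reset_cmd from this series (CIU in bits 15:8 of CDW10,
ICID in bits 31:16 of CDW10, CIRN split across CDW12/CDW13); the
example values, the device path, and the nvme-cli invocation are
illustrative assumptions, not output from real hardware.

```shell
# Example values standing in for sysfs reads; real values would come
# from the CIU/CIRN attributes of the impacted controller.
ICID=5                      # impacted controller's CNTLID
CIU=0x2a                    # Controller Instance Uniquifier
CIRN=0x1122334455667788     # Controller Instance Random Number

# Pack the command dwords per struct nvme_cross_ctrl_reset_cmd.
CDW10=$(( (ICID << 16) | (CIU << 8) ))
CDW12=$(( CIRN & 0xffffffff ))
CDW13=$(( CIRN >> 32 ))
echo "$CDW10 $CDW12 $CDW13"

# The actual submission would then be (hypothetical device path):
# nvme admin-passthru /dev/nvme0 --opcode=0x38 \
#     --cdw10=$CDW10 --cdw12=$CDW12 --cdw13=$CDW13
```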
- A new controller state NVME_CTRL_RECOVERING has been added to prevent
cancelling timed-out inflight requests while CCR is in progress. The
controller flag NVME_CTRL_RECOVERED was also added to signal the end of
time-based recovery.
- Controller recovery in nvme_recover_ctrl() is invoked when a LIVE
controller hits an error or when a request times out. CCR is attempted
to reset the impacted controller.
- The nvme fabric transports nvme-tcp, nvme-rdma, and nvme-fc have been
updated to use CCR recovery.
Ideally all inflight requests should be held during controller recovery
and only retried after recovery is done. However, there are known
situations where that is not the case in this implementation. These
gaps will be addressed in future patches:
- A manual controller reset from sysfs will result in the controller
going to RESETTING state and all inflight requests being canceled
immediately and possibly retried on another path.
- A manual controller delete from sysfs will also result in all
inflight requests being canceled immediately and possibly retried on
another path.
- In nvme-fc the nvme controller will be deleted if the remote port
disappears and no timeout is specified. This results in immediate
cancellation of requests that may be retried on another path.
- In nvme-rdma, if the HCA is removed, all nvme controllers will be
deleted. This results in inflight IOs being canceled and possibly
retried on another path.
- In nvme-fc, if the controller is LIVE and an IO ends with an error
from the LLDD, only this IO will be completed immediately. However,
the rest of the inflight IOs will be held correctly because the
controller will have transitioned to the RECOVERING state.
Mohamed Khalfella (14):
nvmet: Rapid Path Failure Recovery set controller identify fields
nvmet/debugfs: Add ctrl uniquifier and random values
nvmet: Implement CCR nvme command
nvmet: Implement CCR logpage
nvmet: Send an AEN on CCR completion
nvme: Rapid Path Failure Recovery read controller identify fields
nvme: Add RECOVERING nvme controller state
nvme: Implement cross-controller reset recovery
nvme: Implement cross-controller reset completion
nvme-tcp: Use CCR to recover controller that hits an error
nvme-rdma: Use CCR to recover controller that hits an error
nvme-fc: Decouple error recovery from controller reset
nvme-fc: Use CCR to recover controller that hits an error
nvme-fc: Hold inflight requests while in RECOVERING state
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 197 +++++++++++++++++++++++++++++++-
drivers/nvme/host/fc.c | 194 ++++++++++++++++++++-----------
drivers/nvme/host/nvme.h | 24 ++++
drivers/nvme/host/rdma.c | 51 +++++++--
drivers/nvme/host/sysfs.c | 24 ++++
drivers/nvme/host/tcp.c | 52 +++++++--
drivers/nvme/target/admin-cmd.c | 127 ++++++++++++++++++++
drivers/nvme/target/core.c | 103 ++++++++++++++++-
drivers/nvme/target/debugfs.c | 21 ++++
drivers/nvme/target/nvmet.h | 18 ++-
include/linux/nvme.h | 57 ++++++++-
12 files changed, 778 insertions(+), 91 deletions(-)
base-commit: fd95357fd8c6778ac7dea6c57a19b8b182b6e91f
--
2.51.2
^ permalink raw reply [flat|nested] 68+ messages in thread
* [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-16 1:35 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
` (12 subsequent siblings)
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
TP8028 Rapid Path Failure Recovery defined new fields in the controller
identify response. The newly defined fields are:
- CIU (Controller Instance Uniquifier): an 8-bit non-zero value that is
assigned a random value when the controller is first created. The value
is expected to be incremented when the RDY bit in the CSTS register is
asserted.
- CIRN (Controller Instance Random Number): a 64-bit random value that
is generated when the controller is created. CIRN is regenerated every
time the RDY bit in the CSTS register is asserted.
- CCRL (Cross-Controller Reset Limit): an 8-bit value that defines the
maximum number of in-progress controller reset operations. CCRL is
hardcoded to 4 as recommended by TP8028.
TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
Time), which is used along with KATO (Keep Alive Timeout) to set an
upper time limit for attempting Cross-Controller Recovery.
Make the new fields available for IO controllers only since TP8028 is
not very useful for discovery controllers.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 6 ++++++
drivers/nvme/target/core.c | 9 +++++++++
drivers/nvme/target/nvmet.h | 2 ++
include/linux/nvme.h | 15 ++++++++++++---
4 files changed, 29 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 3e378153a781..aaceb697e4d2 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -696,6 +696,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
id->cntlid = cpu_to_le16(ctrl->cntlid);
id->ver = cpu_to_le32(ctrl->subsys->ver);
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ id->cqt = NVMF_CQT_MS;
+ id->ciu = ctrl->uniquifier;
+ id->cirn = cpu_to_le64(ctrl->random);
+ id->ccrl = NVMF_CCR_LIMIT;
+ }
/* XXX: figure out what to do about RTD3R/RTD3 */
id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 5d7d483bfbe3..409928202503 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1393,6 +1393,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
return;
}
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ ctrl->uniquifier = ((u8)(ctrl->uniquifier + 1)) ? : 1;
+ ctrl->random = get_random_u64();
+ }
ctrl->csts = NVME_CSTS_RDY;
/*
@@ -1662,6 +1666,11 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
ctrl->cntlid = ret;
+ if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+ ctrl->uniquifier = get_random_u8() ? : 1;
+ ctrl->random = get_random_u64();
+ }
+
/*
* Discovery controllers may use some arbitrary high value
* in order to cleanup stale discovery sessions
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 51df72f5e89b..4195c9eff1da 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -263,7 +263,9 @@ struct nvmet_ctrl {
uuid_t hostid;
u16 cntlid;
+ u8 uniquifier;
u32 kato;
+ u64 random;
struct nvmet_port *port;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 655d194f8e72..5135cdc3c120 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -21,6 +21,9 @@
#define NVMF_TRADDR_SIZE 256
#define NVMF_TSAS_SIZE 256
+#define NVMF_CQT_MS 0
+#define NVMF_CCR_LIMIT 4
+
#define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"
#define NVME_NSID_ALL 0xffffffff
@@ -328,7 +331,10 @@ struct nvme_id_ctrl {
__le16 crdt1;
__le16 crdt2;
__le16 crdt3;
- __u8 rsvd134[122];
+ __u8 rsvd134[1];
+ __u8 ciu;
+ __le64 cirn;
+ __u8 rsvd144[112];
__le16 oacs;
__u8 acl;
__u8 aerl;
@@ -362,7 +368,9 @@ struct nvme_id_ctrl {
__u8 anacap;
__le32 anagrpmax;
__le32 nanagrpid;
- __u8 rsvd352[160];
+ __u8 rsvd352[34];
+ __le16 cqt;
+ __u8 rsvd388[124];
__u8 sqes;
__u8 cqes;
__le16 maxcmd;
@@ -389,7 +397,8 @@ struct nvme_id_ctrl {
__u8 msdbd;
__u8 rsvd1804[2];
__u8 dctype;
- __u8 rsvd1807[241];
+ __u8 ccrl;
+ __u8 rsvd1808[240];
struct nvme_id_power_state psd[32];
__u8 vs[1024];
};
--
2.51.2
* [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-16 1:43 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
` (11 subsequent siblings)
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
Export ctrl->random and ctrl->uniquifier as debugfs files under
controller debugfs directory.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/debugfs.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/drivers/nvme/target/debugfs.c b/drivers/nvme/target/debugfs.c
index 5dcbd5aa86e1..c983b1776ab8 100644
--- a/drivers/nvme/target/debugfs.c
+++ b/drivers/nvme/target/debugfs.c
@@ -152,6 +152,23 @@ static int nvmet_ctrl_tls_concat_show(struct seq_file *m, void *p)
}
NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_concat);
#endif
+static int nvmet_ctrl_instance_uniquifier_show(struct seq_file *m, void *p)
+{
+ struct nvmet_ctrl *ctrl = m->private;
+
+ seq_printf(m, "%02x\n", ctrl->uniquifier);
+ return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_uniquifier);
+
+static int nvmet_ctrl_instance_random_show(struct seq_file *m, void *p)
+{
+ struct nvmet_ctrl *ctrl = m->private;
+
+ seq_printf(m, "%016llx\n", ctrl->random);
+ return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_random);
int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
{
@@ -184,6 +201,10 @@ int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
debugfs_create_file("tls_key", S_IRUSR, ctrl->debugfs_dir, ctrl,
&nvmet_ctrl_tls_key_fops);
#endif
+ debugfs_create_file("uniquifier", S_IRUSR, ctrl->debugfs_dir, ctrl,
+ &nvmet_ctrl_instance_uniquifier_fops);
+ debugfs_create_file("random", S_IRUSR, ctrl->debugfs_dir, ctrl,
+ &nvmet_ctrl_instance_random_fops);
return 0;
}
--
2.51.2
* [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-16 3:01 ` Randy Jennings
2025-12-25 13:14 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
` (10 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
Defined by TP8028 Rapid Path Failure Recovery, the CCR
(Cross-Controller Reset) command is an nvme command that is issued to
the source controller by the initiator to reset the impacted
controller. Implement the CCR command for the Linux nvme target.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 79 +++++++++++++++++++++++++++++++++
drivers/nvme/target/core.c | 69 ++++++++++++++++++++++++++++
drivers/nvme/target/nvmet.h | 13 ++++++
include/linux/nvme.h | 23 ++++++++++
4 files changed, 184 insertions(+)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index aaceb697e4d2..a55ca010d34f 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
log->acs[nvme_admin_get_features] =
log->acs[nvme_admin_async_event] =
log->acs[nvme_admin_keep_alive] =
+ log->acs[nvme_admin_cross_ctrl_reset] =
cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
+
}
static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
@@ -1615,6 +1617,80 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
nvmet_req_complete(req, status);
}
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
+{
+ struct nvmet_ctrl *ictrl, *ctrl = req->sq->ctrl;
+ struct nvme_command *cmd = req->cmd;
+ struct nvmet_ccr *ccr, *new_ccr;
+ int ccr_active, ccr_total;
+ u16 cntlid, status = 0;
+
+ cntlid = le16_to_cpu(cmd->ccr.icid);
+ if (ctrl->cntlid == cntlid) {
+ req->error_loc =
+ offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
+ status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+ goto out;
+ }
+
+ ictrl = nvmet_ctrl_find_get_ccr(ctrl->subsys, ctrl->hostnqn,
+ cmd->ccr.ciu, cntlid,
+ le64_to_cpu(cmd->ccr.cirn));
+ if (!ictrl) {
+ /* Immediate Reset Successful */
+ nvmet_set_result(req, 1);
+ status = NVME_SC_SUCCESS;
+ goto out;
+ }
+
+ new_ccr = kmalloc(sizeof(*ccr), GFP_KERNEL);
+ if (!new_ccr) {
+ status = NVME_SC_INTERNAL;
+ goto out_put_ctrl;
+ }
+
+ ccr_total = ccr_active = 0;
+ mutex_lock(&ctrl->lock);
+ list_for_each_entry(ccr, &ctrl->ccrs, entry) {
+ if (ccr->ctrl == ictrl) {
+ status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
+ goto out_unlock;
+ }
+
+ ccr_total++;
+ if (ccr->ctrl)
+ ccr_active++;
+ }
+
+ if (ccr_active >= NVMF_CCR_LIMIT) {
+ status = NVME_SC_CCR_LIMIT_EXCEEDED;
+ goto out_unlock;
+ }
+ if (ccr_total >= NVMF_CCR_PER_PAGE) {
+ status = NVME_SC_CCR_LOGPAGE_FULL;
+ goto out_unlock;
+ }
+
+ new_ccr->ciu = cmd->ccr.ciu;
+ new_ccr->icid = cntlid;
+ new_ccr->ctrl = ictrl;
+ list_add_tail(&new_ccr->entry, &ctrl->ccrs);
+ mutex_unlock(&ctrl->lock);
+
+ nvmet_ctrl_fatal_error(ictrl);
+ nvmet_ctrl_put(ictrl);
+ nvmet_req_complete(req, 0);
+ return;
+
+out_unlock:
+ mutex_unlock(&ctrl->lock);
+ kfree(new_ccr);
+out_put_ctrl:
+ nvmet_ctrl_put(ictrl);
+out:
+ nvmet_req_complete(req, status);
+}
+
u32 nvmet_admin_cmd_data_len(struct nvmet_req *req)
{
struct nvme_command *cmd = req->cmd;
@@ -1692,6 +1768,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
case nvme_admin_keep_alive:
req->execute = nvmet_execute_keep_alive;
return 0;
+ case nvme_admin_cross_ctrl_reset:
+ req->execute = nvmet_execute_cross_ctrl_reset;
+ return 0;
default:
return nvmet_report_invalid_opcode(req);
}
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 409928202503..7dbe9255ff42 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -114,6 +114,20 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
return 0;
}
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all)
+{
+ struct nvmet_ccr *ccr, *tmp;
+
+ lockdep_assert_held(&ctrl->lock);
+
+ list_for_each_entry_safe(ccr, tmp, &ctrl->ccrs, entry) {
+ if (all || ccr->ctrl == NULL) {
+ list_del(&ccr->entry);
+ kfree(ccr);
+ }
+ }
+}
+
static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
{
struct nvmet_ns *cur;
@@ -1396,6 +1410,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
if (!nvmet_is_disc_subsys(ctrl->subsys)) {
ctrl->uniquifier = ((u8)(ctrl->uniquifier + 1)) ? : 1;
ctrl->random = get_random_u64();
+ nvmet_ctrl_cleanup_ccrs(ctrl, false);
}
ctrl->csts = NVME_CSTS_RDY;
@@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
return ctrl;
}
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+ const char *hostnqn, u8 ciu,
+ u16 cntlid, u64 cirn)
+{
+ struct nvmet_ctrl *ctrl;
+ bool found = false;
+
+ mutex_lock(&subsys->lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->cntlid != cntlid)
+ continue;
+ if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
+ continue;
+
+ /* Avoid racing with a controller that is becoming ready */
+ mutex_lock(&ctrl->lock);
+ if (ctrl->uniquifier == ciu && ctrl->random == cirn)
+ found = true;
+ mutex_unlock(&ctrl->lock);
+
+ if (found) {
+ if (kref_get_unless_zero(&ctrl->ref))
+ goto out;
+ break;
+ }
+ };
+ ctrl = NULL;
+out:
+ mutex_unlock(&subsys->lock);
+ return ctrl;
+}
+
u16 nvmet_check_ctrl_status(struct nvmet_req *req)
{
if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) {
@@ -1626,6 +1673,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
subsys->clear_ids = 1;
#endif
+ INIT_LIST_HEAD(&ctrl->ccrs);
INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
INIT_LIST_HEAD(&ctrl->async_events);
INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
@@ -1740,12 +1788,33 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
+static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
+{
+ struct nvmet_subsys *subsys = ctrl->subsys;
+ struct nvmet_ctrl *sctrl;
+ struct nvmet_ccr *ccr;
+
+ mutex_lock(&ctrl->lock);
+ nvmet_ctrl_cleanup_ccrs(ctrl, true);
+ mutex_unlock(&ctrl->lock);
+
+ list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
+ mutex_lock(&sctrl->lock);
+ list_for_each_entry(ccr, &sctrl->ccrs, entry) {
+ if (ccr->ctrl == ctrl)
+ ccr->ctrl = NULL;
+ }
+ mutex_unlock(&sctrl->lock);
+ }
+}
+
static void nvmet_ctrl_free(struct kref *ref)
{
struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
struct nvmet_subsys *subsys = ctrl->subsys;
mutex_lock(&subsys->lock);
+ nvmet_ctrl_complete_pending_ccr(ctrl);
nvmet_ctrl_destroy_pr(ctrl);
nvmet_release_p2p_ns_map(ctrl);
list_del(&ctrl->subsys_entry);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 4195c9eff1da..6c0091b8af8b 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -267,6 +267,7 @@ struct nvmet_ctrl {
u32 kato;
u64 random;
+ struct list_head ccrs;
struct nvmet_port *port;
u32 aen_enabled;
@@ -314,6 +315,13 @@ struct nvmet_ctrl {
struct nvmet_pr_log_mgr pr_log_mgr;
};
+struct nvmet_ccr {
+ struct nvmet_ctrl *ctrl;
+ struct list_head entry;
+ u16 icid;
+ u8 ciu;
+};
+
struct nvmet_subsys {
enum nvme_subsys_type type;
@@ -576,6 +584,7 @@ void nvmet_req_free_sgls(struct nvmet_req *req);
void nvmet_execute_set_features(struct nvmet_req *req);
void nvmet_execute_get_features(struct nvmet_req *req);
void nvmet_execute_keep_alive(struct nvmet_req *req);
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req);
u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
u16 nvmet_check_io_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
@@ -618,6 +627,10 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args);
struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
const char *hostnqn, u16 cntlid,
struct nvmet_req *req);
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+ const char *hostnqn, u8 ciu,
+ u16 cntlid, u64 cirn);
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all);
void nvmet_ctrl_put(struct nvmet_ctrl *ctrl);
u16 nvmet_check_ctrl_status(struct nvmet_req *req);
ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl,
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 5135cdc3c120..0f305b317aa3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -23,6 +23,7 @@
#define NVMF_CQT_MS 0
#define NVMF_CCR_LIMIT 4
+#define NVMF_CCR_PER_PAGE 511
#define NVME_DISC_SUBSYS_NAME "nqn.2014-08.org.nvmexpress.discovery"
@@ -1225,6 +1226,22 @@ struct nvme_zone_mgmt_recv_cmd {
__le32 cdw14[2];
};
+struct nvme_cross_ctrl_reset_cmd {
+ __u8 opcode;
+ __u8 flags;
+ __u16 command_id;
+ __le32 nsid;
+ __le64 rsvd2[2];
+ union nvme_data_ptr dptr;
+ __u8 rsvd10;
+ __u8 ciu;
+ __le16 icid;
+ __le32 cdw11;
+ __le64 cirn;
+ __le32 cdw14;
+ __le32 cdw15;
+};
+
struct nvme_io_mgmt_recv_cmd {
__u8 opcode;
__u8 flags;
@@ -1323,6 +1340,7 @@ enum nvme_admin_opcode {
nvme_admin_virtual_mgmt = 0x1c,
nvme_admin_nvme_mi_send = 0x1d,
nvme_admin_nvme_mi_recv = 0x1e,
+ nvme_admin_cross_ctrl_reset = 0x38,
nvme_admin_dbbuf = 0x7C,
nvme_admin_format_nvm = 0x80,
nvme_admin_security_send = 0x81,
@@ -1356,6 +1374,7 @@ enum nvme_admin_opcode {
nvme_admin_opcode_name(nvme_admin_virtual_mgmt), \
nvme_admin_opcode_name(nvme_admin_nvme_mi_send), \
nvme_admin_opcode_name(nvme_admin_nvme_mi_recv), \
+ nvme_admin_opcode_name(nvme_admin_cross_ctrl_reset), \
nvme_admin_opcode_name(nvme_admin_dbbuf), \
nvme_admin_opcode_name(nvme_admin_format_nvm), \
nvme_admin_opcode_name(nvme_admin_security_send), \
@@ -2009,6 +2028,7 @@ struct nvme_command {
struct nvme_dbbuf dbbuf;
struct nvme_directive_cmd directive;
struct nvme_io_mgmt_recv_cmd imr;
+ struct nvme_cross_ctrl_reset_cmd ccr;
};
};
@@ -2173,6 +2193,9 @@ enum {
NVME_SC_PMR_SAN_PROHIBITED = 0x123,
NVME_SC_ANA_GROUP_ID_INVALID = 0x124,
NVME_SC_ANA_ATTACH_FAILED = 0x125,
+ NVME_SC_CCR_IN_PROGRESS = 0x13f,
+ NVME_SC_CCR_LOGPAGE_FULL = 0x140,
+ NVME_SC_CCR_LIMIT_EXCEEDED = 0x141,
/*
* I/O Command Set Specific - NVM commands:
--
2.51.2
* [RFC PATCH 04/14] nvmet: Implement CCR logpage
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (2 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-16 3:11 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
` (9 subsequent siblings)
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
Defined by TP8028 Rapid Path Failure Recovery, the CCR
(Cross-Controller Reset) log page contains an entry for each CCR
request submitted to the source controller. Implement the CCR logpage
for the Linux nvme target.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/admin-cmd.c | 42 +++++++++++++++++++++++++++++++++
include/linux/nvme.h | 16 +++++++++++++
2 files changed, 58 insertions(+)
diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index a55ca010d34f..d2892354bf81 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -220,6 +220,7 @@ static void nvmet_execute_get_supported_log_pages(struct nvmet_req *req)
logs->lids[NVME_LOG_FEATURES] = cpu_to_le32(NVME_LIDS_LSUPP);
logs->lids[NVME_LOG_RMI] = cpu_to_le32(NVME_LIDS_LSUPP);
logs->lids[NVME_LOG_RESERVATION] = cpu_to_le32(NVME_LIDS_LSUPP);
+ logs->lids[NVME_LOG_CCR] = cpu_to_le32(NVME_LIDS_LSUPP);
status = nvmet_copy_to_sgl(req, 0, logs, sizeof(*logs));
kfree(logs);
@@ -608,6 +609,45 @@ static void nvmet_execute_get_log_page_features(struct nvmet_req *req)
nvmet_req_complete(req, status);
}
+static void nvmet_execute_get_log_page_ccr(struct nvmet_req *req)
+{
+ struct nvmet_ctrl *ctrl = req->sq->ctrl;
+ struct nvmet_ccr *ccr;
+ struct nvme_ccr_log *log;
+ int index = 0;
+ u16 status;
+
+ log = kzalloc(sizeof(*log), GFP_KERNEL);
+ if (!log) {
+ status = NVME_SC_INTERNAL;
+ goto out;
+ }
+
+ mutex_lock(&ctrl->lock);
+ list_for_each_entry(ccr, &ctrl->ccrs, entry) {
+ log->entries[index].icid = cpu_to_le16(ccr->icid);
+ log->entries[index].ciu = ccr->ciu;
+ log->entries[index].acid = cpu_to_le16(0xffff);
+
+ /* If ccr->ctrl is NULL then we know reset succeeded */
+ log->entries[index].ccrs = ccr->ctrl ? 0x00 : 0x01;
+ log->entries[index].ccrf = 0x03; /* Validated and Initiated */
+ index++;
+ }
+
+ /* Cleanup completed CCRs if requested */
+ if (req->cmd->get_log_page.lsp & 0x1)
+ nvmet_ctrl_cleanup_ccrs(ctrl, false);
+ mutex_unlock(&ctrl->lock);
+
+ log->ne = cpu_to_le16(index);
+ nvmet_clear_aen_bit(req, NVME_AEN_BIT_CCR_COMPLETE);
+ status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log));
+ kfree(log);
+out:
+ nvmet_req_complete(req, status);
+}
+
static void nvmet_execute_get_log_page(struct nvmet_req *req)
{
if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd)))
@@ -641,6 +681,8 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req)
return nvmet_execute_get_log_page_rmi(req);
case NVME_LOG_RESERVATION:
return nvmet_execute_get_log_page_resv(req);
+ case NVME_LOG_CCR:
+ return nvmet_execute_get_log_page_ccr(req);
}
pr_debug("unhandled lid %d on qid %d\n",
req->cmd->get_log_page.lid, req->sq->qid);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 0f305b317aa3..d51883122d65 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1435,6 +1435,7 @@ enum {
NVME_LOG_FDP_CONFIGS = 0x20,
NVME_LOG_DISC = 0x70,
NVME_LOG_RESERVATION = 0x80,
+ NVME_LOG_CCR = 0x1E,
NVME_FWACT_REPL = (0 << 3),
NVME_FWACT_REPL_ACTV = (1 << 3),
NVME_FWACT_ACTV = (2 << 3),
@@ -1458,6 +1459,21 @@ enum {
NVME_FIS_CSCPE = 1 << 21,
};
+struct nvme_ccr_log_entry {
+ __le16 icid;
+ __u8 ciu;
+ __u8 rsvd3;
+ __le16 acid;
+ __u8 ccrs;
+ __u8 ccrf;
+};
+
+struct nvme_ccr_log {
+ __le16 ne;
+ __u8 rsvd2[6];
+ struct nvme_ccr_log_entry entries[NVMF_CCR_PER_PAGE];
+};
+
/* NVMe Namespace Write Protect State */
enum {
NVME_NS_NO_WRITE_PROTECT = 0,
--
2.51.2
* [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (3 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-16 3:31 ` Randy Jennings
2025-12-25 13:23 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
` (8 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
Send an AEN to the initiator when the impacted controller exits. The
notification points to the CCR log page that the initiator can read to
check which CCR operation completed.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
drivers/nvme/target/nvmet.h | 3 ++-
include/linux/nvme.h | 3 +++
3 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 7dbe9255ff42..60173833c3eb 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
nvmet_async_events_process(ctrl);
}
-void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
u8 event_info, u8 log_page)
{
struct nvmet_async_event *aen;
@@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
aen->event_info = event_info;
aen->log_page = log_page;
- mutex_lock(&ctrl->lock);
list_add_tail(&aen->entry, &ctrl->async_events);
- mutex_unlock(&ctrl->lock);
queue_work(nvmet_wq, &ctrl->async_event_work);
}
+void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+ u8 event_info, u8 log_page)
+{
+ mutex_lock(&ctrl->lock);
+ nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
+ mutex_unlock(&ctrl->lock);
+}
static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
{
@@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
}
EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
+static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
+{
+ lockdep_assert_held(&ctrl->lock);
+
+ if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
+ return;
+
+ nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
+ NVME_AER_NOTICE_CCR_COMPLETED,
+ NVME_LOG_CCR);
+}
+
static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
{
struct nvmet_subsys *subsys = ctrl->subsys;
@@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
mutex_lock(&sctrl->lock);
list_for_each_entry(ccr, &sctrl->ccrs, entry) {
- if (ccr->ctrl == ctrl)
+ if (ccr->ctrl == ctrl) {
+ nvmet_ctrl_notify_ccr(sctrl);
ccr->ctrl = NULL;
+ }
}
mutex_unlock(&sctrl->lock);
}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 6c0091b8af8b..7ebcef13be2b 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -44,7 +44,8 @@
* Supported optional AENs:
*/
#define NVMET_AEN_CFG_OPTIONAL \
- (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE)
+ (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE | \
+ NVME_AEN_CFG_CCR_COMPLETE)
#define NVMET_DISC_AEN_CFG_OPTIONAL \
(NVME_AEN_CFG_DISC_CHANGE)
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d51883122d65..a145417dccd3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -863,12 +863,14 @@ enum {
NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
NVME_AER_NOTICE_ANA = 0x03,
NVME_AER_NOTICE_DISC_CHANGED = 0xf0,
+ NVME_AER_NOTICE_CCR_COMPLETED = 0xf4,
};
enum {
NVME_AEN_BIT_NS_ATTR = 8,
NVME_AEN_BIT_FW_ACT = 9,
NVME_AEN_BIT_ANA_CHANGE = 11,
+ NVME_AEN_BIT_CCR_COMPLETE = 20,
NVME_AEN_BIT_DISC_CHANGE = 31,
};
@@ -876,6 +878,7 @@ enum {
NVME_AEN_CFG_NS_ATTR = 1 << NVME_AEN_BIT_NS_ATTR,
NVME_AEN_CFG_FW_ACT = 1 << NVME_AEN_BIT_FW_ACT,
NVME_AEN_CFG_ANA_CHANGE = 1 << NVME_AEN_BIT_ANA_CHANGE,
+ NVME_AEN_CFG_CCR_COMPLETE = 1 << NVME_AEN_BIT_CCR_COMPLETE,
NVME_AEN_CFG_DISC_CHANGE = 1 << NVME_AEN_BIT_DISC_CHANGE,
};
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (4 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-18 15:22 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
` (7 subsequent siblings)
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
TP8028 Rapid Path Failure Recovery added new fields to the controller
identify response. Read CIU (Controller Instance Uniquifier), CIRN
(Controller Instance Random Number), and CCRL (Cross-Controller Reset
Limit) from the controller identify response. Expose CIU and CIRN as
sysfs attributes so the values can be used directly by userspace if
needed.
TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
Time), which is used along with KATO (Keep Alive Timeout) to set an
upper limit for attempting Cross-Controller Recovery.
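The recovery window computed from KATO and CQT can be sketched as a
standalone helper mirroring the patch's nvme_recovery_timeout_ms(); the
function and parameter names here are illustrative, not part of the
patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the TP4129 recovery window: hold inflight IOs for 2x the
 * Keep Alive Timeout (3x when TBKAS, traffic-based keep alive, is in
 * effect) plus the controller's Command Quiesce Time. KATO is taken
 * in seconds and CQT in milliseconds, matching how the patch scales
 * ctrl->kato by 1000 before adding ctrl->cqt.
 */
static unsigned long recovery_timeout_ms(unsigned int kato_sec,
					 uint16_t cqt_ms, bool tbkas)
{
	unsigned long mult = tbkas ? 3 : 2;

	return mult * kato_sec * 1000UL + cqt_ms;
}
```

For example, with KATO = 5s, CQT = 200ms, and TBKAS off, the host must
hold IOs for 10200ms before retrying them on another path.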
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 5 +++++
drivers/nvme/host/nvme.h | 11 +++++++++++
drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
3 files changed, 39 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de73..aa007a7b9606 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3572,12 +3572,17 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->crdt[1] = le16_to_cpu(id->crdt2);
ctrl->crdt[2] = le16_to_cpu(id->crdt3);
+ ctrl->ciu = id->ciu;
+ ctrl->cirn = le64_to_cpu(id->cirn);
+ atomic_set(&ctrl->ccr_limit, id->ccrl);
+
ctrl->oacs = le16_to_cpu(id->oacs);
ctrl->oncs = le16_to_cpu(id->oncs);
ctrl->mtfa = le16_to_cpu(id->mtfa);
ctrl->oaes = le32_to_cpu(id->oaes);
ctrl->wctemp = le16_to_cpu(id->wctemp);
ctrl->cctemp = le16_to_cpu(id->cctemp);
+ ctrl->cqt = le16_to_cpu(id->cqt);
atomic_set(&ctrl->abort_limit, id->acl + 1);
ctrl->vwc = id->vwc;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c..5195a9abfadf 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -326,13 +326,17 @@ struct nvme_ctrl {
u32 max_zone_append;
#endif
u16 crdt[3];
+ u16 cqt;
u16 oncs;
u8 dmrl;
+ u8 ciu;
u32 dmrsl;
+ u64 cirn;
u16 oacs;
u16 sqsize;
u32 max_namespaces;
atomic_t abort_limit;
+ atomic_t ccr_limit;
u8 vwc;
u32 vs;
u32 sgls;
@@ -1218,4 +1222,11 @@ static inline bool nvme_multi_css(struct nvme_ctrl *ctrl)
return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
}
+static inline unsigned long nvme_recovery_timeout_ms(struct nvme_ctrl *ctrl)
+{
+ if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
+ return 3 * ctrl->kato * 1000 + ctrl->cqt;
+ return 2 * ctrl->kato * 1000 + ctrl->cqt;
+}
+
#endif /* _NVME_H */
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..ae36249ad61e 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
nvme_show_int_function(sqsize);
nvme_show_int_function(kato);
+static ssize_t nvme_sysfs_uniquifier_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%02x\n", ctrl->ciu);
+}
+static DEVICE_ATTR(uniquifier, S_IRUGO, nvme_sysfs_uniquifier_show, NULL);
+
+static ssize_t nvme_sysfs_random_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+ return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
+}
+static DEVICE_ATTR(random, S_IRUGO, nvme_sysfs_random_show, NULL);
+
+
static ssize_t nvme_sysfs_delete(struct device *dev,
struct device_attribute *attr, const char *buf,
size_t count)
@@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
&dev_attr_numa_node.attr,
&dev_attr_queue_count.attr,
&dev_attr_sqsize.attr,
+ &dev_attr_uniquifier.attr,
+ &dev_attr_random.attr,
&dev_attr_hostnqn.attr,
&dev_attr_hostid.attr,
&dev_attr_ctrl_loss_tmo.attr,
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (5 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-18 23:18 ` Randy Jennings
2025-12-25 13:29 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
` (6 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
Add NVME_CTRL_RECOVERING as a new controller state to be used while an
impacted controller is being recovered. A LIVE controller enters the
RECOVERING state when an IO error is encountered. While recovering,
inflight IOs are not canceled if they time out; they are canceled after
recovery finishes. Also, while recovering, a controller cannot be reset
or deleted. This is intentional because a reset or delete would cancel
inflight IOs. When recovery finishes, the impacted controller
transitions from the RECOVERING state to the RESETTING state. The reset
codepath takes care of queue teardown and inflight request
cancellation.
Note, there is no transition from RECOVERING to RESETTING added to
nvme_change_ctrl_state(). The reason is that the user should not be
allowed to reset or delete a controller that is being recovered.
Add an NVME_CTRL_RECOVERED controller flag. This flag is set on a
controller that is about to schedule delayed work for time-based
recovery.
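The transition rule the patch adds can be sketched as a small predicate;
the enum and function below are illustrative stand-ins for the kernel's
nvme_change_ctrl_state() table, not the patch's code:

```c
#include <stdbool.h>

/* Abbreviated stand-in for the patch's enum nvme_ctrl_state. */
enum ctrl_state { ST_NEW, ST_LIVE, ST_RECOVERING, ST_RESETTING };

/*
 * Only a LIVE controller may enter RECOVERING. The RECOVERING ->
 * RESETTING transition is deliberately absent from the table so that
 * userspace cannot reset or delete a controller mid-recovery; the
 * kernel flips that state directly in nvme_end_ctrl_recovery()
 * instead, bypassing nvme_change_ctrl_state().
 */
static bool can_enter_recovering(enum ctrl_state old)
{
	return old == ST_LIVE;
}
```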
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 10 ++++++++++
drivers/nvme/host/nvme.h | 2 ++
drivers/nvme/host/sysfs.c | 1 +
3 files changed, 13 insertions(+)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index aa007a7b9606..f5b84bc327d3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
break;
}
break;
+ case NVME_CTRL_RECOVERING:
+ switch (old_state) {
+ case NVME_CTRL_LIVE:
+ changed = true;
+ fallthrough;
+ default:
+ break;
+ }
+ break;
case NVME_CTRL_RESETTING:
switch (old_state) {
case NVME_CTRL_NEW:
@@ -761,6 +770,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
if (state != NVME_CTRL_DELETING_NOIO &&
state != NVME_CTRL_DELETING &&
state != NVME_CTRL_DEAD &&
+ state != NVME_CTRL_RECOVERING &&
!test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
!blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
return BLK_STS_RESOURCE;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5195a9abfadf..cde427353e0a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -251,6 +251,7 @@ static inline u16 nvme_req_qid(struct request *req)
enum nvme_ctrl_state {
NVME_CTRL_NEW,
NVME_CTRL_LIVE,
+ NVME_CTRL_RECOVERING,
NVME_CTRL_RESETTING,
NVME_CTRL_CONNECTING,
NVME_CTRL_DELETING,
@@ -275,6 +276,7 @@ enum nvme_ctrl_flags {
NVME_CTRL_SKIP_ID_CNS_CS = 4,
NVME_CTRL_DIRTY_CAPABILITY = 5,
NVME_CTRL_FROZEN = 6,
+ NVME_CTRL_RECOVERED = 7,
};
struct nvme_ctrl {
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index ae36249ad61e..55f907fb6c86 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -443,6 +443,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
static const char *const state_name[] = {
[NVME_CTRL_NEW] = "new",
[NVME_CTRL_LIVE] = "live",
+ [NVME_CTRL_RECOVERING] = "recovering",
[NVME_CTRL_RESETTING] = "resetting",
[NVME_CTRL_CONNECTING] = "connecting",
[NVME_CTRL_DELETING] = "deleting",
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (6 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-19 1:21 ` Randy Jennings
2025-12-27 10:14 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
` (5 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
A host that has more than one path connecting to an nvme subsystem
typically has an nvme controller associated with every path. This is
mostly applicable to nvmeof. If one path goes down, inflight IOs on
that path should not be retried immediately on another path because
this could lead to data corruption as described in TP4129. TP8028
defines a cross-controller reset mechanism that the host can use to
terminate IOs on the failed path using one of the remaining healthy
paths. Only after the IOs are terminated, or after enough time passes
as defined by TP4129, should inflight IOs be retried on another path.
Implement the core cross-controller reset logic to be shared by the
transports.
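The source-controller selection in the patch's nvme_find_ccr_ctrl() can
be sketched over a plain array; the struct and function names below are
illustrative, and the locking and refcounting from the real code are
omitted:

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for one controller on the subsystem's ctrls list. */
struct sctrl {
	uint16_t cntlid;
	int ccr_limit;	/* remaining CCR commands this ctrl may source */
	int live;	/* 1 if the controller state is LIVE */
};

/*
 * Walk the subsystem's controllers in list order and pick a CCR
 * source: skip any controller below min_cntlid (already tried), any
 * whose CCRL budget is exhausted, and any that is not LIVE. On
 * success one CCR slot is consumed, matching the patch's
 * atomic_dec_if_positive(&sctrl->ccr_limit). Returns the index of
 * the candidate, or -1 if none qualifies.
 */
static int find_ccr_source(struct sctrl *ctrls, size_t n,
			   uint16_t min_cntlid)
{
	for (size_t i = 0; i < n; i++) {
		if (ctrls[i].cntlid < min_cntlid)
			continue;
		if (ctrls[i].ccr_limit <= 0)
			continue;
		if (!ctrls[i].live)
			continue;
		ctrls[i].ccr_limit--;	/* consume one CCR slot */
		return (int)i;
	}
	return -1;
}
```

On failure the caller bumps min_cntlid past the tried controller and
loops, which is how nvme_recover_ctrl() tries each remaining path once.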
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 133 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 10 +++
3 files changed, 144 insertions(+)
diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index dc90df9e13a2..f679efd5110e 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
[nvme_admin_virtual_mgmt] = "Virtual Management",
[nvme_admin_nvme_mi_send] = "NVMe Send MI",
[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
+ [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
[nvme_admin_dbbuf] = "Doorbell Buffer Config",
[nvme_admin_format_nvm] = "Format NVM",
[nvme_admin_security_send] = "Security Send",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5b84bc327d3..f38b70ca9cee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -554,6 +554,138 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
+static struct nvme_ctrl *nvme_find_ccr_ctrl(struct nvme_ctrl *ictrl,
+ u32 min_cntlid)
+{
+ struct nvme_subsystem *subsys = ictrl->subsys;
+ struct nvme_ctrl *sctrl;
+ unsigned long flags;
+
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
+ if (sctrl->cntlid < min_cntlid)
+ continue;
+
+ if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
+ continue;
+
+ spin_lock_irqsave(&sctrl->lock, flags);
+ if (sctrl->state != NVME_CTRL_LIVE) {
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+ atomic_inc(&sctrl->ccr_limit);
+ continue;
+ }
+
+ /*
+ * We got a good candidate source controller that is locked and
+ * LIVE. However, no guarantee sctrl will not be deleted after
+ * sctrl->lock is released. Get a ref of both sctrl and admin_q
+ * so they do not disappear until we are done with them.
+ */
+ WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
+ nvme_get_ctrl(sctrl);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+ goto found;
+ }
+ sctrl = NULL;
+found:
+ mutex_unlock(&nvme_subsystems_lock);
+ return sctrl;
+}
+
+static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
+{
+ unsigned long flags, tmo, remain;
+ struct nvme_ccr_entry ccr = { };
+ union nvme_result res = { 0 };
+ struct nvme_command c = { };
+ u32 result;
+ int ret = 0;
+
+ init_completion(&ccr.complete);
+ ccr.ictrl = ictrl;
+
+ spin_lock_irqsave(&sctrl->lock, flags);
+ list_add_tail(&ccr.list, &sctrl->ccrs);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+
+ c.ccr.opcode = nvme_admin_cross_ctrl_reset;
+ c.ccr.ciu = ictrl->ciu;
+ c.ccr.icid = cpu_to_le16(ictrl->cntlid);
+ c.ccr.cirn = cpu_to_le64(ictrl->cirn);
+ ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
+ NULL, 0, NVME_QID_ANY, 0);
+ if (ret)
+ goto out;
+
+ result = le32_to_cpu(res.u32);
+ if (result & 0x01) /* Immediate Reset */
+ goto out;
+
+ tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
+ remain = wait_for_completion_timeout(&ccr.complete, tmo);
+ if (!remain)
+ ret = -EAGAIN;
+out:
+ spin_lock_irqsave(&sctrl->lock, flags);
+ list_del(&ccr.list);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+ return ccr.ccrs == 1 ? 0 : ret;
+}
+
+unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
+{
+ unsigned long deadline, now, timeout;
+ struct nvme_ctrl *sctrl;
+ u32 min_cntlid = 0;
+ int ret;
+
+ timeout = nvme_recovery_timeout_ms(ictrl);
+ dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
+
+ now = jiffies;
+ deadline = now + msecs_to_jiffies(timeout);
+ while (time_before(now, deadline)) {
+ sctrl = nvme_find_ccr_ctrl(ictrl, min_cntlid);
+ if (!sctrl) {
+ /* CCR failed, switch to time-based recovery */
+ return deadline - now;
+ }
+
+ ret = nvme_issue_wait_ccr(sctrl, ictrl);
+ atomic_inc(&sctrl->ccr_limit);
+
+ if (!ret) {
+ dev_info(ictrl->device, "CCR succeeded using %s\n",
+ dev_name(sctrl->device));
+ blk_put_queue(sctrl->admin_q);
+ nvme_put_ctrl(sctrl);
+ return 0;
+ }
+
+ /* Try another controller */
+ min_cntlid = sctrl->cntlid + 1;
+ blk_put_queue(sctrl->admin_q);
+ nvme_put_ctrl(sctrl);
+ now = jiffies;
+ }
+
+ dev_info(ictrl->device, "CCR reached timeout, call it done\n");
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_recover_ctrl);
+
+void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ WRITE_ONCE(ctrl->state, NVME_CTRL_RESETTING);
+ wake_up_all(&ctrl->state_wq);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+}
+EXPORT_SYMBOL_GPL(nvme_end_ctrl_recovery);
+
bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state)
{
@@ -5108,6 +5240,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
mutex_init(&ctrl->scan_lock);
INIT_LIST_HEAD(&ctrl->namespaces);
+ INIT_LIST_HEAD(&ctrl->ccrs);
xa_init(&ctrl->cels);
ctrl->dev = dev;
ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cde427353e0a..1f8937fce9a7 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -279,6 +279,13 @@ enum nvme_ctrl_flags {
NVME_CTRL_RECOVERED = 7,
};
+struct nvme_ccr_entry {
+ struct list_head list;
+ struct completion complete;
+ struct nvme_ctrl *ictrl;
+ u8 ccrs;
+};
+
struct nvme_ctrl {
bool comp_seen;
bool identified;
@@ -296,6 +303,7 @@ struct nvme_ctrl {
struct blk_mq_tag_set *tagset;
struct blk_mq_tag_set *admin_tagset;
struct list_head namespaces;
+ struct list_head ccrs;
struct mutex namespaces_lock;
struct srcu_struct srcu;
struct device ctrl_device;
@@ -805,6 +813,8 @@ blk_status_t nvme_host_path_error(struct request *req);
bool nvme_cancel_request(struct request *req, void *data);
void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
+unsigned long nvme_recover_ctrl(struct nvme_ctrl *ctrl);
+void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl);
bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
enum nvme_ctrl_state new_state);
int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (7 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-19 1:31 ` Randy Jennings
2025-12-27 10:24 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
` (4 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
An nvme source controller that issues a CCR command expects to receive
an NVME_AER_NOTICE_CCR_COMPLETED AEN when the pending CCR succeeds or
fails. Add sctrl->ccr_work to read the NVME_LOG_CCR logpage and wake up
any thread waiting on CCR completion.
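The matching loop in the patch's nvme_ccr_work() can be sketched with
flat structs; the types and function below are illustrative stand-ins
(the real code walks ctrl->ccrs under ctrl->lock and signals a
completion):

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified view of one CCR logpage entry and one pending request. */
struct ccr_log_entry { uint16_t icid; uint8_t ciu; uint8_t ccrs; };
struct pending_ccr   { uint16_t icid; uint8_t ciu; uint8_t ccrs; int done; };

/*
 * For every finished logpage entry (ccrs != 0), complete each pending
 * CCR whose impacted controller has the same CNTLID and CIU. Entries
 * with ccrs == 0 are still in progress and are skipped, exactly as
 * the patch does. Returns the number of pending CCRs completed.
 */
static int complete_pending_ccrs(const struct ccr_log_entry *log, size_t ne,
				 struct pending_ccr *pend, size_t np)
{
	int completed = 0;

	for (size_t i = 0; i < ne; i++) {
		if (log[i].ccrs == 0)	/* skip in-progress entries */
			continue;
		for (size_t j = 0; j < np; j++) {
			if (pend[j].icid != log[i].icid ||
			    pend[j].ciu != log[i].ciu)
				continue;
			pend[j].ccrs = log[i].ccrs;
			pend[j].done = 1;	/* complete(&ccr->complete) */
			completed++;
		}
	}
	return completed;
}
```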
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
drivers/nvme/host/nvme.h | 1 +
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f38b70ca9cee..467754e77a2d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1894,7 +1894,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
#define NVME_AEN_SUPPORTED \
(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
- NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
+ NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
+ NVME_AEN_CFG_DISC_CHANGE)
static void nvme_enable_aen(struct nvme_ctrl *ctrl)
{
@@ -4860,6 +4861,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
kfree(log);
}
+static void nvme_ccr_work(struct work_struct *work)
+{
+ struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
+ struct nvme_ccr_entry *ccr;
+ struct nvme_ccr_log_entry *entry;
+ struct nvme_ccr_log *log;
+ unsigned long flags;
+ int ret, i;
+
+ log = kmalloc(sizeof(*log), GFP_KERNEL);
+ if (!log)
+ return;
+
+ ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
+ 0x00, log, sizeof(*log), 0);
+ if (ret)
+ goto out;
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ for (i = 0; i < le16_to_cpu(log->ne); i++) {
+ entry = &log->entries[i];
+ if (entry->ccrs == 0) /* skip in progress entries */
+ continue;
+
+ list_for_each_entry(ccr, &ctrl->ccrs, list) {
+ struct nvme_ctrl *ictrl = ccr->ictrl;
+
+ if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
+ ictrl->ciu != entry->ciu)
+ continue;
+
+ /* Complete matching entry */
+ ccr->ccrs = entry->ccrs;
+ complete(&ccr->complete);
+ }
+ }
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+out:
+ kfree(log);
+}
+
static void nvme_fw_act_work(struct work_struct *work)
{
struct nvme_ctrl *ctrl = container_of(work,
@@ -4936,6 +4978,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
case NVME_AER_NOTICE_DISC_CHANGED:
ctrl->aen_result = result;
break;
+ case NVME_AER_NOTICE_CCR_COMPLETED:
+ queue_work(nvme_wq, &ctrl->ccr_work);
+ break;
default:
dev_warn(ctrl->device, "async event result %08x\n", result);
}
@@ -5126,6 +5171,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
nvme_stop_failfast_work(ctrl);
flush_work(&ctrl->async_event_work);
cancel_work_sync(&ctrl->fw_act_work);
+ cancel_work_sync(&ctrl->ccr_work);
if (ctrl->ops->stop_ctrl)
ctrl->ops->stop_ctrl(ctrl);
}
@@ -5247,6 +5293,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
ctrl->quirks = quirks;
ctrl->numa_node = NUMA_NO_NODE;
INIT_WORK(&ctrl->scan_work, nvme_scan_work);
+ INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1f8937fce9a7..3f5a0722304d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -366,6 +366,7 @@ struct nvme_ctrl {
struct nvme_effects_log *effects;
struct xarray cels;
struct work_struct scan_work;
+ struct work_struct ccr_work;
struct work_struct async_event_work;
struct delayed_work ka_work;
struct delayed_work failfast_work;
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (8 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-19 2:06 ` Randy Jennings
2025-12-27 10:35 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
` (3 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error now moves to the RECOVERING
state instead of the RESETTING state. In the RECOVERING state,
ctrl->err_work attempts to use cross-controller recovery to terminate
inflight IOs on the controller. If CCR succeeds, switch to the
RESETTING state and continue error recovery as usual by tearing down
the controller and attempting to reconnect to the target. If CCR fails,
the recovery behavior depends on whether CQT is supported. If CQT is
supported, switch to time-based recovery by holding inflight IOs until
it is safe to retry them. If CQT is not supported, proceed to retry
requests immediately, as the code currently does.
To support time-based recovery, turn ctrl->err_work into delayed work.
Update nvme_tcp_timeout() to not complete inflight IOs while the
controller is in the RECOVERING state.
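The decision made by the patch's nvme_tcp_recover_ctrl() can be sketched
as a pure function; the enum and names below are illustrative, and the
actual work-queueing and state changes are elided:

```c
#include <stdbool.h>

enum recover_action {
	RECOVER_RESET_NOW,	/* move to RESETTING and tear down */
	RECOVER_WAIT_TIMER,	/* re-queue err_work after the window */
};

/*
 * already_recovered corresponds to the NVME_CTRL_RECOVERED flag (the
 * time-based window has already elapsed); ccr_remaining is the value
 * returned by nvme_recover_ctrl(), zero meaning CCR succeeded or the
 * window expired; cqt is the controller's Command Quiesce Time, zero
 * meaning time-based recovery is not supported.
 */
static enum recover_action next_action(bool already_recovered,
				       unsigned long ccr_remaining,
				       unsigned int cqt)
{
	if (already_recovered)	/* time-based recovery completed */
		return RECOVER_RESET_NOW;
	if (!ccr_remaining)	/* CCR succeeded */
		return RECOVER_RESET_NOW;
	if (!cqt)		/* no CQT: cannot hold IOs safely */
		return RECOVER_RESET_NOW;
	return RECOVER_WAIT_TIMER;	/* hold IOs for the window */
}
```

The nvme-rdma patch that follows makes the same decision with identical
logic, only queueing its own err_work.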
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9a96df1a511c..ec9a713490a9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
struct sockaddr_storage src_addr;
struct nvme_ctrl ctrl;
- struct work_struct err_work;
+ struct delayed_work err_work;
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
u32 io_queues[HCTX_MAX_TYPES];
@@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
- if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
+ !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
return;
dev_warn(ctrl->device, "starting error recovery\n");
- queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
+ queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
}
static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
@@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
nvme_tcp_reconnect_or_remove(ctrl, ret);
}
+static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
+{
+ unsigned long rem;
+
+ if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
+ dev_info(ctrl->device, "completed time-based recovery\n");
+ goto done;
+ }
+
+ rem = nvme_recover_ctrl(ctrl);
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
+ dev_info(ctrl->device,
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
+ }
+
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
+ queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
+ return -EAGAIN;
+
+done:
+ nvme_end_ctrl_recovery(ctrl);
+ return 0;
+}
+
static void nvme_tcp_error_recovery_work(struct work_struct *work)
{
- struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
+ struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
struct nvme_tcp_ctrl, err_work);
struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+ if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
+ if (nvme_tcp_recover_ctrl(ctrl))
+ return;
+ }
+
if (nvme_tcp_key_revoke_needed(ctrl))
nvme_auth_revoke_tls_key(ctrl);
nvme_stop_keep_alive(ctrl);
@@ -2545,7 +2582,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
{
- flush_work(&to_tcp_ctrl(ctrl)->err_work);
+ flush_delayed_work(&to_tcp_ctrl(ctrl)->err_work);
cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
}
@@ -2640,6 +2677,7 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
{
struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;
+ enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
struct nvme_command *cmd = &pdu->cmd;
int qid = nvme_tcp_queue_id(req->queue);
@@ -2649,7 +2687,7 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
nvme_fabrics_opcode_str(qid, cmd), qid);
- if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
+ if (state != NVME_CTRL_LIVE && state != NVME_CTRL_RECOVERING) {
/*
* If we are resetting, connecting or deleting we should
* complete immediately because we may block controller
@@ -2903,7 +2941,7 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->connect_work,
nvme_tcp_reconnect_ctrl_work);
- INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
+ INIT_DELAYED_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
if (!(opts->mask & NVMF_OPT_TRSVCID)) {
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 11/14] nvme-rdma: Use CCR to recover controller that hits an error
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (9 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-19 2:16 ` Randy Jennings
2025-12-27 10:36 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
` (2 subsequent siblings)
13 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error now moves to the RECOVERING
state instead of the RESETTING state. In the RECOVERING state,
ctrl->err_work attempts to use cross-controller recovery to terminate
inflight IOs on the controller. If CCR succeeds, switch to the
RESETTING state and continue error recovery as usual by tearing down
the controller and attempting to reconnect to the target. If CCR fails,
the recovery behavior depends on whether CQT is supported. If CQT is
supported, switch to time-based recovery by holding inflight IOs until
it is safe to retry them. If CQT is not supported, proceed to retry
requests immediately, as the code currently does.
To support time-based recovery, turn ctrl->err_work into delayed work.
Update nvme_rdma_timeout() to not complete inflight IOs while the
controller is in the RECOVERING state.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/rdma.c | 51 ++++++++++++++++++++++++++++++++++------
1 file changed, 44 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 190a4cfa8a5e..4a8bb2614468 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -106,7 +106,7 @@ struct nvme_rdma_ctrl {
/* other member variables */
struct blk_mq_tag_set tag_set;
- struct work_struct err_work;
+ struct delayed_work err_work;
struct nvme_rdma_qe async_event_sqe;
@@ -961,7 +961,7 @@ static void nvme_rdma_stop_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
- flush_work(&ctrl->err_work);
+ flush_delayed_work(&ctrl->err_work);
cancel_delayed_work_sync(&ctrl->reconnect_work);
}
@@ -1120,11 +1120,46 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
nvme_rdma_reconnect_or_remove(ctrl, ret);
}
+static int nvme_rdma_recover_ctrl(struct nvme_ctrl *ctrl)
+{
+ unsigned long rem;
+
+ if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
+ dev_info(ctrl->device, "completed time-based recovery\n");
+ goto done;
+ }
+
+ rem = nvme_recover_ctrl(ctrl);
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
+ dev_info(ctrl->device,
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
+ }
+
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
+ queue_delayed_work(nvme_reset_wq, &to_rdma_ctrl(ctrl)->err_work, rem);
+ return -EAGAIN;
+
+done:
+ nvme_end_ctrl_recovery(ctrl);
+ return 0;
+}
static void nvme_rdma_error_recovery_work(struct work_struct *work)
{
- struct nvme_rdma_ctrl *ctrl = container_of(work,
+ struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
struct nvme_rdma_ctrl, err_work);
+ if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING) {
+ if (nvme_rdma_recover_ctrl(&ctrl->ctrl))
+ return;
+ }
+
nvme_stop_keep_alive(&ctrl->ctrl);
flush_work(&ctrl->ctrl.async_event_work);
nvme_rdma_teardown_io_queues(ctrl, false);
@@ -1147,11 +1182,12 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
{
- if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RECOVERING) &&
+ !nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
return;
dev_warn(ctrl->ctrl.device, "starting error recovery\n");
- queue_work(nvme_reset_wq, &ctrl->err_work);
+ queue_delayed_work(nvme_reset_wq, &ctrl->err_work, 0);
}
static void nvme_rdma_end_request(struct nvme_rdma_request *req)
@@ -1955,6 +1991,7 @@ static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq)
struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
struct nvme_rdma_queue *queue = req->queue;
struct nvme_rdma_ctrl *ctrl = queue->ctrl;
+ enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
struct nvme_command *cmd = req->req.cmd;
int qid = nvme_rdma_queue_idx(queue);
@@ -1963,7 +2000,7 @@ static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq)
rq->tag, nvme_cid(rq), cmd->common.opcode,
nvme_fabrics_opcode_str(qid, cmd), qid);
- if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_LIVE) {
+ if (state != NVME_CTRL_LIVE && state != NVME_CTRL_RECOVERING) {
/*
* If we are resetting, connecting or deleting we should
* complete immediately because we may block controller
@@ -2280,7 +2317,7 @@ static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
INIT_DELAYED_WORK(&ctrl->reconnect_work,
nvme_rdma_reconnect_ctrl_work);
- INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
+ INIT_DELAYED_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
ctrl->ctrl.queue_count = opts->nr_io_queues + opts->nr_write_queues +
--
2.51.2
^ permalink raw reply related [flat|nested] 68+ messages in thread
* [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (10 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
@ 2025-11-26 2:11 ` Mohamed Khalfella
2025-12-19 2:59 ` Randy Jennings
2025-11-26 2:12 ` [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-11-26 2:12 ` [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state Mohamed Khalfella
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:11 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
nvme_fc_error_recovery() called from nvme_fc_timeout() while the
controller is in CONNECTING state results in the deadlock reported in
the link below. Update nvme_fc_timeout() to schedule error recovery
instead, avoiding the deadlock.
Prior to this change, if the controller was LIVE, error recovery reset
the controller. This did not match nvme-tcp and nvme-rdma. Decouple
error recovery from controller reset to match the other fabric
transports.
Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
1 file changed, 41 insertions(+), 53 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 03987f497a5b..8b6a7c80015c 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
static struct device *fc_udev_device;
static void nvme_fc_complete_rq(struct request *rq);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+ char *errmsg);
/* *********************** FC-NVME Port Management ************************ */
@@ -786,7 +788,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
"Reconnect", ctrl->cnum);
set_bit(ASSOC_FAILED, &ctrl->flags);
- nvme_reset_ctrl(&ctrl->ctrl);
+ nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
}
/**
@@ -983,7 +985,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
-static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
+static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
static void
__nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
@@ -1563,9 +1565,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
* for the association have been ABTS'd by
* nvme_fc_delete_association().
*/
-
- /* fail the association */
- nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
+ nvme_fc_start_ioerr_recovery(ctrl,
+ "Disconnect Association LS received");
/* release the reference taken by nvme_fc_match_disconn_ls() */
nvme_fc_ctrl_put(ctrl);
@@ -1867,7 +1868,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
struct nvme_fc_ctrl *ctrl =
container_of(work, struct nvme_fc_ctrl, ioerr_work);
- nvme_fc_error_recovery(ctrl, "transport detected io error");
+ nvme_fc_error_recovery(ctrl);
}
/*
@@ -1888,6 +1889,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
}
EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+ char *errmsg)
+{
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+ return;
+
+ dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
+ ctrl->cnum, errmsg);
+ queue_delayed_work(nvme_reset_wq, &ctrl->ioerr_work, 0);
+}
+
static void
nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
{
@@ -2045,9 +2057,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
nvme_fc_complete_rq(rq);
check_error:
- if (terminate_assoc &&
- nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
- queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+ if (terminate_assoc)
+ nvme_fc_start_ioerr_recovery(ctrl, "io error");
}
static int
@@ -2497,39 +2508,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
nvme_unquiesce_admin_queue(&ctrl->ctrl);
}
-static void
-nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
-{
- enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
-
- /*
- * if an error (io timeout, etc) while (re)connecting, the remote
- * port requested terminating of the association (disconnect_ls)
- * or an error (timeout or abort) occurred on an io while creating
- * the controller. Abort any ios on the association and let the
- * create_association error path resolve things.
- */
- if (state == NVME_CTRL_CONNECTING) {
- __nvme_fc_abort_outstanding_ios(ctrl, true);
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: transport error during (re)connect\n",
- ctrl->cnum);
- return;
- }
-
- /* Otherwise, only proceed if in LIVE state - e.g. on first error */
- if (state != NVME_CTRL_LIVE)
- return;
-
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: transport association event: %s\n",
- ctrl->cnum, errmsg);
- dev_warn(ctrl->ctrl.device,
- "NVME-FC{%d}: resetting controller\n", ctrl->cnum);
-
- nvme_reset_ctrl(&ctrl->ctrl);
-}
-
static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
{
struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
@@ -2538,24 +2516,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
struct nvme_command *sqe = &cmdiu->sqe;
- /*
- * Attempt to abort the offending command. Command completion
- * will detect the aborted io and will fail the connection.
- */
dev_info(ctrl->ctrl.device,
"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
"x%08x/x%08x\n",
ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
nvme_fabrics_opcode_str(qnum, sqe),
sqe->common.cdw10, sqe->common.cdw11);
- if (__nvme_fc_abort_op(ctrl, op))
- nvme_fc_error_recovery(ctrl, "io timeout abort failed");
- /*
- * the io abort has been initiated. Have the reset timer
- * restarted and the abort completion will complete the io
- * shortly. Avoids a synchronous wait while the abort finishes.
- */
+ nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
return BLK_EH_RESET_TIMER;
}
@@ -3347,6 +3315,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
}
}
+static void
+nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
+{
+ nvme_stop_keep_alive(&ctrl->ctrl);
+ nvme_stop_ctrl(&ctrl->ctrl);
+
+ /* will block while waiting for io to terminate */
+ nvme_fc_delete_association(ctrl);
+
+ /* Do not reconnect if controller is being deleted */
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
+ return;
+
+ if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
+ queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
+ return;
+ }
+
+ nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
+}
static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
.name = "fc",
--
2.51.2
* [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (11 preceding siblings ...)
2025-11-26 2:11 ` [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2025-11-26 2:12 ` Mohamed Khalfella
2025-12-20 1:21 ` Randy Jennings
2025-11-26 2:12 ` [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state Mohamed Khalfella
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:12 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
An alive nvme controller that hits an error will now move to the
RECOVERING state instead of the RESETTING state. In the RECOVERING
state, ctrl->err_work will attempt to use cross-controller recovery to
terminate inflight IOs on the controller. If CCR succeeds, switch to
the RESETTING state and continue error recovery as usual by tearing
down the controller and attempting to reconnect to the target. If CCR
fails, the behavior of recovery depends on whether CQT is supported.
If CQT is supported, switch to time-based recovery by holding inflight
IOs until it is safe for them to be retried. If CQT is not supported,
proceed to retry requests immediately, as the code currently does.
Currently, inflight IOs can get completed during time-based recovery.
This will be addressed in the next patch.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 52 ++++++++++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 8b6a7c80015c..0e4d271bb4b6 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -166,7 +166,7 @@ struct nvme_fc_ctrl {
struct blk_mq_tag_set admin_tag_set;
struct blk_mq_tag_set tag_set;
- struct work_struct ioerr_work;
+ struct delayed_work ioerr_work;
struct delayed_work connect_work;
struct kref ref;
@@ -1862,11 +1862,48 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
}
}
+static int nvme_fc_recover_ctrl(struct nvme_ctrl *ctrl)
+{
+ unsigned long rem;
+
+ if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
+ dev_info(ctrl->device, "completed time-based recovery\n");
+ goto done;
+ }
+
+ rem = nvme_recover_ctrl(ctrl);
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
+ dev_info(ctrl->device,
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
+ }
+
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+
+ set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
+ queue_delayed_work(nvme_reset_wq, &to_fc_ctrl(ctrl)->ioerr_work, rem);
+ return -EAGAIN;
+
+done:
+ nvme_end_ctrl_recovery(ctrl);
+ return 0;
+}
+
static void
nvme_fc_ctrl_ioerr_work(struct work_struct *work)
{
- struct nvme_fc_ctrl *ctrl =
- container_of(work, struct nvme_fc_ctrl, ioerr_work);
+ struct nvme_fc_ctrl *ctrl = container_of(to_delayed_work(work),
+ struct nvme_fc_ctrl, ioerr_work);
+
+ if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING) {
+ if (nvme_fc_recover_ctrl(&ctrl->ctrl))
+ return;
+ }
nvme_fc_error_recovery(ctrl);
}
@@ -1892,7 +1929,8 @@ EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
char *errmsg)
{
- if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RECOVERING) &&
+ !nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
return;
dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
@@ -3227,7 +3265,7 @@ nvme_fc_delete_ctrl(struct nvme_ctrl *nctrl)
{
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
- cancel_work_sync(&ctrl->ioerr_work);
+ cancel_delayed_work_sync(&ctrl->ioerr_work);
cancel_delayed_work_sync(&ctrl->connect_work);
/*
* kill the association on the link side. this will block
@@ -3465,7 +3503,7 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
- INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
+ INIT_DELAYED_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
spin_lock_init(&ctrl->lock);
/* io queue count */
@@ -3563,7 +3601,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
fail_ctrl:
nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_DELETING);
- cancel_work_sync(&ctrl->ioerr_work);
+ cancel_delayed_work_sync(&ctrl->ioerr_work);
cancel_work_sync(&ctrl->ctrl.reset_work);
cancel_delayed_work_sync(&ctrl->connect_work);
--
2.51.2
* [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
` (12 preceding siblings ...)
2025-11-26 2:12 ` [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2025-11-26 2:12 ` Mohamed Khalfella
2025-12-20 1:44 ` Randy Jennings
13 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-11-26 2:12 UTC (permalink / raw)
To: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel, Mohamed Khalfella
nvme_fc_delete_association(), called from the error recovery codepath,
waits for all requests to be completed. In the RECOVERING state,
inflight IOs should be held until it is safe for them to be retried.
Update nvme_fc_fcpio_done() to not complete requests while in the
RECOVERING state. Update the recovery codepath to cancel inflight
requests, similar to what nvme-tcp and nvme-rdma do today.
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
drivers/nvme/host/fc.c | 50 +++++++++++++++++++++++++++++++++---------
1 file changed, 40 insertions(+), 10 deletions(-)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 0e4d271bb4b6..1b4f42358f37 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -171,7 +171,7 @@ struct nvme_fc_ctrl {
struct kref ref;
unsigned long flags;
- u32 iocnt;
+ atomic_t iocnt;
wait_queue_head_t ioabort_wait;
struct nvme_fc_fcp_op aen_ops[NVME_NR_AEN_COMMANDS];
@@ -1816,7 +1816,7 @@ __nvme_fc_abort_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_fcp_op *op)
atomic_set(&op->state, opstate);
else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
op->flags |= FCOP_FLAGS_TERMIO;
- ctrl->iocnt++;
+ atomic_inc(&ctrl->iocnt);
}
spin_unlock_irqrestore(&ctrl->lock, flags);
@@ -1846,20 +1846,29 @@ nvme_fc_abort_aen_ops(struct nvme_fc_ctrl *ctrl)
}
static inline void
+__nvme_fc_fcpop_count_one_down(struct nvme_fc_ctrl *ctrl)
+{
+ if (atomic_dec_return(&ctrl->iocnt) == 0)
+ wake_up(&ctrl->ioabort_wait);
+}
+
+static inline bool
__nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
struct nvme_fc_fcp_op *op, int opstate)
{
unsigned long flags;
+ bool ret = false;
if (opstate == FCPOP_STATE_ABORTED) {
spin_lock_irqsave(&ctrl->lock, flags);
if (test_bit(FCCTRL_TERMIO, &ctrl->flags) &&
op->flags & FCOP_FLAGS_TERMIO) {
- if (!--ctrl->iocnt)
- wake_up(&ctrl->ioabort_wait);
+ ret = true;
}
spin_unlock_irqrestore(&ctrl->lock, flags);
}
+
+ return ret;
}
static int nvme_fc_recover_ctrl(struct nvme_ctrl *ctrl)
@@ -1950,7 +1959,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
struct nvme_command *sqe = &op->cmd_iu.sqe;
__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
union nvme_result result;
- bool terminate_assoc = true;
+ bool op_term, terminate_assoc = true;
int opstate;
/*
@@ -2083,17 +2092,34 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
done:
if (op->flags & FCOP_FLAGS_AEN) {
nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+ __nvme_fc_fcpop_count_one_down(ctrl);
atomic_set(&op->state, FCPOP_STATE_IDLE);
op->flags = FCOP_FLAGS_AEN; /* clear other flags */
nvme_fc_ctrl_put(ctrl);
goto check_error;
}
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ /*
+ * We can not access op after the request is completed because it can
+ * be reused immediately. At the same time we want to wakeup the thread
+ * waiting for ongoing IOs _after_ requests are completed. This is
+ * necessary because that thread will start canceling inflight IOs
+ * and we want to avoid request completion racing with cancellation.
+ */
+ op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+
+ /* Error recovery completes inflight requests when it is safe */
+ if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING)
+ goto check_op_term;
+
if (!nvme_try_complete_req(rq, status, result))
nvme_fc_complete_rq(rq);
+check_op_term:
+ if (op_term)
+ __nvme_fc_fcpop_count_one_down(ctrl);
+
check_error:
if (terminate_assoc)
nvme_fc_start_ioerr_recovery(ctrl, "io error");
@@ -2737,7 +2763,8 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
* cmd with the csn was supposed to arrive.
*/
opstate = atomic_xchg(&op->state, FCPOP_STATE_COMPLETE);
- __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+ if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+ __nvme_fc_fcpop_count_one_down(ctrl);
if (!(op->flags & FCOP_FLAGS_AEN)) {
nvme_fc_unmap_data(ctrl, op->rq, op);
@@ -3206,7 +3233,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
spin_lock_irqsave(&ctrl->lock, flags);
set_bit(FCCTRL_TERMIO, &ctrl->flags);
- ctrl->iocnt = 0;
+ atomic_set(&ctrl->iocnt, 0);
spin_unlock_irqrestore(&ctrl->lock, flags);
__nvme_fc_abort_outstanding_ios(ctrl, false);
@@ -3215,8 +3242,8 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
nvme_fc_abort_aen_ops(ctrl);
/* wait for all io that had to be aborted */
+ wait_event(ctrl->ioabort_wait, atomic_read(&ctrl->iocnt) == 0);
spin_lock_irq(&ctrl->lock);
- wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
clear_bit(FCCTRL_TERMIO, &ctrl->flags);
spin_unlock_irq(&ctrl->lock);
@@ -3362,6 +3389,9 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
/* will block while waiting for io to terminate */
nvme_fc_delete_association(ctrl);
+ nvme_cancel_tagset(&ctrl->ctrl);
+ nvme_cancel_admin_tagset(&ctrl->ctrl);
+
/* Do not reconnect if controller is being deleted */
if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
return;
--
2.51.2
* Re: [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
2025-11-26 2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
@ 2025-12-16 1:35 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-16 1:35 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> TP8028 Rapid Path Failure Recovery defined new fields in controller
> identify response.
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
2025-11-26 2:11 ` [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
@ 2025-12-16 1:43 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-16 1:43 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Export ctrl->random and ctrl->uniquifier as debugfs files under
> controller debugfs directory.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-11-26 2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2025-12-16 3:01 ` Randy Jennings
2025-12-31 21:14 ` Mohamed Khalfella
2025-12-25 13:14 ` Sagi Grimberg
1 sibling, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-16 3:01 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> Reset) command is an nvme command the is issued to source controller by
> initiator to reset impacted controller. Implement CCR command for linux
> nvme target.
Remove extraneous "the is" in second line.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 04/14] nvmet: Implement CCR logpage
2025-11-26 2:11 ` [RFC PATCH 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
@ 2025-12-16 3:11 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-16 3:11 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> Reset) log page contains an entry for each CCR request submitted to
> source controller. Implement CCR logpage for nvme linux target.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-11-26 2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
@ 2025-12-16 3:31 ` Randy Jennings
2025-12-25 13:23 ` Sagi Grimberg
1 sibling, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-16 3:31 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Send an AEN to initiator when impacted controller exists. The
> notification points to CCR log page that initiator can read to check
> which CCR operation completed.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
2025-11-26 2:11 ` [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
@ 2025-12-18 15:22 ` Randy Jennings
2025-12-31 22:26 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-18 15:22 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> TP2028 Rapid path failure added new fileds to controller identify
TP8028
> response. Read CIU (Controller Instance Uniquifier), CIRN (Controller
> Instance Random Number), and CCRL (Cross-Controller Reset Limit) from
> controller identify response. Expose CIU and CIRN as sysfs attributes
> so the values can be used directrly by user if needed.
>
> TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> Time) which is used along with KATO (Keep Alive Timeout) to set an upper
> limite for attempting Cross-Controller Recovery.
"limite" -> "limit"
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/core.c | 5 +++++
> drivers/nvme/host/nvme.h | 11 +++++++++++
> drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
> 3 files changed, 39 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fa4181d7de73..aa007a7b9606 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3572,12 +3572,17 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
> ctrl->crdt[1] = le16_to_cpu(id->crdt2);
> ctrl->crdt[2] = le16_to_cpu(id->crdt3);
>
> + ctrl->ciu = id->ciu;
> + ctrl->cirn = le64_to_cpu(id->cirn);
> + atomic_set(&ctrl->ccr_limit, id->ccrl);
Seems like it would be good for the target & init to use the same
name for these fields. I have a preference for these over
instance_uniquifier and random because they are more concise, but
the preference is not strong.
> +
> ctrl->oacs = le16_to_cpu(id->oacs);
> ctrl->oncs = le16_to_cpu(id->oncs);
> ctrl->mtfa = le16_to_cpu(id->mtfa);
> ctrl->oaes = le32_to_cpu(id->oaes);
> ctrl->wctemp = le16_to_cpu(id->wctemp);
> ctrl->cctemp = le16_to_cpu(id->cctemp);
> + ctrl->cqt = le16_to_cpu(id->cqt);
>
> atomic_set(&ctrl->abort_limit, id->acl + 1);
> ctrl->vwc = id->vwc;
I cannot discern an ordering to the attributes set here. Any
particular reason, you placed cqt away from the others you added?
> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> index 29430949ce2f..ae36249ad61e 100644
> --- a/drivers/nvme/host/sysfs.c
> +++ b/drivers/nvme/host/sysfs.c
> @@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
> nvme_show_int_function(sqsize);
> nvme_show_int_function(kato);
>
> +static ssize_t nvme_sysfs_uniquifier_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> +
> + return sysfs_emit(buf, "%02x\n", ctrl->ciu);
> +}
> +static DEVICE_ATTR(uniquifier, S_IRUGO, nvme_sysfs_uniquifier_show, NULL);
> +
> +static ssize_t nvme_sysfs_random_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> +
> + return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
> +}
> +static DEVICE_ATTR(random, S_IRUGO, nvme_sysfs_random_show, NULL);
> +
> +
> static ssize_t nvme_sysfs_delete(struct device *dev,
> struct device_attribute *attr, const char *buf,
> size_t count)
> @@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
> &dev_attr_numa_node.attr,
> &dev_attr_queue_count.attr,
> &dev_attr_sqsize.attr,
> + &dev_attr_uniquifier.attr,
> + &dev_attr_random.attr,
> &dev_attr_hostnqn.attr,
> &dev_attr_hostid.attr,
> &dev_attr_ctrl_loss_tmo.attr,
> --
> 2.51.2
>
These are the names used in the target code (uniquifier & random).
I'd rather have them match: the identify structure will have the
spec's abbreviations; the ctrl and debugfs/sysfs names for target and
initiator should either both be ciu/cirn or uniquifier/random.
But this is small stuff.
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-11-26 2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
@ 2025-12-18 23:18 ` Randy Jennings
2025-12-19 1:39 ` Randy Jennings
2025-12-25 13:29 ` Sagi Grimberg
1 sibling, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-18 23:18 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> impacted controller is being recovered. A LIVE controller enters
> RECOVERING state when an IO error is encountered. While recovering
> inflight IOs will not be canceled if they timeout. These IOs will be
> canceled after recovery finishes. Also, while recovering a controller
> can not be reset or deleted. This is intentional because reset or delete
> will result in canceling inflight IOs. When recovery finishes, the
> impacted controller transitions from RECOVERING state to RESETTING state.
> Reset codepath takes care of queues teardown and inflight requests
> cancellation.
>
> Note, there is no transition from RECOVERING to RESETTING added to
> nvme_change_ctrl_state(). The reason is that user should not be allowed
> to reset or delete a controller that is being recovered.
>
> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> about to schedule delayed work for time based recovery.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-11-26 2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2025-12-19 1:21 ` Randy Jennings
2025-12-27 10:14 ` Sagi Grimberg
1 sibling, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 1:21 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> A host that has more than one path connecting to an nvme subsystem
> typically has an nvme controller associated with every path. This is
> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> path should not be retried immediately on another path because this
> could lead to data corruption as described in TP4129. TP8028 defines
> cross-controller reset mechanism that can be used by host to terminate
> IOs on the failed path using one of the remaining healthy paths. Only
> after IOs are terminated, or long enough time passes as defined by
> TP4129, inflight IOs should be retried on another path. Implement core
> cross-controller reset shared logic to be used by the transports.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> +unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
> + now = jiffies;
> + deadline = now + msecs_to_jiffies(timeout);
> + while (time_before(now, deadline)) {
> + now = jiffies;
> + }
I would use a for-loop to keep the advancing statement close to the condition.
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2025-11-26 2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2025-12-19 1:31 ` Randy Jennings
2025-12-27 10:24 ` Sagi Grimberg
1 sibling, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 1:31 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> An nvme source controller that issues CCR command expects to receive an
> NVME_AER_NOTICE_CCR_COMPLETED when pending CCR succeeds or fails. Add
> sctrl->ccr_work to read NVME_LOG_CCR logpage and wakeup any thread
> waiting on CCR completion.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-18 23:18 ` Randy Jennings
@ 2025-12-19 1:39 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 1:39 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu, Dec 18, 2025 at 3:18 PM Randy Jennings <randyj@purestorage.com> wrote:
>
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > Reset codepath takes care of queues teardown and inflight requests
> > cancellation.
Note: Tearing down the connection (with the queues) after going through
CCR-based recovery or time-based recovery is late. CCR will trigger a
disconnect on the host side, and, because we stop traffic to the nvme
controller, KATO should kick in if CCR does not, so I accept your
argument that tearing down the connections first could be considered
an optimization that can get implemented later.
Sincerely,
Randy Jennings
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-11-26 2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2025-12-19 2:06 ` Randy Jennings
2026-01-01 0:04 ` Mohamed Khalfella
2025-12-27 10:35 ` Sagi Grimberg
1 sibling, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 2:06 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> An alive nvme controller that hits an error now will move to RECOVERING
> state instead of RESETTING state. In RECOVERING state ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usuall by tearing down controller and attempt
> reconnecting to target. If CCR fails, then the behavior of recovery
"usuall" -> "usual"
"attempt reconnecting" -> "attempting to reconnect"
it would read better with "the" added:
"tearing down the controller"
"reconnect to the target"
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> + dev_info(ctrl->device,
> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> + jiffies_to_msecs(rem));
> + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> + return -EAGAIN;
I see how setting this bit before the delayed work executes works
to complete recovery, but it is kind of weird that the bit is called
RECOVERED. I do not have a better name. TIME_BASED_RECOVERY?
RECOVERY_WAIT?
> static void nvme_tcp_error_recovery_work(struct work_struct *work)
> {
> - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_tcp_recover_ctrl(ctrl))
> + return;
> + }
> +
> if (nvme_tcp_key_revoke_needed(ctrl))
> nvme_auth_revoke_tls_key(ctrl);
> nvme_stop_keep_alive(ctrl);
The state of the controller should not be LIVE while waiting for
recovery, so I do not think we will succeed in sending keep alives,
but I think this should move to before (or inside of)
nvme_tcp_recover_ctrl().
Sincerely,
Randy Jennings
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 11/14] nvme-rdma: Use CCR to recover controller that hits an error
2025-11-26 2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
@ 2025-12-19 2:16 ` Randy Jennings
2025-12-27 10:36 ` Sagi Grimberg
1 sibling, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 2:16 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> An alive nvme controller that hits an error will now move to RECOVERING
> state instead of RESETTING state. In RECOVERING state, ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usuall by tearing down the controller, and
> attempting reconnect to target. If CCR fails, the behavior of recovery
"usuall" -> "usual"
"attempt reconnecting" -> "attempting to reconnect"
it would read better with "the" added:
"reconnect to the target"
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 190a4cfa8a5e..4a8bb2614468 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> +static int nvme_rdma_recover_ctrl(struct nvme_ctrl *ctrl)
> + queue_delayed_work(nvme_reset_wq, &to_rdma_ctrl(ctrl)->err_work, rem);
nvme_rdma_recover_ctrl is exactly the same as
nvme_tcp_recover_ctrl. Seems like a core.c function
nvme_recover_ctrl could take a delayed work queue,
unifying the code.
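A minimal sketch of that unification, with all types and names as illustrative stand-ins rather than the real kernel API: the core helper takes the transport's delayed work, so tcp/rdma/fc each pass their own err_work instead of duplicating the function.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for kernel types (not the real API). */
struct delayed_work { unsigned long queued_delay; int queued; };
struct nvme_ctrl    { unsigned long recovery_remaining; };

/* Mock of queue_delayed_work(): records the requested delay. */
static void queue_delayed_work_mock(struct delayed_work *dw, unsigned long delay)
{
	dw->queued = 1;
	dw->queued_delay = delay;
}

/*
 * Hypothetical core helper: instead of nvme_tcp/rdma/fc each carrying an
 * identical nvme_*_recover_ctrl(), the transport passes its err_work and
 * the shared code queues the time-based recovery delay.
 */
static int nvme_recover_ctrl_core(struct nvme_ctrl *ctrl,
				  struct delayed_work *err_work)
{
	unsigned long rem = ctrl->recovery_remaining;

	if (!rem)
		return 0;	/* CCR succeeded: continue the reset path */

	queue_delayed_work_mock(err_work, rem);
	return -1;		/* time-based recovery scheduled (-EAGAIN) */
}
```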
> static void nvme_rdma_error_recovery_work(struct work_struct *work)
> {
> - struct nvme_rdma_ctrl *ctrl = container_of(work,
> + struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
> struct nvme_rdma_ctrl, err_work);
>
> + if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_rdma_recover_ctrl(&ctrl->ctrl))
> + return;
> + }
> +
> nvme_stop_keep_alive(&ctrl->ctrl);
The state of the controller should not be LIVE while waiting for
recovery, so I do not think we will succeed in sending keep alives,
but I think this should move to before (or inside of)
nvme_rdma_recover_ctrl().
Sincerely,
Randy Jennings
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset
2025-11-26 2:11 ` [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2025-12-19 2:59 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-19 2:59 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> + char *errmsg);
>
> /* *********************** FC-NVME Port Management ************************ */
> @@ -983,7 +985,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
>
> -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
does it make sense for nvme_fc_error_recovery() to move to be with
nvme_fc_start_ioerr_recovery()?
> @@ -2497,39 +2508,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> nvme_unquiesce_admin_queue(&ctrl->ctrl);
> }
>
> -static void
> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> @@ -3347,6 +3315,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> }
> }
>
> +static void
> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
I'm curious about the motivation for moving the function.
> +{
> + nvme_stop_keep_alive(&ctrl->ctrl);
This is a new addition. I think it is a good one; FC can use
keep alives, and nvme_stop_keep_alive() is called in one
other place, but I do not think the Linux nvme-fc driver
uses keep alives.
Reviewed-by: Randy Jennings <randyj@purestorage.com>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error
2025-11-26 2:12 ` [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2025-12-20 1:21 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-20 1:21 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> An alive nvme controller that hits an error will now move to RECOVERING
> state instead of RESETTING state. In RECOVERING state, ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usuall by tearing down the controller, and
> attempting reconnect to target. If CCR fails, the behavior of recovery
"usuall" -> "usual"
"attempt reconnecting" -> "attempting to reconnect"
it would read better with "the" added:
"reconnect to the target"
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> @@ -1862,11 +1862,48 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
> +static int nvme_fc_recover_ctrl(struct nvme_ctrl *ctrl)
> + queue_delayed_work(nvme_reset_wq, &to_fc_ctrl(ctrl)->ioerr_work, rem);
Just like nvme_rdma_recover_ctrl,
nvme_fc_recover_ctrl is exactly the same as
nvme_tcp_recover_ctrl. Seems like a core.c function
nvme_recover_ctrl could take a delayed work queue,
unifying the code.
> nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> {
> + if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_fc_recover_ctrl(&ctrl->ctrl))
> + return;
> + }
>
> nvme_fc_error_recovery(ctrl);
Inside of nvme_fc_error_recovery(), we call nvme_stop_keep_alive().
The state of the controller should not be LIVE while waiting for
recovery, so I do not think we will succeed in sending keep alives,
but I think this should move to before (or inside of)
nvme_fc_recover_ctrl(). You have replaced all the calls to
nvme_fc_error_recovery() with nvme_fc_start_ioerr_recovery(),
so that might be okay.
Sincerely,
Randy Jennings
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state
2025-11-26 2:12 ` [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state Mohamed Khalfella
@ 2025-12-20 1:44 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2025-12-20 1:44 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> @@ -2083,17 +2092,34 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> + /*
> + * We can not access op after the request is completed because it can
> + * be reused immediately. At the same time we want to wakeup the thread
> + * waiting for ongoing IOs _after_ requests are completed. This is
> + * necessary because that thread will start canceling inflight IOs
> + * and we want to avoid request completion racing with cancellation.
> + */
> + op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
> +
> + /* Error recovery completes inflight reqeusts when it is safe */
"reqeusts" -> "requests"
> + if (nvme_ctrl_state(&ctrl->ctrl) == NVME_CTRL_RECOVERING)
> + goto check_op_term;
> +
> if (!nvme_try_complete_req(rq, status, result))
> nvme_fc_complete_rq(rq);
>
> +check_op_term:
> + if (op_term)
> + __nvme_fc_fcpop_count_one_down(ctrl);
I think it is easier to grok:
+ op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+
+ /* Error recovery completes inflight reqeusts when it is safe */
+ if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RECOVERING &&
+ !nvme_try_complete_req(rq, status, result))
nvme_fc_complete_rq(rq);
+
+ if (op_term)
+ __nvme_fc_fcpop_count_one_down(ctrl);
Reviewed-by: Randy Jennings <randyj@purestorage.com>
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-11-26 2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
2025-12-16 3:01 ` Randy Jennings
@ 2025-12-25 13:14 ` Sagi Grimberg
2025-12-25 17:33 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-25 13:14 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> Reset) command is an nvme command the is issued to source controller by
> initiator to reset impacted controller. Implement CCR command for linux
> nvme target.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/target/admin-cmd.c | 79 +++++++++++++++++++++++++++++++++
> drivers/nvme/target/core.c | 69 ++++++++++++++++++++++++++++
> drivers/nvme/target/nvmet.h | 13 ++++++
> include/linux/nvme.h | 23 ++++++++++
> 4 files changed, 184 insertions(+)
>
> diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> index aaceb697e4d2..a55ca010d34f 100644
> --- a/drivers/nvme/target/admin-cmd.c
> +++ b/drivers/nvme/target/admin-cmd.c
> @@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
> log->acs[nvme_admin_get_features] =
> log->acs[nvme_admin_async_event] =
> log->acs[nvme_admin_keep_alive] =
> + log->acs[nvme_admin_cross_ctrl_reset] =
> cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
> +
> }
>
> static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
> @@ -1615,6 +1617,80 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
> nvmet_req_complete(req, status);
> }
>
> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
> +{
> + struct nvmet_ctrl *ictrl, *ctrl = req->sq->ctrl;
> + struct nvme_command *cmd = req->cmd;
> + struct nvmet_ccr *ccr, *new_ccr;
> + int ccr_active, ccr_total;
> + u16 cntlid, status = 0;
> +
> + cntlid = le16_to_cpu(cmd->ccr.icid);
> + if (ctrl->cntlid == cntlid) {
> + req->error_loc =
> + offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
> + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> + goto out;
> + }
> +
> + ictrl = nvmet_ctrl_find_get_ccr(ctrl->subsys, ctrl->hostnqn,
What does the 'i' stand for?
> + cmd->ccr.ciu, cntlid,
> + le64_to_cpu(cmd->ccr.cirn));
> + if (!ictrl) {
> + /* Immediate Reset Successful */
> + nvmet_set_result(req, 1);
> + status = NVME_SC_SUCCESS;
> + goto out;
> + }
> +
> + new_ccr = kmalloc(sizeof(*ccr), GFP_KERNEL);
> + if (!new_ccr) {
> + status = NVME_SC_INTERNAL;
> + goto out_put_ctrl;
> + }
Allocating this later when you actually use it would probably simplify
error path.
> +
> + ccr_total = ccr_active = 0;
> + mutex_lock(&ctrl->lock);
> + list_for_each_entry(ccr, &ctrl->ccrs, entry) {
> + if (ccr->ctrl == ictrl) {
> + status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
> + goto out_unlock;
> + }
> +
> + ccr_total++;
> + if (ccr->ctrl)
> + ccr_active++;
> + }
> +
> + if (ccr_active >= NVMF_CCR_LIMIT) {
> + status = NVME_SC_CCR_LIMIT_EXCEEDED;
> + goto out_unlock;
> + }
> + if (ccr_total >= NVMF_CCR_PER_PAGE) {
> + status = NVME_SC_CCR_LOGPAGE_FULL;
> + goto out_unlock;
> + }
> +
> + new_ccr->ciu = cmd->ccr.ciu;
> + new_ccr->icid = cntlid;
> + new_ccr->ctrl = ictrl;
> + list_add_tail(&new_ccr->entry, &ctrl->ccrs);
> + mutex_unlock(&ctrl->lock);
> +
> + nvmet_ctrl_fatal_error(ictrl);
Don't you need to wait for it to complete?
e.g. flush_work(&ictrl->fatal_err_work);
Or is that done async? will need to look downstream...
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-11-26 2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
2025-12-16 3:31 ` Randy Jennings
@ 2025-12-25 13:23 ` Sagi Grimberg
2025-12-25 18:13 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-25 13:23 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> Send an AEN to initiator when impacted controller exits. The
> notification points to CCR log page that initiator can read to check
> which CCR operation completed.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
> drivers/nvme/target/nvmet.h | 3 ++-
> include/linux/nvme.h | 3 +++
> 3 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 7dbe9255ff42..60173833c3eb 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> nvmet_async_events_process(ctrl);
> }
>
> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> u8 event_info, u8 log_page)
> {
> struct nvmet_async_event *aen;
> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> aen->event_info = event_info;
> aen->log_page = log_page;
>
> - mutex_lock(&ctrl->lock);
> list_add_tail(&aen->entry, &ctrl->async_events);
> - mutex_unlock(&ctrl->lock);
>
> queue_work(nvmet_wq, &ctrl->async_event_work);
> }
> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> + u8 event_info, u8 log_page)
> +{
> + mutex_lock(&ctrl->lock);
> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> + mutex_unlock(&ctrl->lock);
> +}
>
> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> {
> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> }
> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>
> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> +{
> + lockdep_assert_held(&ctrl->lock);
> +
> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> + return;
> +
> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> + NVME_AER_NOTICE_CCR_COMPLETED,
> + NVME_LOG_CCR);
> +}
> +
> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> {
> struct nvmet_subsys *subsys = ctrl->subsys;
> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> mutex_lock(&sctrl->lock);
> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
> - if (ccr->ctrl == ctrl)
> + if (ccr->ctrl == ctrl) {
> + nvmet_ctrl_notify_ccr(sctrl);
> ccr->ctrl = NULL;
> + }
Is this double loop necessary? Would you have more than one controller
cross resetting the same
controller? Won't it be better to install a callback+opaque that the
controller removal will call?
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-11-26 2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
2025-12-18 23:18 ` Randy Jennings
@ 2025-12-25 13:29 ` Sagi Grimberg
2025-12-25 17:17 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-25 13:29 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> impacted controller is being recovered. A LIVE controller enters
> RECOVERING state when an IO error is encountered. While recovering
> inflight IOs will not be canceled if they timeout. These IOs will be
> canceled after recovery finishes. Also, while recovering a controller
> can not be reset or deleted. This is intentional because reset or delete
> will result in canceling inflight IOs. When recovery finishes, the
> impacted controller transitions from RECOVERING state to RESETTING state.
> Reset codepath takes care of queues teardown and inflight requests
> cancellation.
Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
or QUIESCING?
>
> Note, there is no transition from RECOVERING to RESETTING added to
> nvme_change_ctrl_state(). The reason is that user should not be allowed
> to reset or delete a controller that is being recovered.
>
> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> about to schedule delayed work for time based recovery.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/core.c | 10 ++++++++++
> drivers/nvme/host/nvme.h | 2 ++
> drivers/nvme/host/sysfs.c | 1 +
> 3 files changed, 13 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index aa007a7b9606..f5b84bc327d3 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> break;
> }
> break;
> + case NVME_CTRL_RECOVERING:
> + switch (old_state) {
> + case NVME_CTRL_LIVE:
> + changed = true;
> + fallthrough;
> + default:
> + break;
> + }
> + break;
That is a strange transition...
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-25 13:29 ` Sagi Grimberg
@ 2025-12-25 17:17 ` Mohamed Khalfella
2025-12-27 9:52 ` Sagi Grimberg
2025-12-27 9:55 ` Sagi Grimberg
0 siblings, 2 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-25 17:17 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > Add NVME_CTRL_RECOVERING as a new controller state to be used when
> > impacted controller is being recovered. A LIVE controller enters
> > RECOVERING state when an IO error is encountered. While recovering
> > inflight IOs will not be canceled if they timeout. These IOs will be
> > canceled after recovery finishes. Also, while recovering a controller
> > can not be reset or deleted. This is intentional because reset or delete
> > will result in canceling inflight IOs. When recovery finishes, the
> > impacted controller transitions from RECOVERING state to RESETTING state.
> > Reset codepath takes care of queues teardown and inflight requests
> > cancellation.
>
> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> or QUIESCING?
Naming is hard. QUIESCING sounds better, I will rename it to
QUIESCING.
>
> >
> > Note, there is no transition from RECOVERING to RESETTING added to
> > nvme_change_ctrl_state(). The reason is that user should not be allowed
> > to reset or delete a controller that is being recovered.
> >
> > Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> > about to schedule delayed work for time based recovery.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/core.c | 10 ++++++++++
> > drivers/nvme/host/nvme.h | 2 ++
> > drivers/nvme/host/sysfs.c | 1 +
> > 3 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index aa007a7b9606..f5b84bc327d3 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> > break;
> > }
> > break;
> > + case NVME_CTRL_RECOVERING:
> > + switch (old_state) {
> > + case NVME_CTRL_LIVE:
> > + changed = true;
> > + fallthrough;
> > + default:
> > + break;
> > + }
> > + break;
>
> That is a strange transition...
Why is it strange?
We transition to RECOVERING state only if controller is LIVE. This is
when we expect to have inflight user IOs to be quiesced by CCR. We do
not care about inflight requests in other states.
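The transition rule being discussed can be sketched as a small check (the enum values and function name below are illustrative stand-ins; the real logic lives in nvme_change_ctrl_state()):

```c
#include <assert.h>

/* Illustrative subset of controller states (not the kernel enum). */
enum ctrl_state { ST_LIVE, ST_RESETTING, ST_RECOVERING, ST_DELETING };

/* RECOVERING is reachable only from LIVE: recovery is only meaningful
 * while there are inflight user IOs to quiesce, and a controller in
 * RECOVERING must not be reset or deleted out from under recovery. */
static int can_enter_recovering(enum ctrl_state old_state)
{
	return old_state == ST_LIVE;
}
```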
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-12-25 13:14 ` Sagi Grimberg
@ 2025-12-25 17:33 ` Mohamed Khalfella
2025-12-27 9:39 ` Sagi Grimberg
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-25 17:33 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu 2025-12-25 15:14:31 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> > Reset) command is an nvme command that is issued to source controller by
> > initiator to reset impacted controller. Implement CCR command for linux
> > nvme target.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/target/admin-cmd.c | 79 +++++++++++++++++++++++++++++++++
> > drivers/nvme/target/core.c | 69 ++++++++++++++++++++++++++++
> > drivers/nvme/target/nvmet.h | 13 ++++++
> > include/linux/nvme.h | 23 ++++++++++
> > 4 files changed, 184 insertions(+)
> >
> > diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> > index aaceb697e4d2..a55ca010d34f 100644
> > --- a/drivers/nvme/target/admin-cmd.c
> > +++ b/drivers/nvme/target/admin-cmd.c
> > @@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
> > log->acs[nvme_admin_get_features] =
> > log->acs[nvme_admin_async_event] =
> > log->acs[nvme_admin_keep_alive] =
> > + log->acs[nvme_admin_cross_ctrl_reset] =
> > cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
> > +
> > }
> >
> > static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
> > @@ -1615,6 +1617,80 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
> > nvmet_req_complete(req, status);
> > }
> >
> > +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
> > +{
> > + struct nvmet_ctrl *ictrl, *ctrl = req->sq->ctrl;
> > + struct nvme_command *cmd = req->cmd;
> > + struct nvmet_ccr *ccr, *new_ccr;
> > + int ccr_active, ccr_total;
> > + u16 cntlid, status = 0;
> > +
> > + cntlid = le16_to_cpu(cmd->ccr.icid);
> > + if (ctrl->cntlid == cntlid) {
> > + req->error_loc =
> > + offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
> > + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> > + goto out;
> > + }
> > +
> > + ictrl = nvmet_ctrl_find_get_ccr(ctrl->subsys, ctrl->hostnqn,
>
> What does the 'i' stand for?
'i' stands for impacted controller. Also, if you see sctrl the 's'
stands for source controller. These terms are from TP8028.
>
> > + cmd->ccr.ciu, cntlid,
> > + le64_to_cpu(cmd->ccr.cirn));
> > + if (!ictrl) {
> > + /* Immediate Reset Successful */
> > + nvmet_set_result(req, 1);
> > + status = NVME_SC_SUCCESS;
> > + goto out;
> > + }
> > +
> > + new_ccr = kmalloc(sizeof(*ccr), GFP_KERNEL);
> > + if (!new_ccr) {
> > + status = NVME_SC_INTERNAL;
> > + goto out_put_ctrl;
> > + }
>
> Allocating this later when you actually use it would probably simplify
> error path.
Right, it will save us kfree(). Will do that.
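The reordering agreed on above, sketched in userspace (malloc stands in for kmalloc; all names are illustrative): running the limit checks before the allocation means the reject paths never need a matching free().

```c
#include <assert.h>
#include <stdlib.h>

struct ccr_entry { int icid; };

/* Counts allocations so a test can confirm the reject path allocates
 * nothing (a test hook, not part of the pattern itself). */
static int alloc_count;

static struct ccr_entry *ccr_entry_alloc(int icid)
{
	struct ccr_entry *e = malloc(sizeof(*e));

	if (e) {
		alloc_count++;
		e->icid = icid;
	}
	return e;
}

/*
 * Validate-then-allocate: all checks run before the allocation, so
 * rejecting a request needs no free() on the error path.
 */
static struct ccr_entry *ccr_add(int icid, int active, int limit)
{
	if (active >= limit)
		return NULL;	/* rejected before any allocation */

	return ccr_entry_alloc(icid);
}
```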
>
> > +
> > + ccr_total = ccr_active = 0;
> > + mutex_lock(&ctrl->lock);
> > + list_for_each_entry(ccr, &ctrl->ccrs, entry) {
> > + if (ccr->ctrl == ictrl) {
> > + status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
> > + goto out_unlock;
> > + }
> > +
> > + ccr_total++;
> > + if (ccr->ctrl)
> > + ccr_active++;
> > + }
> > +
> > + if (ccr_active >= NVMF_CCR_LIMIT) {
> > + status = NVME_SC_CCR_LIMIT_EXCEEDED;
> > + goto out_unlock;
> > + }
> > + if (ccr_total >= NVMF_CCR_PER_PAGE) {
> > + status = NVME_SC_CCR_LOGPAGE_FULL;
> > + goto out_unlock;
> > + }
> > +
> > + new_ccr->ciu = cmd->ccr.ciu;
> > + new_ccr->icid = cntlid;
> > + new_ccr->ctrl = ictrl;
> > + list_add_tail(&new_ccr->entry, &ctrl->ccrs);
> > + mutex_unlock(&ctrl->lock);
> > +
> > + nvmet_ctrl_fatal_error(ictrl);
>
> Don't you need to wait for it to complete?
> e.g. flush_work(&ictrl->fatal_err_work);
>
> Or is that done async? will need to look downstream...
No, we do not need to wait for ictrl->fatal_err_work to complete. An AEN
will be sent when ictrl exits. It is okay if AEN is sent before CCR
request is completed. The initiator should expect this behavior and deal
with it.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-12-25 13:23 ` Sagi Grimberg
@ 2025-12-25 18:13 ` Mohamed Khalfella
2025-12-27 9:48 ` Sagi Grimberg
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-25 18:13 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > Send an AEN to initiator when impacted controller exits. The
> > notification points to CCR log page that initiator can read to check
> > which CCR operation completed.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
> > drivers/nvme/target/nvmet.h | 3 ++-
> > include/linux/nvme.h | 3 +++
> > 3 files changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> > index 7dbe9255ff42..60173833c3eb 100644
> > --- a/drivers/nvme/target/core.c
> > +++ b/drivers/nvme/target/core.c
> > @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> > nvmet_async_events_process(ctrl);
> > }
> >
> > -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> > +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> > u8 event_info, u8 log_page)
> > {
> > struct nvmet_async_event *aen;
> > @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> > aen->event_info = event_info;
> > aen->log_page = log_page;
> >
> > - mutex_lock(&ctrl->lock);
> > list_add_tail(&aen->entry, &ctrl->async_events);
> > - mutex_unlock(&ctrl->lock);
> >
> > queue_work(nvmet_wq, &ctrl->async_event_work);
> > }
> > +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> > + u8 event_info, u8 log_page)
> > +{
> > + mutex_lock(&ctrl->lock);
> > + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> > + mutex_unlock(&ctrl->lock);
> > +}
> >
> > static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> > {
> > @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> > }
> > EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >
> > +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> > +{
> > + lockdep_assert_held(&ctrl->lock);
> > +
> > + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> > + return;
> > +
> > + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> > + NVME_AER_NOTICE_CCR_COMPLETED,
> > + NVME_LOG_CCR);
> > +}
> > +
> > static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> > {
> > struct nvmet_subsys *subsys = ctrl->subsys;
> > @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> > list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> > mutex_lock(&sctrl->lock);
> > list_for_each_entry(ccr, &sctrl->ccrs, entry) {
> > - if (ccr->ctrl == ctrl)
> > + if (ccr->ctrl == ctrl) {
> > + nvmet_ctrl_notify_ccr(sctrl);
> > ccr->ctrl = NULL;
> > + }
>
> Is this double loop necessary? Would you have more than one controller
> cross resetting the same
As it is implemented now CCRs are linked to sctrl. This decision can be
revisited if found suboptimal. At some point I had CCRs linked to
ctrl->subsys but that led to lock ordering issues. Double loop is
necessary to find all CCRs in all controllers and mark them done.
Yes, it is possible to have more than one sctrl resetting the same
ictrl.
> controller? Won't it be better to install a callback+opaque that the
> controller removal will call?
Can you elaborate more on that? Better in what terms?
nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
we know that ctrl->ref is zero and no new CCRs will be added to this
controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-12-25 17:33 ` Mohamed Khalfella
@ 2025-12-27 9:39 ` Sagi Grimberg
2025-12-31 21:35 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:39 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
>>> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
>>> +{
>>> + struct nvmet_ctrl *ictrl, *ctrl = req->sq->ctrl;
>>> + struct nvme_command *cmd = req->cmd;
>>> + struct nvmet_ccr *ccr, *new_ccr;
>>> + int ccr_active, ccr_total;
>>> + u16 cntlid, status = 0;
>>> +
>>> + cntlid = le16_to_cpu(cmd->ccr.icid);
>>> + if (ctrl->cntlid == cntlid) {
>>> + req->error_loc =
>>> + offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
>>> + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
>>> + goto out;
>>> + }
>>> +
>>> + ictrl = nvmet_ctrl_find_get_ccr(ctrl->subsys, ctrl->hostnqn,
>> What does the 'i' stand for?
> 'i' stands for impacted controller. Also, if you see sctrl the 's'
> stands for source controller. These terms are from TP8028.
Can you perhaps add a comment on this?
>>> + new_ccr->ciu = cmd->ccr.ciu;
>>> + new_ccr->icid = cntlid;
>>> + new_ccr->ctrl = ictrl;
>>> + list_add_tail(&new_ccr->entry, &ctrl->ccrs);
>>> + mutex_unlock(&ctrl->lock);
>>> +
>>> + nvmet_ctrl_fatal_error(ictrl);
>> Don't you need to wait for it to complete?
>> e.g. flush_work(&ictrl->fatal_err_work);
>>
>> Or is that done async? will need to look downstream...
> No, we do not need to wait for ictrl->fatal_err_work to complete. An AEN
> will be sent when ictrl exits. It is okay if AEN is sent before CCR
> request is completed. The initiator should expect this behavior and deal
> with it.
Yes, saw that in a later patch (didn't get to do a full review yet)
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-12-25 18:13 ` Mohamed Khalfella
@ 2025-12-27 9:48 ` Sagi Grimberg
2025-12-31 22:00 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:48 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 25/12/2025 20:13, Mohamed Khalfella wrote:
> On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> Send an AEN to initiator when impacted controller exits. The
>>> notification points to CCR log page that initiator can read to check
>>> which CCR operation completed.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
>>> drivers/nvme/target/nvmet.h | 3 ++-
>>> include/linux/nvme.h | 3 +++
>>> 3 files changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index 7dbe9255ff42..60173833c3eb 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
>>> nvmet_async_events_process(ctrl);
>>> }
>>>
>>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
>>> u8 event_info, u8 log_page)
>>> {
>>> struct nvmet_async_event *aen;
>>> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> aen->event_info = event_info;
>>> aen->log_page = log_page;
>>>
>>> - mutex_lock(&ctrl->lock);
>>> list_add_tail(&aen->entry, &ctrl->async_events);
>>> - mutex_unlock(&ctrl->lock);
>>>
>>> queue_work(nvmet_wq, &ctrl->async_event_work);
>>> }
>>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> + u8 event_info, u8 log_page)
>>> +{
>>> + mutex_lock(&ctrl->lock);
>>> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
>>> + mutex_unlock(&ctrl->lock);
>>> +}
>>>
>>> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
>>> {
>>> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>>> }
>>> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>>>
>>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
>>> +{
>>> + lockdep_assert_held(&ctrl->lock);
>>> +
>>> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
>>> + return;
>>> +
>>> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
>>> + NVME_AER_NOTICE_CCR_COMPLETED,
>>> + NVME_LOG_CCR);
>>> +}
>>> +
>>> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>>> {
>>> struct nvmet_subsys *subsys = ctrl->subsys;
>>> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>>> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
>>> mutex_lock(&sctrl->lock);
>>> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
>>> - if (ccr->ctrl == ctrl)
>>> + if (ccr->ctrl == ctrl) {
>>> + nvmet_ctrl_notify_ccr(sctrl);
>>> ccr->ctrl = NULL;
>>> + }
>> Is this double loop necessary? Would you have more than one controller
>> cross resetting the same
> As it is implemented now CCRs are linked to sctrl. This decision can be
> revisited if found suboptimal. At some point I had CCRs linked to
> ctrl->subsys but that led to lock ordering issues. Double loop is
> necessary to find all CCRs in all controllers and mark them done.
> Yes, it is possible to have more than one sctrl resetting the same
> ictrl.
I'm more interested in simplifying.
>
>> controller? Won't it be better to install a callback+opaque that the
>> controller removal will call?
> Can you elaborate more on that? Better in what terms?
>
> nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
> we know that ctrl->ref is zero and no new CCRs will be added to this
> controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
In nvmet, the controller is serving a single host. Hence I am not sure I
understand how multiple source controllers will try to reset the impacted
controller. So, if there is a 1-1 relationship between source and impacted
controller, I'd perhaps suggest simplifying: install a callback+opaque
(e.g. void *data) on the impacted controller instead of having it iterate,
and then actually send the AEN from the impacted controller.
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-25 17:17 ` Mohamed Khalfella
@ 2025-12-27 9:52 ` Sagi Grimberg
2025-12-31 22:45 ` Mohamed Khalfella
2025-12-27 9:55 ` Sagi Grimberg
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:52 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 25/12/2025 19:17, Mohamed Khalfella wrote:
> On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
>>> impacted controller is being recovered. A LIVE controller enters
>>> RECOVERING state when an IO error is encountered. While recovering
>>> inflight IOs will not be canceled if they timeout. These IOs will be
>>> canceled after recovery finishes. Also, while recovering a controller
>>> can not be reset or deleted. This is intentional because reset or delete
>>> will result in canceling inflight IOs. When recovery finishes, the
>>> impacted controller transitions from RECOVERING state to RESETTING state.
>>> Reset codepath takes care of queues teardown and inflight requests
>>> cancellation.
>> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
>> or QUIESCING?
> Naming is hard. QUIESCING sounds better, I will rename it to
> QUIESCING.
>
>>> Note, there is no transition from RECOVERING to RESETTING added to
>>> nvme_change_ctrl_state(). The reason is that user should not be allowed
>>> to reset or delete a controller that is being recovered.
>>>
>>> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
>>> about to schedule delayed work for time based recovery.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>> drivers/nvme/host/core.c | 10 ++++++++++
>>> drivers/nvme/host/nvme.h | 2 ++
>>> drivers/nvme/host/sysfs.c | 1 +
>>> 3 files changed, 13 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index aa007a7b9606..f5b84bc327d3 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>>> break;
>>> }
>>> break;
>>> + case NVME_CTRL_RECOVERING:
>>> + switch (old_state) {
>>> + case NVME_CTRL_LIVE:
>>> + changed = true;
>>> + fallthrough;
>>> + default:
>>> + break;
>>> + }
>>> + break;
>> That is a strange transition...
> Why is it strange?
>
> We transition to RECOVERING state only if controller is LIVE. This is
> when we expect to have inflight user IOs to be quiesced by CCR. We do
> not care about inflight requests in other states.
Sorry, got confused myself - I read it as the other way around...
I am missing RECOVERING -> RESETTING transition in this patch.
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-25 17:17 ` Mohamed Khalfella
2025-12-27 9:52 ` Sagi Grimberg
@ 2025-12-27 9:55 ` Sagi Grimberg
2025-12-31 22:36 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 9:55 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 25/12/2025 19:17, Mohamed Khalfella wrote:
> On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
>>> impacted controller is being recovered. A LIVE controller enters
>>> RECOVERING state when an IO error is encountered. While recovering
>>> inflight IOs will not be canceled if they timeout. These IOs will be
>>> canceled after recovery finishes. Also, while recovering a controller
>>> can not be reset or deleted. This is intentional because reset or delete
>>> will result in canceling inflight IOs. When recovery finishes, the
>>> impacted controller transitions from RECOVERING state to RESETTING state.
>>> Reset codepath takes care of queues teardown and inflight requests
>>> cancellation.
>> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
>> or QUIESCING?
> Naming is hard. QUIESCING sounds better, I will rename it to
> QUIESCING.
I actually think that FENCING is probably best to describe what the
state is used for...
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-11-26 2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2025-12-19 1:21 ` Randy Jennings
@ 2025-12-27 10:14 ` Sagi Grimberg
2025-12-31 0:04 ` Randy Jennings
2025-12-31 23:43 ` Mohamed Khalfella
1 sibling, 2 replies; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 10:14 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> A host that has more than one path connecting to an nvme subsystem
> typically has an nvme controller associated with every path. This is
> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> path should not be retried immediately on another path because this
> could lead to data corruption as described in TP4129. TP8028 defines
> cross-controller reset mechanism that can be used by host to terminate
> IOs on the failed path using one of the remaining healthy paths. Only
> after IOs are terminated, or long enough time passes as defined by
> TP4129, inflight IOs should be retried on another path. Implement core
> cross-controller reset shared logic to be used by the transports.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/constants.c | 1 +
> drivers/nvme/host/core.c | 133 ++++++++++++++++++++++++++++++++++
> drivers/nvme/host/nvme.h | 10 +++
> 3 files changed, 144 insertions(+)
>
> diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> index dc90df9e13a2..f679efd5110e 100644
> --- a/drivers/nvme/host/constants.c
> +++ b/drivers/nvme/host/constants.c
> @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> [nvme_admin_virtual_mgmt] = "Virtual Management",
> [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> [nvme_admin_format_nvm] = "Format NVM",
> [nvme_admin_security_send] = "Security Send",
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f5b84bc327d3..f38b70ca9cee 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -554,6 +554,138 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> }
> EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
>
> +static struct nvme_ctrl *nvme_find_ccr_ctrl(struct nvme_ctrl *ictrl,
> + u32 min_cntlid)
> +{
> + struct nvme_subsystem *subsys = ictrl->subsys;
> + struct nvme_ctrl *sctrl;
> + unsigned long flags;
> +
> + mutex_lock(&nvme_subsystems_lock);
This looks like the wrong lock to take here?
> + list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> + if (sctrl->cntlid < min_cntlid)
> + continue;
The use of min_cntlid is not clear to me.
> +
> + if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
> + continue;
> +
> + spin_lock_irqsave(&sctrl->lock, flags);
> + if (sctrl->state != NVME_CTRL_LIVE) {
> + spin_unlock_irqrestore(&sctrl->lock, flags);
> + atomic_inc(&sctrl->ccr_limit);
> + continue;
> + }
> +
> + /*
> + * We got a good candidate source controller that is locked and
> + * LIVE. However, no guarantee sctrl will not be deleted after
> + * sctrl->lock is released. Get a ref of both sctrl and admin_q
> + * so they do not disappear until we are done with them.
> + */
> + WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
> + nvme_get_ctrl(sctrl);
> + spin_unlock_irqrestore(&sctrl->lock, flags);
> + goto found;
> + }
> + sctrl = NULL;
> +found:
> + mutex_unlock(&nvme_subsystems_lock);
> + return sctrl;
> +}
> +
> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> +{
> + unsigned long flags, tmo, remain;
> + struct nvme_ccr_entry ccr = { };
> + union nvme_result res = { 0 };
> + struct nvme_command c = { };
> + u32 result;
> + int ret = 0;
> +
> + init_completion(&ccr.complete);
> + ccr.ictrl = ictrl;
> +
> + spin_lock_irqsave(&sctrl->lock, flags);
> + list_add_tail(&ccr.list, &sctrl->ccrs);
> + spin_unlock_irqrestore(&sctrl->lock, flags);
> +
> + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> + c.ccr.ciu = ictrl->ciu;
> + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> + NULL, 0, NVME_QID_ANY, 0);
> + if (ret)
> + goto out;
> +
> + result = le32_to_cpu(res.u32);
> + if (result & 0x01) /* Immediate Reset */
> + goto out;
> +
> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> + remain = wait_for_completion_timeout(&ccr.complete, tmo);
> + if (!remain)
I think remain is redundant here.
> + ret = -EAGAIN;
> +out:
> + spin_lock_irqsave(&sctrl->lock, flags);
> + list_del(&ccr.list);
> + spin_unlock_irqrestore(&sctrl->lock, flags);
> + return ccr.ccrs == 1 ? 0 : ret;
Why would you still return 0 and not -EAGAIN? You expired on timeout but
still return success if you have ccrs=1? BTW, you have ccrs both in the
ccr struct and as a list in the controller. Let's rename to distinguish
the two.
> +}
> +
> +unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
> +{
I'd call it nvme_fence_controller()
> + unsigned long deadline, now, timeout;
> + struct nvme_ctrl *sctrl;
> + u32 min_cntlid = 0;
> + int ret;
> +
> + timeout = nvme_recovery_timeout_ms(ictrl);
> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> +
> + now = jiffies;
> + deadline = now + msecs_to_jiffies(timeout);
> + while (time_before(now, deadline)) {
> + sctrl = nvme_find_ccr_ctrl(ictrl, min_cntlid);
> + if (!sctrl) {
> + /* CCR failed, switch to time-based recovery */
> + return deadline - now;
It is not clear what the return-code semantics of this function are.
How about making it success/failure and have the caller choose what to do?
> + }
> +
> + ret = nvme_issue_wait_ccr(sctrl, ictrl);
> + atomic_inc(&sctrl->ccr_limit);
inc after you wait for the ccr? shouldn't this be before?
> +
> + if (!ret) {
> + dev_info(ictrl->device, "CCR succeeded using %s\n",
> + dev_name(sctrl->device));
> + blk_put_queue(sctrl->admin_q);
> + nvme_put_ctrl(sctrl);
> + return 0;
> + }
> +
> + /* Try another controller */
> + min_cntlid = sctrl->cntlid + 1;
OK, I see why min_cntlid is used. That is very non-intuitive.
I'm wondering if it would be simpler to take one shot at CCR and, if it
fails, fall back to CRT. I mean, if the sctrl is alive and it was unable
to reset the ictrl in time, how would another ctrl do a better job here?
> + blk_put_queue(sctrl->admin_q);
> + nvme_put_ctrl(sctrl);
> + now = jiffies;
> + }
> +
> + dev_info(ictrl->device, "CCR reached timeout, call it done\n");
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(nvme_recover_ctrl);
> +
> +void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + WRITE_ONCE(ctrl->state, NVME_CTRL_RESETTING);
This needs to be a proper state transition.
* Re: [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2025-11-26 2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
2025-12-19 1:31 ` Randy Jennings
@ 2025-12-27 10:24 ` Sagi Grimberg
2025-12-31 23:51 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 10:24 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
> + log = kmalloc(sizeof(*log), GFP_KERNEL);
> + if (!log)
> + return;
> +
> + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> + 0x00, log, sizeof(*log), 0);
> + if (ret)
> + goto out;
> +
> + spin_lock_irqsave(&ctrl->lock, flags);
> + for (i = 0; i < le16_to_cpu(log->ne); i++) {
> + entry = &log->entries[i];
> + if (entry->ccrs == 0) /* skip in progress entries */
> + continue;
What does ccrs stand for?
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-11-26 2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-12-19 2:06 ` Randy Jennings
@ 2025-12-27 10:35 ` Sagi Grimberg
2025-12-31 0:13 ` Randy Jennings
2026-01-01 0:27 ` Mohamed Khalfella
1 sibling, 2 replies; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 10:35 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 26/11/2025 4:11, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now will move to RECOVERING
> state instead of RESETTING state. In RECOVERING state ctrl->err_work
> will attempt to use cross-controller recovery to terminate inflight IOs
> on the controller. If CCR succeeds, then switch to RESETTING state and
> continue error recovery as usual by tearing down controller and attempt
> reconnecting to target. If CCR fails, then the behavior of recovery
> depends on whether CQT is supported or not. If CQT is supported, switch
> to time-based recovery by holding inflight IOs until it is safe for them
> to be retried. If CQT is not supported proceed to retry requests
> immediately, as the code currently does.
>
> To support implementing time-based recovery turn ctrl->err_work into
> delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> while controller in RECOVERING state.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
> drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> 1 file changed, 45 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 9a96df1a511c..ec9a713490a9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> struct sockaddr_storage src_addr;
> struct nvme_ctrl ctrl;
>
> - struct work_struct err_work;
> + struct delayed_work err_work;
> struct delayed_work connect_work;
> struct nvme_tcp_request async_req;
> u32 io_queues[HCTX_MAX_TYPES];
> @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
This warrants an explanation. It is not clear at all why we should allow
two different transitions for error recovery to start...
> return;
>
> dev_warn(ctrl->device, "starting error recovery\n");
> - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> }
>
> static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> nvme_tcp_reconnect_or_remove(ctrl, ret);
> }
>
> +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> +{
> + unsigned long rem;
> +
> + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> + dev_info(ctrl->device, "completed time-based recovery\n");
> + goto done;
> + }
This is also not clear: why should we get here when NVME_CTRL_RECOVERED
is set?
> +
> + rem = nvme_recover_ctrl(ctrl);
> + if (!rem)
> + goto done;
> +
> + if (!ctrl->cqt) {
> + dev_info(ctrl->device,
> + "CCR failed, CQT not supported, skip time-based recovery\n");
> + goto done;
> + }
> +
> + dev_info(ctrl->device,
> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> + jiffies_to_msecs(rem));
> + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> + return -EAGAIN;
I don't think that reusing the same work to handle two completely
different things is the right approach here.
How about splitting into fence_work and err_work? That should eliminate
some of the ctrl state inspections and simplify error recovery.
> +
> +done:
> + nvme_end_ctrl_recovery(ctrl);
> + return 0;
> +}
> +
> static void nvme_tcp_error_recovery_work(struct work_struct *work)
> {
> - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> + if (nvme_tcp_recover_ctrl(ctrl))
> + return;
> + }
> +
Yea, I think we want to rework the current design.
* Re: [RFC PATCH 11/14] nvme-rdma: Use CCR to recover controller that hits an error
2025-11-26 2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
2025-12-19 2:16 ` Randy Jennings
@ 2025-12-27 10:36 ` Sagi Grimberg
1 sibling, 0 replies; 68+ messages in thread
From: Sagi Grimberg @ 2025-12-27 10:36 UTC (permalink / raw)
To: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch
Cc: Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
Same comments from nvme-tcp...
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-12-27 10:14 ` Sagi Grimberg
@ 2025-12-31 0:04 ` Randy Jennings
2026-01-04 21:14 ` Sagi Grimberg
2025-12-31 23:43 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-31 0:04 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
> > +
> > + if (!ret) {
> > + dev_info(ictrl->device, "CCR succeeded using %s\n",
> > + dev_name(sctrl->device));
> > + blk_put_queue(sctrl->admin_q);
> > + nvme_put_ctrl(sctrl);
> > + return 0;
> > + }
> > +
> > + /* Try another controller */
> > + min_cntlid = sctrl->cntlid + 1;
>
> OK, I see why min_cntlid is used. That is very non-intuitive.
>
> I'm wandering if it will be simpler to take one-shot at ccr and
> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
> unable
> to reset the ictrl in time, how would another ctrl do a better job here?
There are many different kinds of failures we are dealing with here
that result in a dropped connection (association). It could be a problem
with the specific link, or it could be that the node of an HA pair in the
storage array went down. In the case of a specific link problem, maybe
only one of the connections is down and any controller would work.
In the case of the node of an HA pair, roughly half of the connections
are going down, and there is a race over which controllers are detected
down first. There were some heuristics put into the
spec about deciding which controller to use, but that is more code
and a refinement that could come later (and they are still heuristics;
they may not be helpful).
Because CCR offers a significant win by shortening the recovery time
substantially, it is worth retrying on the other controllers. This time
affects when we can start retrying IO. KATO is in seconds, and
NVMe-oF should be capable of doing a significant number of
IOs in each of those seconds.
Besides, the alternative is just to wait. Might as well be actively trying
to shorten that wait time. Beyond a small increase in code complexity,
is there a downside to doing so?
Sincerely,
Randy Jennings
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-12-27 10:35 ` Sagi Grimberg
@ 2025-12-31 0:13 ` Randy Jennings
2026-01-04 21:19 ` Sagi Grimberg
2026-01-01 0:27 ` Mohamed Khalfella
1 sibling, 1 reply; 68+ messages in thread
From: Randy Jennings @ 2025-12-31 0:13 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
On Sat, Dec 27, 2025 at 2:35 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
...
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > + return -EAGAIN;
>
> I don't think that reusing the same work to handle two completely
> different things
> is the right approach here.
>
> How about splitting to fence_work and err_work? That should eliminate
> some of the
> ctrl state inspections and simplify error recovery.
If the work was independent and could happen separately (probably
in parallel), I could understand having separate work structures. But they
are not independent, and they have a definite relationship. Like Mohamed,
I thought of them as different stages of the same work. Having an extra
work item takes up more space (I would be concerned about scalability to
thousands or 10s of thousands of associations and then go one order of
magnitude higher for margin), and it also causes a connection object
(referenced during IO) to take up more cache lines. Is it worth taking up
that space, when the separate work items would be different, dependent
stages in the same process?
Sincerely,
Randy Jennings
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-12-16 3:01 ` Randy Jennings
@ 2025-12-31 21:14 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 21:14 UTC (permalink / raw)
To: Randy Jennings
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Mon 2025-12-15 19:01:30 -0800, Randy Jennings wrote:
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> > Reset) command is an nvme command the is issued to source controller by
> > initiator to reset impacted controller. Implement CCR command for linux
> > nvme target.
> Remove extraneous "the is" in second line.
Removed.
>
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>
> Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 03/14] nvmet: Implement CCR nvme command
2025-12-27 9:39 ` Sagi Grimberg
@ 2025-12-31 21:35 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 21:35 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 11:39:55 +0200, Sagi Grimberg wrote:
>
> >>> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
> >>> +{
> >>> + struct nvmet_ctrl *ictrl, *ctrl = req->sq->ctrl;
> >>> + struct nvme_command *cmd = req->cmd;
> >>> + struct nvmet_ccr *ccr, *new_ccr;
> >>> + int ccr_active, ccr_total;
> >>> + u16 cntlid, status = 0;
> >>> +
> >>> + cntlid = le16_to_cpu(cmd->ccr.icid);
> >>> + if (ctrl->cntlid == cntlid) {
> >>> + req->error_loc =
> >>> + offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
> >>> + status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + ictrl = nvmet_ctrl_find_get_ccr(ctrl->subsys, ctrl->hostnqn,
> >> What does the 'i' stand for?
> > 'i' stands for impacted controller. Also, if you see sctrl the 's'
> > stands for source controller. These terms are from TP8028.
>
> Can you perhaps add a comment on this?
Okay, will do that.
>
> >>> + new_ccr->ciu = cmd->ccr.ciu;
> >>> + new_ccr->icid = cntlid;
> >>> + new_ccr->ctrl = ictrl;
> >>> + list_add_tail(&new_ccr->entry, &ctrl->ccrs);
> >>> + mutex_unlock(&ctrl->lock);
> >>> +
> >>> + nvmet_ctrl_fatal_error(ictrl);
> >> Don't you need to wait for it to complete?
> >> e.g. flush_work(&ictrl->fatal_err_work);
> >>
> >> Or is that done async? will need to look downstream...
> > No, we do not need to wait for ictrl->fatal_err_work to complete. An AEN
> > will be sent when ictrl exits. It is okay if AEN is sent before CCR
> > request is completed. The initiator should expect this behavior and deal
> > with it.
>
> Yes, saw that in a later patch (didn't get to do a full review yet)
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-12-27 9:48 ` Sagi Grimberg
@ 2025-12-31 22:00 ` Mohamed Khalfella
2026-01-04 21:09 ` Sagi Grimberg
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 22:00 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 11:48:49 +0200, Sagi Grimberg wrote:
>
>
> On 25/12/2025 20:13, Mohamed Khalfella wrote:
> > On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> Send an AEN to initiator when impacted controller exists. The
> >>> notification points to CCR log page that initiator can read to check
> >>> which CCR operation completed.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
> >>> drivers/nvme/target/nvmet.h | 3 ++-
> >>> include/linux/nvme.h | 3 +++
> >>> 3 files changed, 28 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> >>> index 7dbe9255ff42..60173833c3eb 100644
> >>> --- a/drivers/nvme/target/core.c
> >>> +++ b/drivers/nvme/target/core.c
> >>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> >>> nvmet_async_events_process(ctrl);
> >>> }
> >>>
> >>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> >>> u8 event_info, u8 log_page)
> >>> {
> >>> struct nvmet_async_event *aen;
> >>> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>> aen->event_info = event_info;
> >>> aen->log_page = log_page;
> >>>
> >>> - mutex_lock(&ctrl->lock);
> >>> list_add_tail(&aen->entry, &ctrl->async_events);
> >>> - mutex_unlock(&ctrl->lock);
> >>>
> >>> queue_work(nvmet_wq, &ctrl->async_event_work);
> >>> }
> >>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>> + u8 event_info, u8 log_page)
> >>> +{
> >>> + mutex_lock(&ctrl->lock);
> >>> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> >>> + mutex_unlock(&ctrl->lock);
> >>> +}
> >>>
> >>> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> >>> {
> >>> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >>> }
> >>> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >>>
> >>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> >>> +{
> >>> + lockdep_assert_held(&ctrl->lock);
> >>> +
> >>> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> >>> + return;
> >>> +
> >>> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> >>> + NVME_AER_NOTICE_CCR_COMPLETED,
> >>> + NVME_LOG_CCR);
> >>> +}
> >>> +
> >>> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>> {
> >>> struct nvmet_subsys *subsys = ctrl->subsys;
> >>> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> >>> mutex_lock(&sctrl->lock);
> >>> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
> >>> - if (ccr->ctrl == ctrl)
> >>> + if (ccr->ctrl == ctrl) {
> >>> + nvmet_ctrl_notify_ccr(sctrl);
> >>> ccr->ctrl = NULL;
> >>> + }
> >> Is this double loop necessary? Would you have more than one controller
> >> cross resetting the same
> > As it is implemented now CCRs are linked to sctrl. This decision can be
> > revisited if found suboptimal. At some point I had CCRs linked to
> > ctrl->subsys but that led to lock ordering issues. Double loop is
> > necessary to find all CCRs in all controllers and mark them done.
> > Yes, it is possible to have more than one sctrl resetting the same
> > ictrl.
>
> I'm more interested in simplifying.
>
> >
> >> controller? Won't it be better to install a callback+opaque that the
> >> controller removal will call?
> > Can you elaborate more on that? Better in what terms?
> >
> > nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
> > we know that ctrl->ref is zero and no new CCRs will be added to this
> > controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
>
> In nvmet, the controller is serving a single host. Hence I am not sure I
> understand how multiple source controllers will try to reset the impacted
> controller. So, if there is a 1-1 relationship between source and impacted
> controller, I'd perhaps suggest to simplify and install on the impacted
> controller
> callback+opaque (e.g. void *data) instead of having it iterate and then
> actually send
> the AEN from the impacted controller.
A controller is serving a single path for a given host. A host that is
connected to nvme subsystem via multiple paths will have more than one
controller. I can think of two reasons why we need to support resetting
an impacted controller from multiple source controllers.
- It is possible for multiple paths to go down at the same time. The
first source controller we use for CCR, even though we check that it is
LIVE, might have lost its connection to the subsystem. It is only a
matter of time before it hits a keepalive timeout and fails too. If CCR
fails using this controller we should not give up; we need to try other
paths.
- Some nvme subsystems might support resetting an impacted controller
from only a subset of the controllers connected to the host. An array
that has multiple frontend engines might not support resetting
controllers across engines. In fact, TP8028 allows the subsystem to
suggest that the host use another source controller via the Alternate
Controller ID (ACID) field on the CCR log page (not implemented in this
patchset).
* Re: [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
2025-12-18 15:22 ` Randy Jennings
@ 2025-12-31 22:26 ` Mohamed Khalfella
2026-01-02 19:06 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 22:26 UTC (permalink / raw)
To: Randy Jennings
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu 2025-12-18 07:22:41 -0800, Randy Jennings wrote:
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > TP2028 Rapid path failure added new fileds to controller identify
> TP8028
Fixed.
> > response. Read CIU (Controller Instance Uniquifier), CIRN (Controller
> > Instance Random Number), and CCRL (Cross-Controller Reset Limit) from
> > controller identify response. Expose CIU and CIRN as sysfs attributes
> > so the values can be used directrly by user if needed.
> >
> > TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> > Time) which is used along with KATO (Keep Alive Timeout) to set an upper
> > limite for attempting Cross-Controller Recovery.
> "limite" -> "limit"
Fixed.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/core.c | 5 +++++
> > drivers/nvme/host/nvme.h | 11 +++++++++++
> > drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
> > 3 files changed, 39 insertions(+)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index fa4181d7de73..aa007a7b9606 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3572,12 +3572,17 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
> > ctrl->crdt[1] = le16_to_cpu(id->crdt2);
> > ctrl->crdt[2] = le16_to_cpu(id->crdt3);
> >
> > + ctrl->ciu = id->ciu;
> > + ctrl->cirn = le64_to_cpu(id->cirn);
> > + atomic_set(&ctrl->ccr_limit, id->ccrl);
> Seems like it would be good for the target & init to use the same
> name for these fields. I have a preference for these over
> instance_uniquifier and random because they are more concise, but
> the preference is not strong.
The field names in the spec are concise, but they are also cryptic.
>
> > +
> > ctrl->oacs = le16_to_cpu(id->oacs);
> > ctrl->oncs = le16_to_cpu(id->oncs);
> > ctrl->mtfa = le16_to_cpu(id->mtfa);
> > ctrl->oaes = le32_to_cpu(id->oaes);
> > ctrl->wctemp = le16_to_cpu(id->wctemp);
> > ctrl->cctemp = le16_to_cpu(id->cctemp);
> > + ctrl->cqt = le16_to_cpu(id->cqt);
> >
> > atomic_set(&ctrl->abort_limit, id->acl + 1);
> > ctrl->vwc = id->vwc;
> I cannot discern an ordering to the attributes set here. Any
> particular reason, you placed cqt away from the others you added?
No reason. Moved ctrl->cqt initialization up with other fields.
>
> > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > index 29430949ce2f..ae36249ad61e 100644
> > --- a/drivers/nvme/host/sysfs.c
> > +++ b/drivers/nvme/host/sysfs.c
> > @@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
> > nvme_show_int_function(sqsize);
> > nvme_show_int_function(kato);
> >
> > +static ssize_t nvme_sysfs_uniquifier_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > +
> > + return sysfs_emit(buf, "%02x\n", ctrl->ciu);
> > +}
> > +static DEVICE_ATTR(uniquifier, S_IRUGO, nvme_sysfs_uniquifier_show, NULL);
> > +
> > +static ssize_t nvme_sysfs_random_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *buf)
> > +{
> > + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > +
> > + return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
> > +}
> > +static DEVICE_ATTR(random, S_IRUGO, nvme_sysfs_random_show, NULL);
> > +
> > +
> > static ssize_t nvme_sysfs_delete(struct device *dev,
> > struct device_attribute *attr, const char *buf,
> > size_t count)
> > @@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
> > &dev_attr_numa_node.attr,
> > &dev_attr_queue_count.attr,
> > &dev_attr_sqsize.attr,
> > + &dev_attr_uniquifier.attr,
> > + &dev_attr_random.attr,
> > &dev_attr_hostnqn.attr,
> > &dev_attr_hostid.attr,
> > &dev_attr_ctrl_loss_tmo.attr,
> > --
> > 2.51.2
> >
>
> These are the names used in the target code (uniquifer & random.
> I'd rather have them match (identify structure will have spec's
> abbreviations; ctrl & debug/sysfs for target & initiator either be
> ciu/cirn or uniquifer/random.
I think it matters for sysfs attributes. I do not know the right thing
to do. Should we use spec names like "cirn" or call it "random"?
>
> But this is small stuff.
>
> Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-27 9:55 ` Sagi Grimberg
@ 2025-12-31 22:36 ` Mohamed Khalfella
2025-12-31 23:04 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 22:36 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 11:55:01 +0200, Sagi Grimberg wrote:
>
>
> On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> >>> impacted controller is being recovered. A LIVE controller enters
> >>> RECOVERING state when an IO error is encountered. While recovering
> >>> inflight IOs will not be canceled if they timeout. These IOs will be
> >>> canceled after recovery finishes. Also, while recovering a controller
> >>> can not be reset or deleted. This is intentional because reset or delete
> >>> will result in canceling inflight IOs. When recovery finishes, the
> >>> impacted controller transitions from RECOVERING state to RESETTING state.
> >>> Reset codepath takes care of queues teardown and inflight requests
> >>> cancellation.
> >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> >> or QUIESCING?
> > Naming is hard. QUIESCING sounds better, I will renaming it to
> > QUIESCING.
>
> I actually think that FENCING is probably best to describe what the
> state is used for...
FENCING is used in HA clusters with persistent reservations. I find it
confusing to use it here. Let me know if you have strong preference.
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-27 9:52 ` Sagi Grimberg
@ 2025-12-31 22:45 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 22:45 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 11:52:31 +0200, Sagi Grimberg wrote:
>
>
> On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> >>> impacted controller is being recovered. A LIVE controller enters
> >>> RECOVERING state when an IO error is encountered. While recovering
> >>> inflight IOs will not be canceled if they timeout. These IOs will be
> >>> canceled after recovery finishes. Also, while recovering a controller
> >>> can not be reset or deleted. This is intentional because reset or delete
> >>> will result in canceling inflight IOs. When recovery finishes, the
> >>> impacted controller transitions from RECOVERING state to RESETTING state.
> >>> Reset codepath takes care of queues teardown and inflight requests
> >>> cancellation.
> >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> >> or QUIESCING?
> > Naming is hard. QUIESCING sounds better, I will renaming it to
> > QUIESCING.
> >
> >>> Note, there is no transition from RECOVERING to RESETTING added to
> >>> nvme_change_ctrl_state(). The reason is that user should not be allowed
> >>> to reset or delete a controller that is being recovered.
> >>>
> >>> Add NVME_CTRL_RECOVERED controller flag. This flag is set on a controller
> >>> about to schedule delayed work for time based recovery.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>> drivers/nvme/host/core.c | 10 ++++++++++
> >>> drivers/nvme/host/nvme.h | 2 ++
> >>> drivers/nvme/host/sysfs.c | 1 +
> >>> 3 files changed, 13 insertions(+)
> >>>
> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>> index aa007a7b9606..f5b84bc327d3 100644
> >>> --- a/drivers/nvme/host/core.c
> >>> +++ b/drivers/nvme/host/core.c
> >>> @@ -574,6 +574,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >>> break;
> >>> }
> >>> break;
> >>> + case NVME_CTRL_RECOVERING:
> >>> + switch (old_state) {
> >>> + case NVME_CTRL_LIVE:
> >>> + changed = true;
> >>> + fallthrough;
> >>> + default:
> >>> + break;
> >>> + }
> >>> + break;
> >> That is a strange transition...
> > Why is it strange?
> >
> > We transition to RECOVERING state only if controller is LIVE. This is
> > when we expect to have inflight user IOs to be quiesced by CCR. We do
> > not care about inflight requests in other states.
>
> Sorry, got confused myself - I read it as the other way around...
> I am missing RECOVERING -> RESETTING transition in this patch.
This is in patch 8 ("nvme: Implement cross-controller reset recovery").
It was not added to nvme_change_ctrl_state() because we do not want the
controller to be reset while in RECOVERING state.
* Re: [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state
2025-12-31 22:36 ` Mohamed Khalfella
@ 2025-12-31 23:04 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 23:04 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Wed 2025-12-31 14:36:53 -0800, Mohamed Khalfella wrote:
> On Sat 2025-12-27 11:55:01 +0200, Sagi Grimberg wrote:
> >
> >
> > On 25/12/2025 19:17, Mohamed Khalfella wrote:
> > > On Thu 2025-12-25 15:29:52 +0200, Sagi Grimberg wrote:
> > >>
> > >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > >>> Add NVME_CTRL_RECOVERING as a new controller state to be used when
> > >>> impacted controller is being recovered. A LIVE controller enters
> > >>> RECOVERING state when an IO error is encountered. While recovering
> > >>> inflight IOs will not be canceled if they timeout. These IOs will be
> > >>> canceled after recovery finishes. Also, while recovering a controller
> > >>> can not be reset or deleted. This is intentional because reset or delete
> > >>> will result in canceling inflight IOs. When recovery finishes, the
> > >>> impacted controller transitions from RECOVERING state to RESETTING state.
> > >>> Reset codepath takes care of queues teardown and inflight requests
> > >>> cancellation.
> > >> Is RECOVERING really capturing the nature of this state? Maybe RESETTLING?
> > >> or QUIESCING?
> > > Naming is hard. QUIESCING sounds better, I will renaming it to
> > > QUIESCING.
> >
> > I actually think that FENCING is probably best to describe what the
> > state is used for...
>
> FENCING is used in HA clusters with persistent reservations. I find it
> confusing to use it here. Let me know if you have strong preference.
Never mind, I will rename it to FENCING.
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-12-27 10:14 ` Sagi Grimberg
2025-12-31 0:04 ` Randy Jennings
@ 2025-12-31 23:43 ` Mohamed Khalfella
2026-01-04 21:39 ` Sagi Grimberg
1 sibling, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 23:43 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 12:14:11 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > A host that has more than one path connecting to an nvme subsystem
> > typically has an nvme controller associated with every path. This is
> > mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> > path should not be retried immediately on another path because this
> > could lead to data corruption as described in TP4129. TP8028 defines
> > cross-controller reset mechanism that can be used by host to terminate
> > IOs on the failed path using one of the remaining healthy paths. Only
> > after IOs are terminated, or long enough time passes as defined by
> > TP4129, inflight IOs should be retried on another path. Implement core
> > cross-controller reset shared logic to be used by the transports.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/constants.c | 1 +
> > drivers/nvme/host/core.c | 133 ++++++++++++++++++++++++++++++++++
> > drivers/nvme/host/nvme.h | 10 +++
> > 3 files changed, 144 insertions(+)
> >
> > diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> > index dc90df9e13a2..f679efd5110e 100644
> > --- a/drivers/nvme/host/constants.c
> > +++ b/drivers/nvme/host/constants.c
> > @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> > [nvme_admin_virtual_mgmt] = "Virtual Management",
> > [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> > [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> > + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> > [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> > [nvme_admin_format_nvm] = "Format NVM",
> > [nvme_admin_security_send] = "Security Send",
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index f5b84bc327d3..f38b70ca9cee 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -554,6 +554,138 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> > }
> > EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
> >
> > +static struct nvme_ctrl *nvme_find_ccr_ctrl(struct nvme_ctrl *ictrl,
> > + u32 min_cntlid)
> > +{
> > + struct nvme_subsystem *subsys = ictrl->subsys;
> > + struct nvme_ctrl *sctrl;
> > + unsigned long flags;
> > +
> > + mutex_lock(&nvme_subsystems_lock);
>
> This looks like the wrong lock to take here?
This is similar to nvme_validate_cntlid()?
What is the correct lock to use?
>
> > + list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> > + if (sctrl->cntlid < min_cntlid)
> > + continue;
>
> The use of min_cntlid is not clear to me.
>
> > +
> > + if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
> > + continue;
> > +
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + if (sctrl->state != NVME_CTRL_LIVE) {
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > + atomic_inc(&sctrl->ccr_limit);
> > + continue;
> > + }
> > +
> > + /*
> > + * We got a good candidate source controller that is locked and
> > + * LIVE. However, no guarantee sctrl will not be deleted after
> > + * sctrl->lock is released. Get a ref of both sctrl and admin_q
> > + * so they do not disappear until we are done with them.
> > + */
> > + WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
> > + nvme_get_ctrl(sctrl);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > + goto found;
> > + }
> > + sctrl = NULL;
> > +found:
> > + mutex_unlock(&nvme_subsystems_lock);
> > + return sctrl;
> > +}
> > +
> > +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > +{
> > + unsigned long flags, tmo, remain;
> > + struct nvme_ccr_entry ccr = { };
> > + union nvme_result res = { 0 };
> > + struct nvme_command c = { };
> > + u32 result;
> > + int ret = 0;
> > +
> > + init_completion(&ccr.complete);
> > + ccr.ictrl = ictrl;
> > +
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + list_add_tail(&ccr.list, &sctrl->ccrs);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > +
> > + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> > + c.ccr.ciu = ictrl->ciu;
> > + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> > + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> > + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> > + NULL, 0, NVME_QID_ANY, 0);
> > + if (ret)
> > + goto out;
> > +
> > + result = le32_to_cpu(res.u32);
> > + if (result & 0x01) /* Immediate Reset */
> > + goto out;
> > +
> > + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> > + remain = wait_for_completion_timeout(&ccr.complete, tmo);
> > + if (!remain)
>
> I think remain is redundant here.
Deleted 'remain'.
>
> > + ret = -EAGAIN;
> > +out:
> > + spin_lock_irqsave(&sctrl->lock, flags);
> > + list_del(&ccr.list);
> > + spin_unlock_irqrestore(&sctrl->lock, flags);
> > + return ccr.ccrs == 1 ? 0 : ret;
>
> Why would you still return 0 and not EAGAIN? you expired on timeout but
> still
> return success if you have ccrs=1? btw you have ccrs in the ccr struct
> and in the controller
> as a list. Lets rename to distinguish the two.
True, we did hit the timeout here. However, after removing the ccr
entry we found it was marked as completed. We return success in this
case even though we hit the timeout.
Renamed ctrl->ccrs to ctrl->ccr_list.
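The timeout-versus-completion behavior discussed here can be modeled in
a few lines of plain C (a userspace sketch, not kernel code;
`ccr_wait_ms` and `ccr_final_status` are illustrative names, not from
the patchset):

```c
#include <assert.h>

/*
 * The CCR wait bound from the quoted code is the larger of CQT
 * (already in milliseconds) and KATO (seconds, hence * 1000).
 */
static unsigned int ccr_wait_ms(unsigned int cqt_ms, unsigned int kato_sec)
{
	unsigned int kato_ms = kato_sec * 1000;

	return cqt_ms > kato_ms ? cqt_ms : kato_ms;
}

/*
 * A ccr entry found completed under the lock wins over a wait
 * timeout: the timed-out error code is replaced with success.
 */
static int ccr_final_status(int wait_ret, int entry_completed)
{
	return entry_completed ? 0 : wait_ret;
}
```

Under this model a wait that expired with -EAGAIN still reports success
if the entry completed in the meantime, matching the reply above.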
>
> > +}
> > +
> > +unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
> > +{
>
> I'd call it nvme_fence_controller()
Okay. I will do that. I will also rename the controller state FENCING.
>
> > + unsigned long deadline, now, timeout;
> > + struct nvme_ctrl *sctrl;
> > + u32 min_cntlid = 0;
> > + int ret;
> > +
> > + timeout = nvme_recovery_timeout_ms(ictrl);
> > + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > +
> > + now = jiffies;
> > + deadline = now + msecs_to_jiffies(timeout);
> > + while (time_before(now, deadline)) {
> > + sctrl = nvme_find_ccr_ctrl(ictrl, min_cntlid);
> > + if (!sctrl) {
> > + /* CCR failed, switch to time-based recovery */
> > + return deadline - now;
>
> It is not clear what is the return code semantics of this function.
> How about making it success/failure and have the caller choose what to do?
The function returns 0 on success. On failure it returns the time in
jiffies to hold requests for before they are canceled. On failure the
returned time is essentially the hold time defined in TP4129 minus the
time it took to attempt CCR.
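That return convention can be sketched as follows (userspace model in
milliseconds rather than jiffies; `fence_result` is a hypothetical name
used only for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * 0 means it is safe to cancel/retry inflight IOs now; a nonzero
 * value is the remaining TP4129 hold time the caller must wait
 * before inflight requests may be retried on another path.
 */
static unsigned long fence_result(bool ccr_succeeded,
				  unsigned long hold_time_ms,
				  unsigned long elapsed_ms)
{
	if (ccr_succeeded)
		return 0;
	if (elapsed_ms >= hold_time_ms)
		return 0;	/* hold time already elapsed while trying CCR */
	return hold_time_ms - elapsed_ms;
}
```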
>
> > + }
> > +
> > + ret = nvme_issue_wait_ccr(sctrl, ictrl);
> > + atomic_inc(&sctrl->ccr_limit);
>
> inc after you wait for the ccr? shouldn't this be before?
I think it should be after we wait for CCR. sctrl->ccr_limit is the
number of concurrent CCRs the controller supports. Only after we are
done with CCR on this controller do we increment it.
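The acquire/release pattern around ccr_limit can be illustrated with a
single-threaded model of the kernel's atomic_dec_if_positive() (a
userspace sketch; the helper names are hypothetical):

```c
#include <assert.h>

/*
 * Mirrors atomic_dec_if_positive(): decrement only if the result
 * stays non-negative; return the would-be new value either way.
 * A negative return means no CCR slot was available.
 */
static int dec_if_positive(int *v)
{
	int val = *v - 1;

	if (val >= 0)
		*v = val;
	return val;
}

static int ccr_slots = 1;	/* models id->ccrl from the quoted code */

static int try_acquire(void)
{
	return dec_if_positive(&ccr_slots) >= 0;
}

/* Called only after the CCR wait finishes, as in the reply above. */
static void release_slot(void)
{
	ccr_slots++;
}
```

With a limit of one, a second CCR attempt fails until the first one
releases its slot after the wait completes.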
>
> > +
> > + if (!ret) {
> > + dev_info(ictrl->device, "CCR succeeded using %s\n",
> > + dev_name(sctrl->device));
> > + blk_put_queue(sctrl->admin_q);
> > + nvme_put_ctrl(sctrl);
> > + return 0;
> > + }
> > +
> > + /* Try another controller */
> > + min_cntlid = sctrl->cntlid + 1;
>
> OK, I see why min_cntlid is used. That is very non-intuitive.
>
> I'm wandering if it will be simpler to take one-shot at ccr and
> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
> unable
> to reset the ictrl in time, how would another ctrl do a better job here?
We need to attempt CCR from multiple controllers for reasons explained
in another response. As you figured out, min_cntlid is needed in order
to not loop over the controller list forever. Do you have a better idea?
>
> > + blk_put_queue(sctrl->admin_q);
> > + nvme_put_ctrl(sctrl);
> > + now = jiffies;
> > + }
> > +
> > + dev_info(ictrl->device, "CCR reached timeout, call it done\n");
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_recover_ctrl);
> > +
> > +void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + WRITE_ONCE(ctrl->state, NVME_CTRL_RESETTING);
>
> This needs to be a proper state transition.
We do not want to have a proper transition from RECOVERING to
RESETTING. The reason is that we do not want the controller to be reset
while it is being recovered/fenced, because requests should not be
canceled. One way to keep the transitions in nvme_change_ctrl_state()
is to use two states, say FENCING and FENCED.
The allowed transitions are
- LIVE -> FENCING
- FENCING -> FENCED
- FENCED -> (RESETTING, DELETING)
This will also get rid of NVME_CTRL_RECOVERED.
Does this sound good?
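The proposed scheme can be captured as a small transition check
(hypothetical enum and function; the real states live in
drivers/nvme/host/nvme.h and allow many transitions not modeled here):

```c
#include <assert.h>
#include <stdbool.h>

enum ctrl_state { ST_LIVE, ST_FENCING, ST_FENCED, ST_RESETTING, ST_DELETING };

/* Only the three fencing transitions proposed above, nothing else. */
static bool fencing_transition_ok(enum ctrl_state from, enum ctrl_state to)
{
	switch (to) {
	case ST_FENCING:
		return from == ST_LIVE;
	case ST_FENCED:
		return from == ST_FENCING;
	case ST_RESETTING:
	case ST_DELETING:
		return from == ST_FENCED;
	default:
		return false;
	}
}
```

Note how RESETTING and DELETING are reachable only from FENCED, which
encodes the rule that a controller cannot be reset or deleted mid-fence.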
* Re: [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2025-12-27 10:24 ` Sagi Grimberg
@ 2025-12-31 23:51 ` Mohamed Khalfella
2026-01-04 21:15 ` Sagi Grimberg
0 siblings, 1 reply; 68+ messages in thread
From: Mohamed Khalfella @ 2025-12-31 23:51 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 12:24:17 +0200, Sagi Grimberg wrote:
>
> > + log = kmalloc(sizeof(*log), GFP_KERNEL);
> > + if (!log)
> > + return;
> > +
> > + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> > + 0x00, log, sizeof(*log), 0);
> > + if (ret)
> > + goto out;
> > +
> > + spin_lock_irqsave(&ctrl->lock, flags);
> > + for (i = 0; i < le16_to_cpu(log->ne); i++) {
> > + entry = &log->entries[i];
> > + if (entry->ccrs == 0) /* skip in progress entries */
> > + continue;
>
> What does ccrs stand for?
Cross-Controller Reset Status
0x00 -> In Progress
0x01 -> Success
0x02 -> Failed
0x03 - 0xff -> Reserved
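These status values, together with the skip-in-progress filtering from
the quoted log-page loop, can be written down as a short sketch
(userspace C; the enum mirrors the values listed above):

```c
#include <assert.h>
#include <stddef.h>

/* CCRS (Cross-Controller Reset Status) values listed above. */
enum ccr_status {
	CCRS_IN_PROGRESS = 0x00,
	CCRS_SUCCESS     = 0x01,
	CCRS_FAILED      = 0x02,
	/* 0x03 - 0xff reserved */
};

/*
 * Count log entries that have completed (success or failure),
 * skipping in-progress ones like the quoted loop does.
 */
static size_t count_completed(const unsigned char *status, size_t n)
{
	size_t done = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (status[i] != CCRS_IN_PROGRESS)
			done++;
	return done;
}
```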
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-12-19 2:06 ` Randy Jennings
@ 2026-01-01 0:04 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-01 0:04 UTC (permalink / raw)
To: Randy Jennings
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Thu 2025-12-18 18:06:02 -0800, Randy Jennings wrote:
> On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state. In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usuall by tearing down controller and attempt
> > reconnecting to target. If CCR fails, then the behavior of recovery
> "usuall" -> "usual"
> "attempt reconnecting" -> "attempting to reconnect"
>
> it would read better with "the" added:
> "tearing down the controller"
> "reconnect to the target"
Updated as suggested.
>
> > depends on whether CQT is supported or not. If CQT is supported, switch
> > to time-based recovery by holding inflight IOs until it is safe for them
> > to be retried. If CQT is not supported proceed to retry requests
> > immediately, as the code currently does.
>
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
>
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > + return -EAGAIN;
> I see how setting this bit before the delayed work executes works
> to complete recovery, but it is kindof weird that the bit is called
> RECOVERED. I do not have a better name. TIME_BASED_RECOVERY?
> RECOVERY_WAIT?
Agree. It does look weird. If we agree to add two states FENCING and
FENCED then the flag might not be needed.
>
> > static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > {
> > - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> > struct nvme_tcp_ctrl, err_work);
> > struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > + if (nvme_tcp_recover_ctrl(ctrl))
> > + return;
> > + }
> > +
> > if (nvme_tcp_key_revoke_needed(ctrl))
> > nvme_auth_revoke_tls_key(ctrl);
> > nvme_stop_keep_alive(ctrl);
> The state of the controller should not be LIVE while waiting for
> recovery, so I do not think we will succeed in sending keep alives,
> but I think this should move to before (or inside of)
> nvme_tcp_recover_ctrl().
This is correct; no keepalive traffic will be sent in RECOVERING state.
If we split the fencing work from the existing error recovery work then
this should be removed. I think we are going in that direction.
>
> Sincerely,
> Randy Jennings
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-12-27 10:35 ` Sagi Grimberg
2025-12-31 0:13 ` Randy Jennings
@ 2026-01-01 0:27 ` Mohamed Khalfella
1 sibling, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-01 0:27 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sat 2025-12-27 12:35:23 +0200, Sagi Grimberg wrote:
>
>
> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to RECOVERING
> > state instead of RESETTING state. In RECOVERING state ctrl->err_work
> > will attempt to use cross-controller recovery to terminate inflight IOs
> > on the controller. If CCR succeeds, then switch to RESETTING state and
> > continue error recovery as usuall by tearing down controller and attempt
> > reconnecting to target. If CCR fails, then the behavior of recovery
> > depends on whether CQT is supported or not. If CQT is supported, switch
> > to time-based recovery by holding inflight IOs until it is safe for them
> > to be retried. If CQT is not supported proceed to retry requests
> > immediately, as the code currently does.
> >
> > To support implementing time-based recovery turn ctrl->err_work into
> > delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
> > while controller in RECOVERING state.
> >
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> > drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
> > 1 file changed, 45 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 9a96df1a511c..ec9a713490a9 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
> > struct sockaddr_storage src_addr;
> > struct nvme_ctrl ctrl;
> >
> > - struct work_struct err_work;
> > + struct delayed_work err_work;
> > struct delayed_work connect_work;
> > struct nvme_tcp_request async_req;
> > u32 io_queues[HCTX_MAX_TYPES];
> > @@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >
> > static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> > {
> > - if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > + if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
> > + !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>
> This warrants an explanation. It is not clear at all why we should allow
> two different
> transitions to allow error recovery to start...
The behavior of the ctrl->err_work depends on the controller state. We
go to RECOVERING only if the controller is LIVE. Otherwise, we attempt
to go to RESETTING.
>
> > return;
> >
> > dev_warn(ctrl->device, "starting error recovery\n");
> > - queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
> > }
> >
> > static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
> > @@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> > nvme_tcp_reconnect_or_remove(ctrl, ret);
> > }
> >
> > +static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
> > +{
> > + unsigned long rem;
> > +
> > + if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
> > + dev_info(ctrl->device, "completed time-based recovery\n");
> > + goto done;
> > + }
>
> This is also not clear, why should we get here when NVME_CTRL_RECOVERED
> is set?
NVME_CTRL_RECOVERED flag is set before scheduling ctrl->err_work as
delayed work. This is how time-based recovery is implemented.
We get here when ctrl->err_work runs for the second time, and at this
point we know that it is safe to just reset the controller and cancel
inflight requests.
> > +
> > + rem = nvme_recover_ctrl(ctrl);
> > + if (!rem)
> > + goto done;
> > +
> > + if (!ctrl->cqt) {
> > + dev_info(ctrl->device,
> > + "CCR failed, CQT not supported, skip time-based recovery\n");
> > + goto done;
> > + }
> > +
> > + dev_info(ctrl->device,
> > + "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > + jiffies_to_msecs(rem));
> > + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
> > + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
> > + return -EAGAIN;
>
> I don't think that reusing the same work to handle two completely
> different things
> is the right approach here.
>
> How about splitting to fence_work and err_work? That should eliminate
> some of the
> ctrl state inspections and simplify error recovery.
>
> > +
> > +done:
> > + nvme_end_ctrl_recovery(ctrl);
> > + return 0;
> > +}
> > +
> > static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > {
> > - struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > + struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> > struct nvme_tcp_ctrl, err_work);
> > struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >
> > + if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
> > + if (nvme_tcp_recover_ctrl(ctrl))
> > + return;
> > + }
> > +
>
> Yea, I think we want to rework the current design.
Good point. Splitting out ctrl->fence_work simplifies things. The if
condition above will be moved to fence_work. However, we will still need
to reschedule ctrl->fence_work from within itself to implement
time-based recovery. Is this a good option?
If not, and we prefer to drop the NVME_CTRL_RECOVERED flag above and not
reschedule ctrl->fence_work from within itself, then we can add
another ctrl->fenced_work. How about that?
* Re: [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
2025-12-31 22:26 ` Mohamed Khalfella
@ 2026-01-02 19:06 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-02 19:06 UTC (permalink / raw)
To: Randy Jennings
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Sagi Grimberg, Aaron Dailey, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Wed 2025-12-31 14:26:39 -0800, Mohamed Khalfella wrote:
> On Thu 2025-12-18 07:22:41 -0800, Randy Jennings wrote:
> > On Tue, Nov 25, 2025 at 6:13 PM Mohamed Khalfella
> > <mkhalfella@purestorage.com> wrote:
> > >
> > > TP2028 Rapid path failure added new fileds to controller identify
> > TP8028
>
> Fixed.
>
> > > response. Read CIU (Controller Instance Uniquifier), CIRN (Controller
> > > Instance Random Number), and CCRL (Cross-Controller Reset Limit) from
> > > controller identify response. Expose CIU and CIRN as sysfs attributes
> > > so the values can be used directrly by user if needed.
> > >
> > > TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> > > Time) which is used along with KATO (Keep Alive Timeout) to set an upper
> > > limite for attempting Cross-Controller Recovery.
> > "limite" -> "limit"
>
> Fixed.
>
> > >
> > > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > > ---
> > > drivers/nvme/host/core.c | 5 +++++
> > > drivers/nvme/host/nvme.h | 11 +++++++++++
> > > drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
> > > 3 files changed, 39 insertions(+)
> > >
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index fa4181d7de73..aa007a7b9606 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3572,12 +3572,17 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
> > > ctrl->crdt[1] = le16_to_cpu(id->crdt2);
> > > ctrl->crdt[2] = le16_to_cpu(id->crdt3);
> > >
> > > + ctrl->ciu = id->ciu;
> > > + ctrl->cirn = le64_to_cpu(id->cirn);
> > > + atomic_set(&ctrl->ccr_limit, id->ccrl);
> > Seems like it would be good for the target & init to use the same
> > name for these fields. I have a preference for these over
> > instance_uniquifier and random because they are more concise, but
> > the preference is not strong.
>
> The field names in the spec are concise, but they are also cryptic.
>
> >
> > > +
> > > ctrl->oacs = le16_to_cpu(id->oacs);
> > > ctrl->oncs = le16_to_cpu(id->oncs);
> > > ctrl->mtfa = le16_to_cpu(id->mtfa);
> > > ctrl->oaes = le32_to_cpu(id->oaes);
> > > ctrl->wctemp = le16_to_cpu(id->wctemp);
> > > ctrl->cctemp = le16_to_cpu(id->cctemp);
> > > + ctrl->cqt = le16_to_cpu(id->cqt);
> > >
> > > atomic_set(&ctrl->abort_limit, id->acl + 1);
> > > ctrl->vwc = id->vwc;
> > I cannot discern an ordering to the attributes set here. Any
> > particular reason, you placed cqt away from the others you added?
>
> No reason. Moved ctrl->cqt initialization up with other fields.
>
> >
> > > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > > index 29430949ce2f..ae36249ad61e 100644
> > > --- a/drivers/nvme/host/sysfs.c
> > > +++ b/drivers/nvme/host/sysfs.c
> > > @@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
> > > nvme_show_int_function(sqsize);
> > > nvme_show_int_function(kato);
> > >
> > > +static ssize_t nvme_sysfs_uniquifier_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > > +
> > > + return sysfs_emit(buf, "%02x\n", ctrl->ciu);
> > > +}
> > > +static DEVICE_ATTR(uniquifier, S_IRUGO, nvme_sysfs_uniquifier_show, NULL);
> > > +
> > > +static ssize_t nvme_sysfs_random_show(struct device *dev,
> > > + struct device_attribute *attr,
> > > + char *buf)
> > > +{
> > > + struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
> > > +
> > > + return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
> > > +}
> > > +static DEVICE_ATTR(random, S_IRUGO, nvme_sysfs_random_show, NULL);
> > > +
> > > +
> > > static ssize_t nvme_sysfs_delete(struct device *dev,
> > > struct device_attribute *attr, const char *buf,
> > > size_t count)
> > > @@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
> > > &dev_attr_numa_node.attr,
> > > &dev_attr_queue_count.attr,
> > > &dev_attr_sqsize.attr,
> > > + &dev_attr_uniquifier.attr,
> > > + &dev_attr_random.attr,
> > > &dev_attr_hostnqn.attr,
> > > &dev_attr_hostid.attr,
> > > &dev_attr_ctrl_loss_tmo.attr,
> > > --
> > > 2.51.2
> > >
> >
> > These are the names used in the target code (uniquifier & random).
> > I'd rather have them match (identify structure will have spec's
> > abbreviations; ctrl & debug/sysfs for target & initiator either be
> > ciu/cirn or uniquifier/random).
>
> I think it matters for sysfs attributes. I do not know the right thing
> to do. Should we use spec names like "cirn" or call it "random"?
Now that I think about it, sticking to the spec's abbreviations
makes more sense here. Names like "random" and "uniquifier" in sysfs are
not descriptive enough. I changed the struct member names, sysfs, and
debugfs file names to match the spec.
>
> >
> > But this is small stuff.
> >
> > Reviewed-by: Randy Jennings <randyj@purestorage.com>
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2025-12-31 22:00 ` Mohamed Khalfella
@ 2026-01-04 21:09 ` Sagi Grimberg
2026-01-07 2:58 ` Randy Jennings
2026-01-30 22:31 ` Mohamed Khalfella
0 siblings, 2 replies; 68+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:09 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 01/01/2026 0:00, Mohamed Khalfella wrote:
> On Sat 2025-12-27 11:48:49 +0200, Sagi Grimberg wrote:
>> On 25/12/2025 20:13, Mohamed Khalfella wrote:
>>> On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
>>>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>>>> Send an AEN to initiator when impacted controller exists. The
>>>>> notification points to CCR log page that initiator can read to check
>>>>> which CCR operation completed.
>>>>>
>>>>> Signed-off-by: Mohamed Khalfella<mkhalfella@purestorage.com>
>>>>> ---
>>>>> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
>>>>> drivers/nvme/target/nvmet.h | 3 ++-
>>>>> include/linux/nvme.h | 3 +++
>>>>> 3 files changed, 28 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>>>> index 7dbe9255ff42..60173833c3eb 100644
>>>>> --- a/drivers/nvme/target/core.c
>>>>> +++ b/drivers/nvme/target/core.c
>>>>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
>>>>> nvmet_async_events_process(ctrl);
>>>>> }
>>>>>
>>>>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
>>>>> u8 event_info, u8 log_page)
>>>>> {
>>>>> struct nvmet_async_event *aen;
>>>>> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>>> aen->event_info = event_info;
>>>>> aen->log_page = log_page;
>>>>>
>>>>> - mutex_lock(&ctrl->lock);
>>>>> list_add_tail(&aen->entry, &ctrl->async_events);
>>>>> - mutex_unlock(&ctrl->lock);
>>>>>
>>>>> queue_work(nvmet_wq, &ctrl->async_event_work);
>>>>> }
>>>>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>>> + u8 event_info, u8 log_page)
>>>>> +{
>>>>> + mutex_lock(&ctrl->lock);
>>>>> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
>>>>> + mutex_unlock(&ctrl->lock);
>>>>> +}
>>>>>
>>>>> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
>>>>> {
>>>>> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>>>>>
>>>>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
>>>>> +{
>>>>> + lockdep_assert_held(&ctrl->lock);
>>>>> +
>>>>> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
>>>>> + return;
>>>>> +
>>>>> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
>>>>> + NVME_AER_NOTICE_CCR_COMPLETED,
>>>>> + NVME_LOG_CCR);
>>>>> +}
>>>>> +
>>>>> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>>>>> {
>>>>> struct nvmet_subsys *subsys = ctrl->subsys;
>>>>> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>>>>> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
>>>>> mutex_lock(&sctrl->lock);
>>>>> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
>>>>> - if (ccr->ctrl == ctrl)
>>>>> + if (ccr->ctrl == ctrl) {
>>>>> + nvmet_ctrl_notify_ccr(sctrl);
>>>>> ccr->ctrl = NULL;
>>>>> + }
>>>> Is this double loop necessary? Would you have more than one controller
>>>> cross resetting the same
>>> As it is implemented now CCRs are linked to sctrl. This decision can be
>>> revisited if found suboptimal. At some point I had CCRs linked to
>>> ctrl->subsys but that led to lock ordering issues. Double loop is
>>> necessary to find all CCRs in all controllers and mark them done.
>>> Yes, it is possible to have more than one sctrl resetting the same
>>> ictrl.
>> I'm more interested in simplifying.
>>
>>>> controller? Won't it be better to install a callback+opaque that the
>>>> controller removal will call?
>>> Can you elaborate more on that? Better in what terms?
>>>
>>> nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
>>> we know that ctrl->ref is zero and no new CCRs will be added to this
>>> controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
>> In nvmet, the controller is serving a single host. Hence I am not sure I
>> understand how multiple source controllers will try to reset the impacted
>> controller. So, if there is a 1-1 relationship between source and impacted
>> controller, I'd perhaps suggest to simplify and install on the impacted
>> controller
>> callback+opaque (e.g. void *data) instead of having it iterate and then
>> actually send
>> the AEN from the impacted controller.
> A controller is serving a single path for a given host. A host that is
> connected to nvme subsystem via multiple paths will have more than one
> controller. I can think of two reasons why we need to support resetting
> an impacted controller from multiple source controllers.
>
> - It is possible for multiple paths to go down at the same time. The
> first source controller we use for CCR, even though we check to see if
> LIVE, might have lost connection to subsystem. It is a matter of time
> for it to see keepalive timeout and fail too. If CCR fails using this
> controller we should not give up. We need to try other paths.
But the host is doing the cross-reset synchronously... it waits for
kato for a completion of the reset, and then gives up; it's not like it
is sitting there waiting for the AEN...
Even when the spec states a capability/flexibility, it is still Linux's
choice whether to implement it. I'm trying to understand if we can
simplify the Linux host and controller in this non-trivial error
recovery flow.
What do you expect to happen in general? What are your expected kato/cqt
values? How many attempts do we want the host to do?
> - Some nvme subsystems might support resetting impacted controller from
> a subset of controllers connected to the host. An array that has
> multiple frontend engines might not support resetting controllers
> across engines. In fact, TP8028 allows for subsystem to suggest to
> host to use another source controller in the Alternate Controller ID
> (ACID) field on the CCR logpage (not implemented in this patchset).
It is not the case, though, that the impacted controller will be reset
from multiple source controllers at the same time...
I'd also say that if indeed there are subsystems that require specific
controllers to do
cross recovery, they won't be able to use this at all... Are there any
such arrays?
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-12-31 0:04 ` Randy Jennings
@ 2026-01-04 21:14 ` Sagi Grimberg
2026-01-07 3:16 ` Randy Jennings
0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:14 UTC (permalink / raw)
To: Randy Jennings
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
On 31/12/2025 2:04, Randy Jennings wrote:
>>> +
>>> + if (!ret) {
>>> + dev_info(ictrl->device, "CCR succeeded using %s\n",
>>> + dev_name(sctrl->device));
>>> + blk_put_queue(sctrl->admin_q);
>>> + nvme_put_ctrl(sctrl);
>>> + return 0;
>>> + }
>>> +
>>> + /* Try another controller */
>>> + min_cntlid = sctrl->cntlid + 1;
>> OK, I see why min_cntlid is used. That is very non-intuitive.
>>
>> I'm wondering if it will be simpler to take one-shot at ccr and
>> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
>> unable
>> to reset the ictrl in time, how would another ctrl do a better job here?
> There are many different kinds of failures we are dealing with here
> that result in a dropped connection (association). It could be a problem
> with the specific link, or it could be that the node of an HA pair in the
> storage array went down. In the case of a specific link problem, maybe
> only one of the connections is down and any controller would work.
> In the case of the node of an HA pair, roughly half of the connections
> are going down, and there is a race between the controllers which
> are detected down first. There were some heuristics put into the
> spec about deciding which controller to use, but that is more code
> and a refinement that could come later (and they are still heuristics;
> they may not be helpful).
>
> Because CCR offers a significant win of shortening the recovery time
> substantially, it is worth retrying on the other controllers. This time
> affects when we can start retrying IO. KATO is in seconds, and
> NVMEoF should have the capability of doing a significant amount of
> IOs in each of those seconds.
But it doesn't actually do I/O; it issues I/O and then waits for it to
time out.
>
> Besides, the alternative is just to wait. Might as well be actively trying
> to shorten that wait time. Besides a small increase in code complexity,
> is there a downside to doing so?
Simplicity is very important when it comes to non-trivial code paths
like error recovery.
* Re: [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2025-12-31 23:51 ` Mohamed Khalfella
@ 2026-01-04 21:15 ` Sagi Grimberg
2026-01-30 22:32 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:15 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 01/01/2026 1:51, Mohamed Khalfella wrote:
> On Sat 2025-12-27 12:24:17 +0200, Sagi Grimberg wrote:
>>> + log = kmalloc(sizeof(*log), GFP_KERNEL);
>>> + if (!log)
>>> + return;
>>> +
>>> + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
>>> + 0x00, log, sizeof(*log), 0);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + spin_lock_irqsave(&ctrl->lock, flags);
>>> + for (i = 0; i < le16_to_cpu(log->ne); i++) {
>>> + entry = &log->entries[i];
>>> + if (entry->ccrs == 0) /* skip in progress entries */
>>> + continue;
>> What does ccrs stand for?
> Cross-Controller Reset Status
>
> 0x00 -> In Progress
> 0x01 -> Success
> 0x02 -> Failed
> 0x03 - 0xff -> Reserved
Let's add it as a proper enumeration please.
* Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
2025-12-31 0:13 ` Randy Jennings
@ 2026-01-04 21:19 ` Sagi Grimberg
0 siblings, 0 replies; 68+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:19 UTC (permalink / raw)
To: Randy Jennings
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
On 31/12/2025 2:13, Randy Jennings wrote:
> On Sat, Dec 27, 2025 at 2:35 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> ...
>>> + dev_info(ctrl->device,
>>> + "CCR failed, switch to time-based recovery, timeout = %ums\n",
>>> + jiffies_to_msecs(rem));
>>> + set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
>>> + queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
>>> + return -EAGAIN;
>> I don't think that reusing the same work to handle two completely
>> different things
>> is the right approach here.
>>
>> How about splitting to fence_work and err_work? That should eliminate
>> some of the
>> ctrl state inspections and simplify error recovery.
> If the work was independent and could happen separately (probably
> in parallel), I could understand having separate work structures. But they
> are not independent, and they have a definite relationship.
The relationship that is defined here is that error recovery does not start
before fencing is completed.
> Like Mohamed,
> I thought of them as different stages of the same work. Having an extra
> work item takes up more space (I would be concerned about scalability to
> thousands or 10s of thousands of associations and then go one order of
> magnitude higher for margin), and it also causes a connection object
> (referenced during IO) to take up more cache lines. Is it worth taking up
> that space, when the separate work items would be different, dependent
> stages in the same process?
Yes, IMO the added space of an additional work_struct is much better than
adding more state around a single work handler that is queued up multiple
times doing effectively different things.
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2025-12-31 23:43 ` Mohamed Khalfella
@ 2026-01-04 21:39 ` Sagi Grimberg
2026-01-30 22:01 ` Mohamed Khalfella
0 siblings, 1 reply; 68+ messages in thread
From: Sagi Grimberg @ 2026-01-04 21:39 UTC (permalink / raw)
To: Mohamed Khalfella
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On 01/01/2026 1:43, Mohamed Khalfella wrote:
> On Sat 2025-12-27 12:14:11 +0200, Sagi Grimberg wrote:
>>
>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
>>> A host that has more than one path connecting to an nvme subsystem
>>> typically has an nvme controller associated with every path. This is
>>> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
>>> path should not be retried immediately on another path because this
>>> could lead to data corruption as described in TP4129. TP8028 defines
>>> cross-controller reset mechanism that can be used by host to terminate
>>> IOs on the failed path using one of the remaining healthy paths. Only
>>> after IOs are terminated, or long enough time passes as defined by
>>> TP4129, inflight IOs should be retried on another path. Implement core
>>> cross-controller reset shared logic to be used by the transports.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>> drivers/nvme/host/constants.c | 1 +
>>> drivers/nvme/host/core.c | 133 ++++++++++++++++++++++++++++++++++
>>> drivers/nvme/host/nvme.h | 10 +++
>>> 3 files changed, 144 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
>>> index dc90df9e13a2..f679efd5110e 100644
>>> --- a/drivers/nvme/host/constants.c
>>> +++ b/drivers/nvme/host/constants.c
>>> @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
>>> [nvme_admin_virtual_mgmt] = "Virtual Management",
>>> [nvme_admin_nvme_mi_send] = "NVMe Send MI",
>>> [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
>>> + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
>>> [nvme_admin_dbbuf] = "Doorbell Buffer Config",
>>> [nvme_admin_format_nvm] = "Format NVM",
>>> [nvme_admin_security_send] = "Security Send",
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index f5b84bc327d3..f38b70ca9cee 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -554,6 +554,138 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
>>> }
>>> EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
>>>
>>> +static struct nvme_ctrl *nvme_find_ccr_ctrl(struct nvme_ctrl *ictrl,
>>> + u32 min_cntlid)
>>> +{
>>> + struct nvme_subsystem *subsys = ictrl->subsys;
>>> + struct nvme_ctrl *sctrl;
>>> + unsigned long flags;
>>> +
>>> + mutex_lock(&nvme_subsystems_lock);
>> This looks like the wrong lock to take here?
> This is similar to nvme_validate_cntlid()?
> What is the correct lock to use?
Not really, it's only because it is called from nvme_init_subsystem(),
which spans subsystems.
>
>>> + list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
>>> + if (sctrl->cntlid < min_cntlid)
>>> + continue;
>> The use of min_cntlid is not clear to me.
>>
>>> +
>>> + if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
>>> + continue;
>>> +
>>> + spin_lock_irqsave(&sctrl->lock, flags);
>>> + if (sctrl->state != NVME_CTRL_LIVE) {
>>> + spin_unlock_irqrestore(&sctrl->lock, flags);
>>> + atomic_inc(&sctrl->ccr_limit);
>>> + continue;
>>> + }
>>> +
>>> + /*
>>> + * We got a good candidate source controller that is locked and
>>> + * LIVE. However, no guarantee sctrl will not be deleted after
>>> + * sctrl->lock is released. Get a ref of both sctrl and admin_q
>>> + * so they do not disappear until we are done with them.
>>> + */
>>> + WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
>>> + nvme_get_ctrl(sctrl);
>>> + spin_unlock_irqrestore(&sctrl->lock, flags);
>>> + goto found;
>>> + }
>>> + sctrl = NULL;
>>> +found:
>>> + mutex_unlock(&nvme_subsystems_lock);
>>> + return sctrl;
>>> +}
>>> +
>>> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
>>> +{
>>> + unsigned long flags, tmo, remain;
>>> + struct nvme_ccr_entry ccr = { };
>>> + union nvme_result res = { 0 };
>>> + struct nvme_command c = { };
>>> + u32 result;
>>> + int ret = 0;
>>> +
>>> + init_completion(&ccr.complete);
>>> + ccr.ictrl = ictrl;
>>> +
>>> + spin_lock_irqsave(&sctrl->lock, flags);
>>> + list_add_tail(&ccr.list, &sctrl->ccrs);
>>> + spin_unlock_irqrestore(&sctrl->lock, flags);
>>> +
>>> + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
>>> + c.ccr.ciu = ictrl->ciu;
>>> + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
>>> + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
>>> + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
>>> + NULL, 0, NVME_QID_ANY, 0);
>>> + if (ret)
>>> + goto out;
>>> +
>>> + result = le32_to_cpu(res.u32);
>>> + if (result & 0x01) /* Immediate Reset */
>>> + goto out;
>>> +
>>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
>>> + remain = wait_for_completion_timeout(&ccr.complete, tmo);
>>> + if (!remain)
>> I think remain is redundant here.
> Deleted 'remain'.
>
>>> + ret = -EAGAIN;
>>> +out:
>>> + spin_lock_irqsave(&sctrl->lock, flags);
>>> + list_del(&ccr.list);
>>> + spin_unlock_irqrestore(&sctrl->lock, flags);
>>> + return ccr.ccrs == 1 ? 0 : ret;
>> Why would you still return 0 and not EAGAIN? you expired on timeout but
>> still
>> return success if you have ccrs=1? btw you have ccrs in the ccr struct
>> and in the controller
>> as a list. Lets rename to distinguish the two.
True, we did hit the timeout here. However, after we removed the ccr
entry we found that it was marked as completed. We return success in
this case even though we hit the timeout.
When does this happen? Why is it worth having the code non-intuitive for
something that effectively never happens (unless I'm missing something?)
>
> Renamed ctrl->ccrs to ctrl->ccr_list.
>
>>> +}
>>> +
>>> +unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
>>> +{
>> I'd call it nvme_fence_controller()
> Okay. I will do that. I will also rename the controller state FENCING.
>
>>> + unsigned long deadline, now, timeout;
>>> + struct nvme_ctrl *sctrl;
>>> + u32 min_cntlid = 0;
>>> + int ret;
>>> +
>>> + timeout = nvme_recovery_timeout_ms(ictrl);
>>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
>>> +
>>> + now = jiffies;
>>> + deadline = now + msecs_to_jiffies(timeout);
>>> + while (time_before(now, deadline)) {
>>> + sctrl = nvme_find_ccr_ctrl(ictrl, min_cntlid);
>>> + if (!sctrl) {
>>> + /* CCR failed, switch to time-based recovery */
>>> + return deadline - now;
>> It is not clear what is the return code semantics of this function.
>> How about making it success/failure and have the caller choose what to do?
> The function returns 0 on success. On failure it returns the time in
> jiffies to hold requests for before they are canceled. On failure the
> returned time is essentially the hold time defined in TP4129 minus the
> time it took to attempt CCR.
I think it would be cleaner to simply have this function return a
status code and have the caller worry about the time spent.
>
>>> + }
>>> +
>>> + ret = nvme_issue_wait_ccr(sctrl, ictrl);
>>> + atomic_inc(&sctrl->ccr_limit);
>> inc after you wait for the ccr? shouldn't this be before?
> I think it should be after we wait for CCR. sctrl->ccr_limit is the
> number of concurrent CCRs the controller supports. Only after we are
> done with CCR on this controller we increment it.
Maybe it should be folded into nvme_issue_wait_ccr for symmetry?
>
>>> +
>>> + if (!ret) {
>>> + dev_info(ictrl->device, "CCR succeeded using %s\n",
>>> + dev_name(sctrl->device));
>>> + blk_put_queue(sctrl->admin_q);
>>> + nvme_put_ctrl(sctrl);
>>> + return 0;
>>> + }
>>> +
>>> + /* Try another controller */
>>> + min_cntlid = sctrl->cntlid + 1;
>> OK, I see why min_cntlid is used. That is very non-intuitive.
>>
>> I'm wondering if it will be simpler to take one-shot at ccr and
>> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
>> unable
>> to reset the ictrl in time, how would another ctrl do a better job here?
> We need to attempt CCR from multiple controllers for reason explained in
> another response. As you figured out min_cntlid is needed in order to
> not loop controller list forever. Do you have a better idea?
No, just know that I don't like it very much :)
>
>>> + blk_put_queue(sctrl->admin_q);
>>> + nvme_put_ctrl(sctrl);
>>> + now = jiffies;
>>> + }
>>> +
>>> + dev_info(ictrl->device, "CCR reached timeout, call it done\n");
>>> + return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(nvme_recover_ctrl);
>>> +
>>> +void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl)
>>> +{
>>> + unsigned long flags;
>>> +
>>> + spin_lock_irqsave(&ctrl->lock, flags);
>>> + WRITE_ONCE(ctrl->state, NVME_CTRL_RESETTING);
>> This needs to be a proper state transition.
> We do not want to have proper transition from RECOVERING to RESETTING.
> The reason is that we do not want the controller to be reset while it is
> being recovered/fenced because requests should not be canceled. One way
> to keep the transitions in nvme_change_ctrl_state() is to use two
> states. Say FENCING and FENCED.
>
> The allowed transitions are
>
> - LIVE -> FENCING
> - FENCING -> FENCED
> - FENCED -> (RESETTING, DELETING)
>
> This will also get rid of NVME_CTRL_RECOVERED
>
> Does this sound good?
We could do what failfast is doing, in case we get transition FENCING ->
RESETTING/DELETING we flush
the fence_work...
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2026-01-04 21:09 ` Sagi Grimberg
@ 2026-01-07 2:58 ` Randy Jennings
2026-01-30 22:31 ` Mohamed Khalfella
1 sibling, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2026-01-07 2:58 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
On Sun, Jan 4, 2026 at 1:09 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> On 01/01/2026 0:00, Mohamed Khalfella wrote:
> > On Sat 2025-12-27 11:48:49 +0200, Sagi Grimberg wrote:
> >> On 25/12/2025 20:13, Mohamed Khalfella wrote:
> >>> On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
> >>>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>>>> Send an AEN to initiator when impacted controller exists. The
> >>>>> notification points to CCR log page that initiator can read to check
> >>>>> which CCR operation completed.
> >>>>>
> >>>>> Signed-off-by: Mohamed Khalfella<mkhalfella@purestorage.com>
> >>>>> ---
> >>>>> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
> >>>>> drivers/nvme/target/nvmet.h | 3 ++-
> >>>>> include/linux/nvme.h | 3 +++
> >>>>> 3 files changed, 28 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> >>>>> index 7dbe9255ff42..60173833c3eb 100644
> >>>>> --- a/drivers/nvme/target/core.c
> >>>>> +++ b/drivers/nvme/target/core.c
> >>>>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> >>>>> nvmet_async_events_process(ctrl);
> >>>>> }
> >>>>>
> >>>>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> u8 event_info, u8 log_page)
> >>>>> {
> >>>>> struct nvmet_async_event *aen;
> >>>>> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> aen->event_info = event_info;
> >>>>> aen->log_page = log_page;
> >>>>>
> >>>>> - mutex_lock(&ctrl->lock);
> >>>>> list_add_tail(&aen->entry, &ctrl->async_events);
> >>>>> - mutex_unlock(&ctrl->lock);
> >>>>>
> >>>>> queue_work(nvmet_wq, &ctrl->async_event_work);
> >>>>> }
> >>>>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> + u8 event_info, u8 log_page)
> >>>>> +{
> >>>>> + mutex_lock(&ctrl->lock);
> >>>>> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> >>>>> + mutex_unlock(&ctrl->lock);
> >>>>> +}
> >>>>>
> >>>>> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> >>>>> {
> >>>>> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >>>>> }
> >>>>> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >>>>>
> >>>>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> >>>>> +{
> >>>>> + lockdep_assert_held(&ctrl->lock);
> >>>>> +
> >>>>> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> >>>>> + return;
> >>>>> +
> >>>>> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> >>>>> + NVME_AER_NOTICE_CCR_COMPLETED,
> >>>>> + NVME_LOG_CCR);
> >>>>> +}
> >>>>> +
> >>>>> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>>>> {
> >>>>> struct nvmet_subsys *subsys = ctrl->subsys;
> >>>>> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>>>> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> >>>>> mutex_lock(&sctrl->lock);
> >>>>> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
> >>>>> - if (ccr->ctrl == ctrl)
> >>>>> + if (ccr->ctrl == ctrl) {
> >>>>> + nvmet_ctrl_notify_ccr(sctrl);
> >>>>> ccr->ctrl = NULL;
> >>>>> + }
> >>>> Is this double loop necessary? Would you have more than one controller
> >>>> cross resetting the same
> >>> As it is implemented now CCRs are linked to sctrl. This decision can be
> >>> revisited if found suboptimal. At some point I had CCRs linked to
> >>> ctrl->subsys but that led to lock ordering issues. Double loop is
> >>> necessary to find all CCRs in all controllers and mark them done.
> >>> Yes, it is possible to have more than one sctrl resetting the same
> >>> ictrl.
> >> I'm more interested in simplifying.
> >>
> >>>> controller? Won't it be better to install a callback+opaque that the
> >>>> controller removal will call?
> >>> Can you elaborate more on that? Better in what terms?
> >>>
> >>> nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
> >>> we know that ctrl->ref is zero and no new CCRs will be added to this
> >>> controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
> >> In nvmet, the controller is serving a single host. Hence I am not sure I
> >> understand how multiple source controllers will try to reset the impacted
> >> controller. So, if there is a 1-1 relationship between source and impacted
> >> controller, I'd perhaps suggest to simplify and install on the impacted
> >> controller
> >> callback+opaque (e.g. void *data) instead of having it iterate and then
> >> actually send
> >> the AEN from the impacted controller.
> > A controller is serving a single path for a given host. A host that is
> > connected to nvme subsystem via multiple paths will have more than one
> > controller. I can think of two reasons why we need to support resetting
> > an impacted controller from multiple source controllers.
> >
> > - It is possible for multiple paths to go down at the same time. The
> > first source controller we use for CCR, even though we check to see if
> > LIVE, might have lost connection to subsystem. It is a matter of time
> > for it to see keepalive timeout and fail too. If CCR fails using this
> > controller we should not give up. We need to try other paths.
>
> But the host is doing the cross-reset synchronously... it waits for
> kato for a completion of the reset, and then gives up, it's not like it
> is sitting there waiting for the AEN...
The Linux host as Mohamed has implemented it is doing the CCR
synchronously. I expect most hosts will also, until they time out one
controller because the connection might be dying, and that timeout
might come sooner than the target. Controllers publish how many
outstanding CCRs they may be the source of, not how many
controllers they may be impacted controllers for. However, it is my
understanding (correct me if I am wrong) that the Linux nvmet
implementation is primarily for testing the Linux nvme host
implementation. Whatever limitations keep the nvmet code
simple can make sense, even if they do not in a production
system.
>
> Generally, even though the spec states a capability/flexibility, it is
> still Linux's
> choice whether to implement it. I'm trying to understand if we can
> simplify Linux host and controller in this non-trivial error recovery flow.
>
> What is your expectation to happen in general? what are your expected
> kato/cqt
> values? how many attempts do we want the host to do?
For CQT, I see 30-60 seconds. For KATO, I see 5-60 seconds. I see
KATO skewing lower than it should because of the cost to reliably avoid
data corruption.
In general, I expect an HA-pair storage array to benefit most from
implementing this feature. Often, a host would establish an association
over 4 paths, 2 to each member of the pair. During non-disruptive operation
it is common for a member of the pair to go down for software upgrades or
hardware modifications while the other member stays up. In this configuration,
2 paths go down when the member goes down. The host has a 2 in 3
chance of picking a controller to which the host still has a connection.
Given the tremendous reduction in failover time (even with a timeout
on the CCR) and my perception of the complexity cost of a retry, I would
much rather make the host have a 100% chance of using CCR
successfully after a retry.
> > - Some nvme subsystems might support resetting impacted controller from
> > a subset of controllers connected to the host. An array that has
> > multiple frontend engines might not support resetting controllers
> > across engines. In fact, TP8028 allows the subsystem to suggest to the
> > host to use another source controller via the Alternate Controller ID
> > (ACID) field on the CCR log page (not implemented in this patchset).
>
> It is not the case though that the impacted controller will be reset from
> multiple
> source controllers at the same time...
>
> I'd also say that if indeed there are subsystems that require specific
> controllers to do
> cross recovery, they won't be able to use this at all... Are there any
> such arrays?
There are certainly systems out there (striping, Ceph-like, or stretched to
support non-dispersed hosts) that have many more nodes than an HA-pair
and may have communication barriers that prevent nvme controllers from
being able to handle a CCR for another nvme controller. I still think they
would benefit from retrying a CCR to find a path that can succeed.
Sincerely,
Randy Jennings
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2026-01-04 21:14 ` Sagi Grimberg
@ 2026-01-07 3:16 ` Randy Jennings
0 siblings, 0 replies; 68+ messages in thread
From: Randy Jennings @ 2026-01-07 3:16 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Mohamed Khalfella, Chaitanya Kulkarni, Christoph Hellwig,
Jens Axboe, Keith Busch, Aaron Dailey, John Meneghini,
Hannes Reinecke, linux-nvme, linux-kernel
On Sun, Jan 4, 2026 at 1:14 PM Sagi Grimberg <sagi@grimberg.me> wrote:
> On 31/12/2025 2:04, Randy Jennings wrote:
> >>> +
> >>> + if (!ret) {
> >>> + dev_info(ictrl->device, "CCR succeeded using %s\n",
> >>> + dev_name(sctrl->device));
> >>> + blk_put_queue(sctrl->admin_q);
> >>> + nvme_put_ctrl(sctrl);
> >>> + return 0;
> >>> + }
> >>> +
> >>> + /* Try another controller */
> >>> + min_cntlid = sctrl->cntlid + 1;
> >> OK, I see why min_cntlid is used. That is very non-intuitive.
> >>
> >> I'm wondering if it will be simpler to take one-shot at ccr and
> >> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
> >> unable
> >> to reset the ictrl in time, how would another ctrl do a better job here?
> > There are many different kinds of failures we are dealing with here
> > that result in a dropped connection (association). It could be a problem
> > with the specific link, or it could be that the node of an HA pair in the
> > storage array went down. In the case of a specific link problem, maybe
> > only one of the connections is down and any controller would work.
> > In the case of the node of an HA pair, roughly half of the connections
> > are going down, and there is a race between the controllers which
> > are detected down first. There were some heuristics put into the
> > spec about deciding which controller to use, but that is more code
> > and a refinement that could come later (and they are still heuristics;
> > they may not be helpful).
> >
> > Because CCR offers a significant win of shortening the recovery time
> > substantially, it is worth retrying on the other controllers. This time
> > affects when we can start retrying IO. KATO is in seconds, and
> > NVMe-oF should have the capability of doing a significant amount of
> > IOs in each of those seconds.
>
> But it doesn't actually do I/O, it issues I/O and then wait for it to
> time out.
Retrying CCR does not actually do I/O (trying to place your antecedent),
but a successful CCR allows the host to get back to doing I/O. Every
second saved can be a significant amount of I/O. If you were given a
choice between a 1 second failover and a 60 second failover, of course,
you would go for the 1 second failover. However, if I was given the
option of a 10 second failover and a 60 second failover, I would still
go for the 10 second failover. 50 seconds is still extremely valuable.
>
> >
> > Besides, the alternative is just to wait. Might as well be actively trying
> > to shorten that wait time. Besides a small increase in code complexity,
> > is there a downside to doing so?
>
> Simplicity is very important when it comes to non-trivial code paths
> like error recovery.
Okay, yes, unwarranted complexity, even with some benefit, might not
be worth it. I can see that my comment could be taken as flippant. But
the extra complexity here yields an important and material benefit.
Sincerely,
Randy Jennings
* Re: [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery
2026-01-04 21:39 ` Sagi Grimberg
@ 2026-01-30 22:01 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:01 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sun 2026-01-04 23:39:35 +0200, Sagi Grimberg wrote:
>
>
> On 01/01/2026 1:43, Mohamed Khalfella wrote:
> > On Sat 2025-12-27 12:14:11 +0200, Sagi Grimberg wrote:
> >>
> >> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>> A host that has more than one path connecting to an nvme subsystem
> >>> typically has an nvme controller associated with every path. This is
> >>> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> >>> path should not be retried immediately on another path because this
> >>> could lead to data corruption as described in TP4129. TP8028 defines
> >>> cross-controller reset mechanism that can be used by host to terminate
> >>> IOs on the failed path using one of the remaining healthy paths. Only
> >>> after IOs are terminated, or long enough time passes as defined by
> >>> TP4129, inflight IOs should be retried on another path. Implement core
> >>> cross-controller reset shared logic to be used by the transports.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>> drivers/nvme/host/constants.c | 1 +
> >>> drivers/nvme/host/core.c | 133 ++++++++++++++++++++++++++++++++++
> >>> drivers/nvme/host/nvme.h | 10 +++
> >>> 3 files changed, 144 insertions(+)
> >>>
> >>> diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> >>> index dc90df9e13a2..f679efd5110e 100644
> >>> --- a/drivers/nvme/host/constants.c
> >>> +++ b/drivers/nvme/host/constants.c
> >>> @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> >>> [nvme_admin_virtual_mgmt] = "Virtual Management",
> >>> [nvme_admin_nvme_mi_send] = "NVMe Send MI",
> >>> [nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> >>> + [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> >>> [nvme_admin_dbbuf] = "Doorbell Buffer Config",
> >>> [nvme_admin_format_nvm] = "Format NVM",
> >>> [nvme_admin_security_send] = "Security Send",
> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >>> index f5b84bc327d3..f38b70ca9cee 100644
> >>> --- a/drivers/nvme/host/core.c
> >>> +++ b/drivers/nvme/host/core.c
> >>> @@ -554,6 +554,138 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> >>> }
> >>> EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
> >>>
> >>> +static struct nvme_ctrl *nvme_find_ccr_ctrl(struct nvme_ctrl *ictrl,
> >>> + u32 min_cntlid)
> >>> +{
> >>> + struct nvme_subsystem *subsys = ictrl->subsys;
> >>> + struct nvme_ctrl *sctrl;
> >>> + unsigned long flags;
> >>> +
> >>> + mutex_lock(&nvme_subsystems_lock);
> >> This looks like the wrong lock to take here?
> > This is similar to nvme_validate_cntlid()?
> > What is the correct lock to use?
>
> Not really, it's only because it is called from nvme_init_subsystem which
> spans
> subsystems.
Okay. I will use this lock for now. If this is not the right lock to use
please point me to the right one.
>
> >
> >>> + list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> >>> + if (sctrl->cntlid < min_cntlid)
> >>> + continue;
> >> The use of min_cntlid is not clear to me.
> >>
> >>> +
> >>> + if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
> >>> + continue;
> >>> +
> >>> + spin_lock_irqsave(&sctrl->lock, flags);
> >>> + if (sctrl->state != NVME_CTRL_LIVE) {
> >>> + spin_unlock_irqrestore(&sctrl->lock, flags);
> >>> + atomic_inc(&sctrl->ccr_limit);
> >>> + continue;
> >>> + }
> >>> +
> >>> + /*
> >>> + * We got a good candidate source controller that is locked and
> >>> + * LIVE. However, no guarantee sctrl will not be deleted after
> >>> + * sctrl->lock is released. Get a ref of both sctrl and admin_q
> >>> + * so they do not disappear until we are done with them.
> >>> + */
> >>> + WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
> >>> + nvme_get_ctrl(sctrl);
> >>> + spin_unlock_irqrestore(&sctrl->lock, flags);
> >>> + goto found;
> >>> + }
> >>> + sctrl = NULL;
> >>> +found:
> >>> + mutex_unlock(&nvme_subsystems_lock);
> >>> + return sctrl;
> >>> +}
> >>> +
> >>> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> >>> +{
> >>> + unsigned long flags, tmo, remain;
> >>> + struct nvme_ccr_entry ccr = { };
> >>> + union nvme_result res = { 0 };
> >>> + struct nvme_command c = { };
> >>> + u32 result;
> >>> + int ret = 0;
> >>> +
> >>> + init_completion(&ccr.complete);
> >>> + ccr.ictrl = ictrl;
> >>> +
> >>> + spin_lock_irqsave(&sctrl->lock, flags);
> >>> + list_add_tail(&ccr.list, &sctrl->ccrs);
> >>> + spin_unlock_irqrestore(&sctrl->lock, flags);
> >>> +
> >>> + c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> >>> + c.ccr.ciu = ictrl->ciu;
> >>> + c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> >>> + c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> >>> + ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> >>> + NULL, 0, NVME_QID_ANY, 0);
> >>> + if (ret)
> >>> + goto out;
> >>> +
> >>> + result = le32_to_cpu(res.u32);
> >>> + if (result & 0x01) /* Immediate Reset */
> >>> + goto out;
> >>> +
> >>> + tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> >>> + remain = wait_for_completion_timeout(&ccr.complete, tmo);
> >>> + if (!remain)
> >> I think remain is redundant here.
> > Deleted 'remain'.
> >
> >>> + ret = -EAGAIN;
> >>> +out:
> >>> + spin_lock_irqsave(&sctrl->lock, flags);
> >>> + list_del(&ccr.list);
> >>> + spin_unlock_irqrestore(&sctrl->lock, flags);
> >>> + return ccr.ccrs == 1 ? 0 : ret;
> >> Why would you still return 0 and not EAGAIN? you expired on timeout but
> >> still
> >> return success if you have ccrs=1? btw you have ccrs in the ccr struct
> >> and in the controller
> >> as a list. Let's rename to distinguish the two.
> > True, we did expire timeout here. However, after we removed the ccr
> > entry we found that it was marked as completed. We return success in
> > this case even though we hit timeout.
>
> When does this happen? Why is it worth having the code non-intuitive for
> something that effectively never happens (unless I'm missing something?)
Agree. It is a very low probability. I deleted the check for this
condition.
>
> >
> > Renamed ctrl->ccrs to ctrl->ccr_list.
> >
> >>> +}
> >>> +
> >>> +unsigned long nvme_recover_ctrl(struct nvme_ctrl *ictrl)
> >>> +{
> >> I'd call it nvme_fence_controller()
> > Okay. I will do that. I will also rename the controller state FENCING.
> >
> >>> + unsigned long deadline, now, timeout;
> >>> + struct nvme_ctrl *sctrl;
> >>> + u32 min_cntlid = 0;
> >>> + int ret;
> >>> +
> >>> + timeout = nvme_recovery_timeout_ms(ictrl);
> >>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> >>> +
> >>> + now = jiffies;
> >>> + deadline = now + msecs_to_jiffies(timeout);
> >>> + while (time_before(now, deadline)) {
> >>> + sctrl = nvme_find_ccr_ctrl(ictrl, min_cntlid);
> >>> + if (!sctrl) {
> >>> + /* CCR failed, switch to time-based recovery */
> >>> + return deadline - now;
> >> It is not clear what is the return code semantics of this function.
> >> How about making it success/failure and have the caller choose what to do?
> > The function returns 0 on success. On failure it returns the time in
> > jiffies to hold requests for before they are canceled. On failure the
> > returned time is essentially the hold time defined in TP4129 minus the
> > time it took to attempt CCR.
>
> I think it would be cleaner to simply have this function return a status
> code and
> have the caller worry about time spent.
nvme_fence_ctrl() needs to track the time. It needs to be aware of how
much time was spent attempting CCR in order to decide whether to continue
trying CCR or give up.
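To illustrate the trade-off being discussed, here is a userspace model of the split: the fencing loop returns a plain status code, and the caller alone derives the residual TP4129 hold time. This is a sketch only; the names (fence_attempts, residual_hold_ms) are hypothetical and not from the patchset.

```c
#include <stdbool.h>

/* One CCR attempt; a real implementation issues the command and waits.
 * Modeled here as always failing so the loop runs to the deadline. */
bool try_ccr_once(void) { return false; }

/* Returns 0 on success, -1 if every attempt failed before the deadline.
 * The caller passes in and reads back the clock; the function itself
 * reports only status. */
int fence_attempts(unsigned long deadline_ms, unsigned long *now_ms,
		   unsigned long attempt_cost_ms)
{
	while (*now_ms < deadline_ms) {
		if (try_ccr_once())
			return 0;
		*now_ms += attempt_cost_ms; /* time consumed by the attempt */
	}
	return -1;
}

/* Caller-side accounting: hold requests for whatever budget remains. */
unsigned long residual_hold_ms(unsigned long deadline_ms, unsigned long now_ms)
{
	return now_ms < deadline_ms ? deadline_ms - now_ms : 0;
}
```

Either way the elapsed time has to be tracked somewhere; the model just moves the hold-time arithmetic out of the fencing function and into its caller.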
>
> >
> >>> + }
> >>> +
> >>> + ret = nvme_issue_wait_ccr(sctrl, ictrl);
> >>> + atomic_inc(&sctrl->ccr_limit);
> >> inc after you wait for the ccr? shouldn't this be before?
> > I think it should be after we wait for CCR. sctrl->ccr_limit is the
> > number of concurrent CCRs the controller supports. Only after we are
> > done with CCR on this controller we increment it.
>
> Maybe it should be folded into nvme_issue_wait_ccr for symmetry?
Done.
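For clarity, the folded-in version can be modeled in userspace C11 like this: ccr_limit behaves as a counting semaphore for the concurrent CCRs a source controller advertises, acquired before the command and released inside the helper once the wait ends. do_ccr() is a hypothetical stand-in for submitting the command and waiting for completion; this is a sketch, not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

atomic_int ccr_limit = 2; /* concurrent CCRs the controller allows */

bool do_ccr(void) { return true; } /* stand-in: submit + wait */

/* Acquire a slot; mirrors atomic_dec_if_positive() in the patch. */
bool ccr_try_acquire(void)
{
	int v = atomic_load(&ccr_limit);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&ccr_limit, &v, v - 1))
			return true;
	}
	return false;
}

/* The slot is released inside the helper, only after the wait ends,
 * so the concurrency budget is held for the CCR's full lifetime. */
int issue_wait_ccr(void)
{
	int ret = do_ccr() ? 0 : -1;

	atomic_fetch_add(&ccr_limit, 1); /* symmetric release */
	return ret;
}
```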
>
> >
> >>> +
> >>> + if (!ret) {
> >>> + dev_info(ictrl->device, "CCR succeeded using %s\n",
> >>> + dev_name(sctrl->device));
> >>> + blk_put_queue(sctrl->admin_q);
> >>> + nvme_put_ctrl(sctrl);
> >>> + return 0;
> >>> + }
> >>> +
> >>> + /* Try another controller */
> >>> + min_cntlid = sctrl->cntlid + 1;
> >> OK, I see why min_cntlid is used. That is very non-intuitive.
> >>
> >> I'm wondering if it will be simpler to take one-shot at ccr and
> >> if it fails fallback to crt. I mean, if the sctrl is alive, and it was
> >> unable
> >> to reset the ictrl in time, how would another ctrl do a better job here?
> > We need to attempt CCR from multiple controllers for reason explained in
> > another response. As you figured out min_cntlid is needed in order to
> > not loop controller list forever. Do you have a better idea?
>
> No, just know that I don't like it very much :)
>
> >
> >>> + blk_put_queue(sctrl->admin_q);
> >>> + nvme_put_ctrl(sctrl);
> >>> + now = jiffies;
> >>> + }
> >>> +
> >>> + dev_info(ictrl->device, "CCR reached timeout, call it done\n");
> >>> + return 0;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(nvme_recover_ctrl);
> >>> +
> >>> +void nvme_end_ctrl_recovery(struct nvme_ctrl *ctrl)
> >>> +{
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave(&ctrl->lock, flags);
> >>> + WRITE_ONCE(ctrl->state, NVME_CTRL_RESETTING);
> >> This needs to be a proper state transition.
> > We do not want to have proper transition from RECOVERING to RESETTING.
> > The reason is that we do not want the controller to be reset while it is
> > being recovered/fenced because requests should not be canceled. One way
> > to keep the transitions in nvme_change_ctrl_state() is to use two
> > states. Say FENCING and FENCED.
> >
> > The allowed transitions are
> >
> > - LIVE -> FENCING
> > - FENCING -> FENCED
> > - FENCED -> (RESETTING, DELETING)
> >
> > This will also get rid of NVME_CTRL_RECOVERED
> >
> > Does this sound good?
>
> We could do what failfast is doing, in case we get transition FENCING ->
> RESETTING/DELETING we flush
> the fence_work...
Yes. This is what v2 does.
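The transition rules discussed above can be sketched as they might slot into nvme_change_ctrl_state(). State names follow the FENCING proposal; the flush behavior is that FENCING -> RESETTING/DELETING is permitted and the caller then flushes the fence work, like nvme_stop_ctrl() flushing failfast_work. This is a model only, not the actual kernel state machine.

```c
#include <stdbool.h>

/* Reduced state set covering only the transitions under discussion. */
enum ctrl_state { ST_LIVE, ST_FENCING, ST_RESETTING, ST_DELETING };

bool transition_allowed(enum ctrl_state old, enum ctrl_state new_state)
{
	switch (new_state) {
	case ST_FENCING:
		/* only a live controller can start being fenced */
		return old == ST_LIVE;
	case ST_RESETTING:
	case ST_DELETING:
		/* leaving FENCING is allowed; the caller must then
		 * flush the fence work before proceeding */
		return old == ST_LIVE || old == ST_FENCING;
	default:
		return false;
	}
}
```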
* Re: [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion
2026-01-04 21:09 ` Sagi Grimberg
2026-01-07 2:58 ` Randy Jennings
@ 2026-01-30 22:31 ` Mohamed Khalfella
1 sibling, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:31 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sun 2026-01-04 23:09:54 +0200, Sagi Grimberg wrote:
>
>
> On 01/01/2026 0:00, Mohamed Khalfella wrote:
> > On Sat 2025-12-27 11:48:49 +0200, Sagi Grimberg wrote:
> >> On 25/12/2025 20:13, Mohamed Khalfella wrote:
> >>> On Thu 2025-12-25 15:23:51 +0200, Sagi Grimberg wrote:
> >>>> On 26/11/2025 4:11, Mohamed Khalfella wrote:
> >>>>> Send an AEN to initiator when impacted controller exists. The
> >>>>> notification points to CCR log page that initiator can read to check
> >>>>> which CCR operation completed.
> >>>>>
> >>>>> Signed-off-by: Mohamed Khalfella<mkhalfella@purestorage.com>
> >>>>> ---
> >>>>> drivers/nvme/target/core.c | 27 +++++++++++++++++++++++----
> >>>>> drivers/nvme/target/nvmet.h | 3 ++-
> >>>>> include/linux/nvme.h | 3 +++
> >>>>> 3 files changed, 28 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> >>>>> index 7dbe9255ff42..60173833c3eb 100644
> >>>>> --- a/drivers/nvme/target/core.c
> >>>>> +++ b/drivers/nvme/target/core.c
> >>>>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> >>>>> nvmet_async_events_process(ctrl);
> >>>>> }
> >>>>>
> >>>>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> u8 event_info, u8 log_page)
> >>>>> {
> >>>>> struct nvmet_async_event *aen;
> >>>>> @@ -215,12 +215,17 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> aen->event_info = event_info;
> >>>>> aen->log_page = log_page;
> >>>>>
> >>>>> - mutex_lock(&ctrl->lock);
> >>>>> list_add_tail(&aen->entry, &ctrl->async_events);
> >>>>> - mutex_unlock(&ctrl->lock);
> >>>>>
> >>>>> queue_work(nvmet_wq, &ctrl->async_event_work);
> >>>>> }
> >>>>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >>>>> + u8 event_info, u8 log_page)
> >>>>> +{
> >>>>> + mutex_lock(&ctrl->lock);
> >>>>> + nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> >>>>> + mutex_unlock(&ctrl->lock);
> >>>>> +}
> >>>>>
> >>>>> static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> >>>>> {
> >>>>> @@ -1788,6 +1793,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >>>>> }
> >>>>> EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >>>>>
> >>>>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> >>>>> +{
> >>>>> + lockdep_assert_held(&ctrl->lock);
> >>>>> +
> >>>>> + if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> >>>>> + return;
> >>>>> +
> >>>>> + nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> >>>>> + NVME_AER_NOTICE_CCR_COMPLETED,
> >>>>> + NVME_LOG_CCR);
> >>>>> +}
> >>>>> +
> >>>>> static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>>>> {
> >>>>> struct nvmet_subsys *subsys = ctrl->subsys;
> >>>>> @@ -1801,8 +1818,10 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >>>>> list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> >>>>> mutex_lock(&sctrl->lock);
> >>>>> list_for_each_entry(ccr, &sctrl->ccrs, entry) {
> >>>>> - if (ccr->ctrl == ctrl)
> >>>>> + if (ccr->ctrl == ctrl) {
> >>>>> + nvmet_ctrl_notify_ccr(sctrl);
> >>>>> ccr->ctrl = NULL;
> >>>>> + }
> >>>> Is this double loop necessary? Would you have more than one controller
> >>>> cross resetting the same
> >>> As it is implemented now CCRs are linked to sctrl. This decision can be
> >>> revisited if found suboptimal. At some point I had CCRs linked to
> >>> ctrl->subsys but that led to lock ordering issues. Double loop is
> >>> necessary to find all CCRs in all controllers and mark them done.
> >>> Yes, it is possible to have more than one sctrl resetting the same
> >>> ictrl.
> >> I'm more interested in simplifying.
> >>
> >>>> controller? Won't it be better to install a callback+opaque that the
> >>>> controller removal will call?
> >>> Can you elaborate more on that? Better in what terms?
> >>>
> >>> nvmet_ctrl_complete_pending_ccr() is called from nvmet_ctrl_free() when
> >>> we know that ctrl->ref is zero and no new CCRs will be added to this
> >>> controller because nvmet_ctrl_find_get_ccr() will not be able to get it.
> >> In nvmet, the controller is serving a single host. Hence I am not sure I
> >> understand how multiple source controllers will try to reset the impacted
> >> controller. So, if there is a 1-1 relationship between source and impacted
> >> controller, I'd perhaps suggest to simplify and install on the impacted
> >> controller
> >> callback+opaque (e.g. void *data) instead of having it iterate and then
> >> actually send
> >> the AEN from the impacted controller.
> > A controller is serving a single path for a given host. A host that is
> > connected to nvme subsystem via multiple paths will have more than one
> > controller. I can think of two reasons why we need to support resetting
> > an impacted controller from multiple source controllers.
> >
> > - It is possible for multiple paths to go down at the same time. The
> > first source controller we use for CCR, even though we check to see if
> > LIVE, might have lost connection to subsystem. It is a matter of time
> > for it to see keepalive timeout and fail too. If CCR fails using this
> > controller we should not give up. We need to try other paths.
>
> But the host is doing the cross-reset synchronously... it waits for
> kato for a completion of the reset, and then gives up, it's not like it
> is sitting there waiting for the AEN...
>
> Generally, even though the spec states a capability/flexibility, it is
> still Linux's
> choice whether to implement it. I'm trying to understand if we can
> simplify Linux host and controller in this non-trivial error recovery flow.
>
> What is your expectation to happen in general? what are your expected
> kato/cqt
> values? how many attempts do we want the host to do?
A target that supports CCR is expected to support CQT as well. If the
initiator notices a path failure then it either needs to get the
impacted controller reset via CCR or wait for time-based recovery.
Time-based recovery adds a long delay (n * kato + cqt); CCR is expected
to take far less time. This is why the initiator should try CCR for as
long as there is time and there are paths to do so. The alternative is
to sit and wait and do nothing.
kato defaults to 5 seconds. cqt depends on the implementation on the
target. I tested these changes with 30s of cqt. With traffic-based
keepalive, time-based recovery takes 3 * 5 + 30 = 45s. In the case of
partial path failure, CCR finished in milliseconds. This is why it is
better to try every possible path before giving up. There is no fixed
number of attempts. As long as there is time to try CCR and there are
paths to try, the initiator should continue trying CCR. Again, we do
not want to fall back to time-based recovery unless there are no other
options.
	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
		ret = -ETIMEDOUT;
		goto out;
	}
This is the code that waits for CCR to finish after the CCR command was
accepted by the target. The spec does not say how long the host should
wait. max(cqt, kato) felt like a reasonable value.
>
> > - Some nvme subsystems might support resetting the impacted
> > controller from only a subset of the controllers connected to the
> > host. An array that has multiple frontend engines might not support
> > resetting controllers across engines. In fact, TP8028 allows the
> > subsystem to suggest to the host to use another source controller in
> > the Alternate Controller ID (ACID) field on the CCR logpage (not
> > implemented in this patchset).
>
> It is not a case though where the impacted controller will be reset
> from multiple source controllers at the same time...
>
> I'd also say that if indeed there are subsystems that require specific
> controllers to do cross recovery, they won't be able to use this at
> all... Are there any such arrays?
Not at all. CCR is still useful in the case of partial connectivity
failure to the same engine. I cannot name a product that does that
today.
* Re: [RFC PATCH 09/14] nvme: Implement cross-controller reset completion
2026-01-04 21:15 ` Sagi Grimberg
@ 2026-01-30 22:32 ` Mohamed Khalfella
0 siblings, 0 replies; 68+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:32 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
Aaron Dailey, Randy Jennings, John Meneghini, Hannes Reinecke,
linux-nvme, linux-kernel
On Sun 2026-01-04 23:15:53 +0200, Sagi Grimberg wrote:
>
>
> On 01/01/2026 1:51, Mohamed Khalfella wrote:
> > On Sat 2025-12-27 12:24:17 +0200, Sagi Grimberg wrote:
> >>> + log = kmalloc(sizeof(*log), GFP_KERNEL);
> >>> + if (!log)
> >>> + return;
> >>> +
> >>> + ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> >>> + 0x00, log, sizeof(*log), 0);
> >>> + if (ret)
> >>> + goto out;
> >>> +
> >>> + spin_lock_irqsave(&ctrl->lock, flags);
> >>> + for (i = 0; i < le16_to_cpu(log->ne); i++) {
> >>> + entry = &log->entries[i];
> >>> + if (entry->ccrs == 0) /* skip in progress entries */
> >>> + continue;
> >> What does ccrs stand for?
> > Cross-Controller Reset Status
> >
> > 0x00 -> In Progress
> > 0x01 -> Success
> > 0x02 -> Failed
> > 0x03 - 0xff -> Reserved
>
> Let's add it as a proper enumeration please.
>
Done
end of thread, other threads:[~2026-01-30 22:32 UTC | newest]
Thread overview: 68+ messages
2025-11-26 2:11 [RFC PATCH 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2025-12-16 1:35 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
2025-12-16 1:43 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
2025-12-16 3:01 ` Randy Jennings
2025-12-31 21:14 ` Mohamed Khalfella
2025-12-25 13:14 ` Sagi Grimberg
2025-12-25 17:33 ` Mohamed Khalfella
2025-12-27 9:39 ` Sagi Grimberg
2025-12-31 21:35 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
2025-12-16 3:11 ` Randy Jennings
2025-11-26 2:11 ` [RFC PATCH 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
2025-12-16 3:31 ` Randy Jennings
2025-12-25 13:23 ` Sagi Grimberg
2025-12-25 18:13 ` Mohamed Khalfella
2025-12-27 9:48 ` Sagi Grimberg
2025-12-31 22:00 ` Mohamed Khalfella
2026-01-04 21:09 ` Sagi Grimberg
2026-01-07 2:58 ` Randy Jennings
2026-01-30 22:31 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
2025-12-18 15:22 ` Randy Jennings
2025-12-31 22:26 ` Mohamed Khalfella
2026-01-02 19:06 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 07/14] nvme: Add RECOVERING nvme controller state Mohamed Khalfella
2025-12-18 23:18 ` Randy Jennings
2025-12-19 1:39 ` Randy Jennings
2025-12-25 13:29 ` Sagi Grimberg
2025-12-25 17:17 ` Mohamed Khalfella
2025-12-27 9:52 ` Sagi Grimberg
2025-12-31 22:45 ` Mohamed Khalfella
2025-12-27 9:55 ` Sagi Grimberg
2025-12-31 22:36 ` Mohamed Khalfella
2025-12-31 23:04 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2025-12-19 1:21 ` Randy Jennings
2025-12-27 10:14 ` Sagi Grimberg
2025-12-31 0:04 ` Randy Jennings
2026-01-04 21:14 ` Sagi Grimberg
2026-01-07 3:16 ` Randy Jennings
2025-12-31 23:43 ` Mohamed Khalfella
2026-01-04 21:39 ` Sagi Grimberg
2026-01-30 22:01 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
2025-12-19 1:31 ` Randy Jennings
2025-12-27 10:24 ` Sagi Grimberg
2025-12-31 23:51 ` Mohamed Khalfella
2026-01-04 21:15 ` Sagi Grimberg
2026-01-30 22:32 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-12-19 2:06 ` Randy Jennings
2026-01-01 0:04 ` Mohamed Khalfella
2025-12-27 10:35 ` Sagi Grimberg
2025-12-31 0:13 ` Randy Jennings
2026-01-04 21:19 ` Sagi Grimberg
2026-01-01 0:27 ` Mohamed Khalfella
2025-11-26 2:11 ` [RFC PATCH 11/14] nvme-rdma: " Mohamed Khalfella
2025-12-19 2:16 ` Randy Jennings
2025-12-27 10:36 ` Sagi Grimberg
2025-11-26 2:11 ` [RFC PATCH 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
2025-12-19 2:59 ` Randy Jennings
2025-11-26 2:12 ` [RFC PATCH 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
2025-12-20 1:21 ` Randy Jennings
2025-11-26 2:12 ` [RFC PATCH 14/14] nvme-fc: Hold inflight requests while in RECOVERING state Mohamed Khalfella
2025-12-20 1:44 ` Randy Jennings