public inbox for linux-nvme@lists.infradead.org
* [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery
@ 2026-01-30 22:34 Mohamed Khalfella
  2026-01-30 22:34 ` [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
                   ` (13 more replies)
  0 siblings, 14 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

This patchset adds support for TP8028 Rapid Path Failure Recovery for
both the nvme target and initiator. Rapid Path Failure Recovery brings
Cross-Controller Reset (CCR) functionality to nvme. It allows an nvme
host to send an nvme command to a source nvme controller to reset an
impacted nvme controller, provided that both the source and impacted
controllers are in the same nvme subsystem.

The main use of CCR is when one path to an nvme subsystem fails.
Inflight IOs on the impacted nvme controller need to be terminated
before they can be retried on another path; otherwise data corruption
may occur. CCR provides a quick way to terminate these IOs on the
unreachable nvme controller, allowing recovery to proceed quickly and
avoiding unnecessary delays. If CCR is not possible, inflight requests
are held for the duration defined by TP4129 KATO Corrections and
Clarifications before they are allowed to be retried.


On the target side:

* New struct members have been added to support CCR. struct nvme_id_ctrl
  has been updated with CIU (Controller Instance Uniquifier), CIRN
  (Controller Instance Random Number), and CQT (Command Quiesce Time).
  The combination of CIU, CNTLID, and CIRN is used to identify the
  impacted controller in a CCR command.

* The CCR nvme command implemented on the target causes the impacted
  controller to fail and drop its connections to the host.

* The CCR logpage contains the status of pending CCR requests. An entry
  is added to the logpage after a CCR request is validated. Completed
  CCR requests are removed from the logpage when the controller becomes
  ready or when removal is requested in the get logpage command.

* An AEN is sent when CCR completes to let the host know that it is safe
  to retry inflight requests.


On the host side:

* CIU, CIRN, and CQT have been added to struct nvme_ctrl. CIU and CIRN
  have been added to sysfs to make the values visible to userspace. CIU
  and CIRN can be used to construct and manually send admin-passthru
  CCR commands.

* New controller states NVME_CTRL_FENCING and NVME_CTRL_FENCED have
  been added. FENCING prevents cancelling timed-out inflight requests
  while CCR is in progress, and FENCED signals the end of time-based
  recovery.

* Controller fencing in nvme_fence_ctrl() is invoked when a LIVE
  controller hits an error or when a request times out. CCR is
  attempted to reset the impacted controller.

* Updated nvme fabric transports nvme-tcp, nvme-rdma, and nvme-fc to use
  CCR recovery.


Ideally all inflight requests should be held during controller recovery
and only retried after recovery is done. However, there are known
situations where that is not the case in this implementation. These
gaps will be addressed in future patches:

* Manual controller reset from sysfs will result in the controller
  going to the RESETTING state and all inflight requests being canceled
  immediately and possibly retried on another path.

* Manual controller delete from sysfs will also result in all inflight
  requests being canceled immediately and possibly retried on another
  path.

* In nvme-fc the nvme controller will be deleted if the remote port
  disappears with no timeout specified. This results in immediate
  cancellation of requests that may be retried on another path.

* In nvme-rdma, if the HCA is removed all nvme controllers will be
  deleted. This results in canceling inflight IOs that may be retried
  on another path.

Changes from v1:

* nvmet: Rapid Path Failure Recovery set controller identify fields
  - Added subsys->cqt, which defaults to 0 to maintain current behavior.
  - subsys->cqt is configurable via configfs
  - Added ctrl->cqt initialized from subsys->cqt.
  - Renamed ctrl->uniquifier to ctrl->ciu, ctrl->random to ctrl->cirn.

* nvmet: Implement CCR nvme command
  - Refactored nvmet_execute_cross_ctrl_reset() for simpler error handling
  - Renamed CCR list from ctrl->ccrs to ctrl->ccr_list.

* nvmet: Implement CCR logpage
  - Added CCR status and flags enums

* nvme: Rapid Path Failure Recovery read controller identify fields
  - Renamed ctrl sysfs attributes uniquifier -> ciu, random -> cirn

* nvme: Introduce FENCING and FENCED controller states
  - Added two states (FENCING and FENCED) instead of (RECOVERING and
    controller flag RECOVERED)
  - Updated __nvme_check_ready() such that a fabric controller in
    FENCING state is not ready to send requests. Also, a request sent
    while the controller is in FENCING state is completed with a host
    path error instead of returning BLK_STS_RESOURCE.

* nvme: Implement cross-controller reset recovery
  - Renamed nvme_find_ccr_ctrl() to *nvme_find_ctrl_ccr() to pair with
    newly added nvme_put_ctrl_ccr(). The latter handles releasing the
    source controller used to issue the CCR command.
  - Renamed nvme_recover_ctrl() to nvme_fence_ctrl().
  - Deleted nvme_end_ctrl_recovery() because the state change has been
    moved to nvme_change_ctrl_state().
  - Renamed CCR list from ctrl->ccrs to ctrl->ccr_list.

* nvme-tcp: Use CCR to recover controller that hits an error
  - Added ctrl->fencing_work and ctrl->fenced_work instead of changing
    ctrl->err_work and using it for fencing purpose.

* nvme-rdma: Use CCR to recover controller that hits an error
  - Similar change to nvme-tcp.

* nvme-fc: Use CCR to recover controller that hits an error
  - Similar to nvme-rdma and nvme-tcp.

* nvme-fc: Hold inflight requests while in FENCING state
  - Updated nvme_fc_fcpio_done() to hold the first request that starts
    error recovery. That was one of the limitations mentioned in the cover
    letter of v1.

v1: https://lore.kernel.org/all/20251126021250.2583630-1-mkhalfella@purestorage.com/

Mohamed Khalfella (14):
  nvmet: Rapid Path Failure Recovery set controller identify fields
  nvmet/debugfs: Add ctrl uniquifier and random values
  nvmet: Implement CCR nvme command
  nvmet: Implement CCR logpage
  nvmet: Send an AEN on CCR completion
  nvme: Rapid Path Failure Recovery read controller identify fields
  nvme: Introduce FENCING and FENCED controller states
  nvme: Implement cross-controller reset recovery
  nvme: Implement cross-controller reset completion
  nvme-tcp: Use CCR to recover controller that hits an error
  nvme-rdma: Use CCR to recover controller that hits an error
  nvme-fc: Decouple error recovery from controller reset
  nvme-fc: Use CCR to recover controller that hits an error
  nvme-fc: Hold inflight requests while in FENCING state

 drivers/nvme/host/constants.c   |   1 +
 drivers/nvme/host/core.c        | 208 +++++++++++++++++++++++++++++-
 drivers/nvme/host/fc.c          | 215 ++++++++++++++++++++++----------
 drivers/nvme/host/nvme.h        |  25 ++++
 drivers/nvme/host/rdma.c        |  62 ++++++++-
 drivers/nvme/host/sysfs.c       |  25 ++++
 drivers/nvme/host/tcp.c         |  62 ++++++++-
 drivers/nvme/target/admin-cmd.c | 124 ++++++++++++++++++
 drivers/nvme/target/configfs.c  |  31 +++++
 drivers/nvme/target/core.c      | 108 +++++++++++++++-
 drivers/nvme/target/debugfs.c   |  21 ++++
 drivers/nvme/target/nvmet.h     |  20 ++-
 include/linux/nvme.h            |  70 ++++++++++-
 13 files changed, 897 insertions(+), 75 deletions(-)


base-commit: 8dfce8991b95d8625d0a1d2896e42f93b9d7f68d
-- 
2.52.0



^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:03   ` Hannes Reinecke
  2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

TP8028 Rapid Path Failure Recovery defines new fields in the controller
identify response. The newly defined fields are:

- CIU (Controller Instance Uniquifier): an 8-bit non-zero value that is
assigned a random value when the controller is first created. The value
is incremented every time the RDY bit in the CSTS register is asserted.
- CIRN (Controller Instance Random Number): a 64-bit random value that
is generated when the controller is created. CIRN is regenerated every
time the RDY bit in the CSTS register is asserted.
- CCRL (Cross-Controller Reset Limit): an 8-bit value that defines the
maximum number of in-progress controller reset operations. CCRL is
hardcoded to 4 as recommended by TP8028.

TP4129 KATO Corrections and Clarifications defines CQT (Command Quiesce
Time), which is used along with KATO (Keep Alive Timeout) to set an
upper time limit for attempting Cross-Controller Recovery. For an NVMe
subsystem CQT is set to 0 by default to keep the current behavior. The
value can be set from configfs if needed.

Make the new fields available for IO controllers only, since TP8028 is
not very useful for discovery controllers.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/target/admin-cmd.c |  6 ++++++
 drivers/nvme/target/configfs.c  | 31 +++++++++++++++++++++++++++++++
 drivers/nvme/target/core.c      | 12 ++++++++++++
 drivers/nvme/target/nvmet.h     |  4 ++++
 include/linux/nvme.h            | 15 ++++++++++++---
 5 files changed, 65 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 3da31bb1183e..ade1145df72d 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -696,6 +696,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 
 	id->cntlid = cpu_to_le16(ctrl->cntlid);
 	id->ver = cpu_to_le32(ctrl->subsys->ver);
+	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+		id->cqt = cpu_to_le16(ctrl->cqt);
+		id->ciu = ctrl->ciu;
+		id->cirn = cpu_to_le64(ctrl->cirn);
+		id->ccrl = NVMF_CCR_LIMIT;
+	}
 
 	/* XXX: figure out what to do about RTD3R/RTD3 */
 	id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e44ef69dffc2..035f6e75a818 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -1636,6 +1636,36 @@ static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
 CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
 #endif
 
+static ssize_t nvmet_subsys_attr_cqt_show(struct config_item *item,
+					  char *page)
+{
+	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->cqt);
+}
+
+static ssize_t nvmet_subsys_attr_cqt_store(struct config_item *item,
+					   const char *page, size_t cnt)
+{
+	struct nvmet_subsys *subsys = to_subsys(item);
+	struct nvmet_ctrl *ctrl;
+	u16 cqt;
+
+	if (sscanf(page, "%hu\n", &cqt) != 1)
+		return -EINVAL;
+
+	down_write(&nvmet_config_sem);
+	if (subsys->cqt == cqt)
+		goto out;
+
+	subsys->cqt = cqt;
+	/* Force reconnect */
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
+		ctrl->ops->delete_ctrl(ctrl);
+out:
+	up_write(&nvmet_config_sem);
+	return cnt;
+}
+CONFIGFS_ATTR(nvmet_subsys_, attr_cqt);
+
 static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
 					      char *page)
 {
@@ -1676,6 +1706,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = {
 	&nvmet_subsys_attr_attr_vendor_id,
 	&nvmet_subsys_attr_attr_subsys_vendor_id,
 	&nvmet_subsys_attr_attr_model,
+	&nvmet_subsys_attr_attr_cqt,
 	&nvmet_subsys_attr_attr_qid_max,
 	&nvmet_subsys_attr_attr_ieee_oui,
 	&nvmet_subsys_attr_attr_firmware,
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index cc88e5a28c8a..0d2a1206e08f 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -1393,6 +1393,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
 		return;
 	}
 
+	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
+		ctrl->cirn = get_random_u64();
+	}
 	ctrl->csts = NVME_CSTS_RDY;
 
 	/*
@@ -1661,6 +1665,12 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
 	}
 	ctrl->cntlid = ret;
 
+	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
+		ctrl->cqt = subsys->cqt;
+		ctrl->ciu = get_random_u8() ? : 1;
+		ctrl->cirn = get_random_u64();
+	}
+
 	/*
 	 * Discovery controllers may use some arbitrary high value
 	 * in order to cleanup stale discovery sessions
@@ -1853,10 +1863,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn,
 
 	switch (type) {
 	case NVME_NQN_NVME:
+		subsys->cqt = NVMF_CQT_MS;
 		subsys->max_qid = NVMET_NR_QUEUES;
 		break;
 	case NVME_NQN_DISC:
 	case NVME_NQN_CURR:
+		subsys->cqt = 0;
 		subsys->max_qid = 0;
 		break;
 	default:
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index b664b584fdc8..f5d9a01ec60c 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -264,7 +264,10 @@ struct nvmet_ctrl {
 
 	uuid_t			hostid;
 	u16			cntlid;
+	u16			cqt;
+	u8			ciu;
 	u32			kato;
+	u64			cirn;
 
 	struct nvmet_port	*port;
 
@@ -331,6 +334,7 @@ struct nvmet_subsys {
 #ifdef CONFIG_NVME_TARGET_DEBUGFS
 	struct dentry		*debugfs_dir;
 #endif
+	u16			cqt;
 	u16			max_qid;
 
 	u64			ver;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 655d194f8e72..5135cdc3c120 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -21,6 +21,9 @@
 #define NVMF_TRADDR_SIZE	256
 #define NVMF_TSAS_SIZE		256
 
+#define NVMF_CQT_MS		0
+#define NVMF_CCR_LIMIT		4
+
 #define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"
 
 #define NVME_NSID_ALL		0xffffffff
@@ -328,7 +331,10 @@ struct nvme_id_ctrl {
 	__le16			crdt1;
 	__le16			crdt2;
 	__le16			crdt3;
-	__u8			rsvd134[122];
+	__u8			rsvd134[1];
+	__u8			ciu;
+	__le64			cirn;
+	__u8			rsvd144[112];
 	__le16			oacs;
 	__u8			acl;
 	__u8			aerl;
@@ -362,7 +368,9 @@ struct nvme_id_ctrl {
 	__u8			anacap;
 	__le32			anagrpmax;
 	__le32			nanagrpid;
-	__u8			rsvd352[160];
+	__u8			rsvd352[34];
+	__le16			cqt;
+	__u8			rsvd388[124];
 	__u8			sqes;
 	__u8			cqes;
 	__le16			maxcmd;
@@ -389,7 +397,8 @@ struct nvme_id_ctrl {
 	__u8			msdbd;
 	__u8			rsvd1804[2];
 	__u8			dctype;
-	__u8			rsvd1807[241];
+	__u8			ccrl;
+	__u8			rsvd1808[240];
 	struct nvme_id_power_state	psd[32];
 	__u8			vs[1024];
 };
-- 
2.52.0




* [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
  2026-01-30 22:34 ` [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:04   ` Hannes Reinecke
                     ` (2 more replies)
  2026-01-30 22:34 ` [PATCH v2 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
                   ` (11 subsequent siblings)
  13 siblings, 3 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

Export ctrl->ciu and ctrl->cirn as debugfs files under the controller
debugfs directory.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/target/debugfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/nvme/target/debugfs.c b/drivers/nvme/target/debugfs.c
index 5dcbd5aa86e1..1300adf6c1fb 100644
--- a/drivers/nvme/target/debugfs.c
+++ b/drivers/nvme/target/debugfs.c
@@ -152,6 +152,23 @@ static int nvmet_ctrl_tls_concat_show(struct seq_file *m, void *p)
 }
 NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_concat);
 #endif
+static int nvmet_ctrl_instance_ciu_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_printf(m, "%02x\n", ctrl->ciu);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_ciu);
+
+static int nvmet_ctrl_instance_cirn_show(struct seq_file *m, void *p)
+{
+	struct nvmet_ctrl *ctrl = m->private;
+
+	seq_printf(m, "%016llx\n", ctrl->cirn);
+	return 0;
+}
+NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_cirn);
 
 int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
 {
@@ -184,6 +201,10 @@ int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
 	debugfs_create_file("tls_key", S_IRUSR, ctrl->debugfs_dir, ctrl,
 			    &nvmet_ctrl_tls_key_fops);
 #endif
+	debugfs_create_file("ciu", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_instance_ciu_fops);
+	debugfs_create_file("cirn", S_IRUSR, ctrl->debugfs_dir, ctrl,
+			    &nvmet_ctrl_instance_cirn_fops);
 	return 0;
 }
 
-- 
2.52.0




* [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
  2026-01-30 22:34 ` [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
  2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:19   ` Hannes Reinecke
  2026-02-07 14:11   ` Sagi Grimberg
  2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

As defined by TP8028 Rapid Path Failure Recovery, the CCR
(Cross-Controller Reset) command is an nvme command issued by the
initiator to a source controller to reset an impacted controller.
Implement the CCR command for the linux nvme target.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/target/admin-cmd.c | 74 +++++++++++++++++++++++++++++++++
 drivers/nvme/target/core.c      | 71 +++++++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h     | 13 ++++++
 include/linux/nvme.h            | 23 ++++++++++
 4 files changed, 181 insertions(+)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index ade1145df72d..c0fd8eca2e44 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
 	log->acs[nvme_admin_get_features] =
 	log->acs[nvme_admin_async_event] =
 	log->acs[nvme_admin_keep_alive] =
+	log->acs[nvme_admin_cross_ctrl_reset] =
 		cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
+
 }
 
 static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
@@ -1615,6 +1617,75 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
 	nvmet_req_complete(req, status);
 }
 
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
+{
+	struct nvmet_ctrl *ictrl, *sctrl = req->sq->ctrl;
+	struct nvme_command *cmd = req->cmd;
+	struct nvmet_ccr *ccr, *new_ccr;
+	int ccr_active, ccr_total;
+	u16 cntlid, status = NVME_SC_SUCCESS;
+
+	cntlid = le16_to_cpu(cmd->ccr.icid);
+	if (sctrl->cntlid == cntlid) {
+		req->error_loc =
+			offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
+		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
+		goto out;
+	}
+
+	/* Find and get impacted controller */
+	ictrl = nvmet_ctrl_find_get_ccr(sctrl->subsys, sctrl->hostnqn,
+					cmd->ccr.ciu, cntlid,
+					le64_to_cpu(cmd->ccr.cirn));
+	if (!ictrl) {
+		/* Immediate Reset Successful */
+		nvmet_set_result(req, 1);
+		status = NVME_SC_SUCCESS;
+		goto out;
+	}
+
+	ccr_total = ccr_active = 0;
+	mutex_lock(&sctrl->lock);
+	list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
+		if (ccr->ctrl == ictrl) {
+			status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
+			goto out_unlock;
+		}
+
+		ccr_total++;
+		if (ccr->ctrl)
+			ccr_active++;
+	}
+
+	if (ccr_active >= NVMF_CCR_LIMIT) {
+		status = NVME_SC_CCR_LIMIT_EXCEEDED;
+		goto out_unlock;
+	}
+	if (ccr_total >= NVMF_CCR_PER_PAGE) {
+		status = NVME_SC_CCR_LOGPAGE_FULL;
+		goto out_unlock;
+	}
+
+	new_ccr = kmalloc(sizeof(*new_ccr), GFP_KERNEL);
+	if (!new_ccr) {
+		status = NVME_SC_INTERNAL;
+		goto out_unlock;
+	}
+
+	new_ccr->ciu = cmd->ccr.ciu;
+	new_ccr->icid = cntlid;
+	new_ccr->ctrl = ictrl;
+	list_add_tail(&new_ccr->entry, &sctrl->ccr_list);
+
+out_unlock:
+	mutex_unlock(&sctrl->lock);
+	if (status == NVME_SC_SUCCESS)
+		nvmet_ctrl_fatal_error(ictrl);
+	nvmet_ctrl_put(ictrl);
+out:
+	nvmet_req_complete(req, status);
+}
+
 u32 nvmet_admin_cmd_data_len(struct nvmet_req *req)
 {
 	struct nvme_command *cmd = req->cmd;
@@ -1692,6 +1763,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
 	case nvme_admin_keep_alive:
 		req->execute = nvmet_execute_keep_alive;
 		return 0;
+	case nvme_admin_cross_ctrl_reset:
+		req->execute = nvmet_execute_cross_ctrl_reset;
+		return 0;
 	default:
 		return nvmet_report_invalid_opcode(req);
 	}
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 0d2a1206e08f..54dd0dcfa12b 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -114,6 +114,20 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
 	return 0;
 }
 
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all)
+{
+	struct nvmet_ccr *ccr, *tmp;
+
+	lockdep_assert_held(&ctrl->lock);
+
+	list_for_each_entry_safe(ccr, tmp, &ctrl->ccr_list, entry) {
+		if (all || ccr->ctrl == NULL) {
+			list_del(&ccr->entry);
+			kfree(ccr);
+		}
+	}
+}
+
 static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
 {
 	struct nvmet_ns *cur;
@@ -1396,6 +1410,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
 	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
 		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
 		ctrl->cirn = get_random_u64();
+		nvmet_ctrl_cleanup_ccrs(ctrl, false);
 	}
 	ctrl->csts = NVME_CSTS_RDY;
 
@@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
 	return ctrl;
 }
 
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+					   const char *hostnqn, u8 ciu,
+					   u16 cntlid, u64 cirn)
+{
+	struct nvmet_ctrl *ctrl;
+	bool found = false;
+
+	mutex_lock(&subsys->lock);
+	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+		if (ctrl->cntlid != cntlid)
+			continue;
+		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
+			continue;
+
+		/* Avoid racing with a controller that is becoming ready */
+		mutex_lock(&ctrl->lock);
+		if (ctrl->ciu == ciu && ctrl->cirn == cirn)
+			found = true;
+		mutex_unlock(&ctrl->lock);
+
+		if (found) {
+			if (kref_get_unless_zero(&ctrl->ref))
+				goto out;
+			break;
+		}
+	};
+	ctrl = NULL;
+out:
+	mutex_unlock(&subsys->lock);
+	return ctrl;
+}
+
 u16 nvmet_check_ctrl_status(struct nvmet_req *req)
 {
 	if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) {
@@ -1626,6 +1673,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
 		subsys->clear_ids = 1;
 #endif
 
+	INIT_LIST_HEAD(&ctrl->ccr_list);
 	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
 	INIT_LIST_HEAD(&ctrl->async_events);
 	INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
@@ -1740,12 +1788,35 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
 }
 EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
 
+static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
+{
+	struct nvmet_subsys *subsys = ctrl->subsys;
+	struct nvmet_ctrl *sctrl;
+	struct nvmet_ccr *ccr;
+
+	mutex_lock(&ctrl->lock);
+	nvmet_ctrl_cleanup_ccrs(ctrl, true);
+	mutex_unlock(&ctrl->lock);
+
+	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
+		mutex_lock(&sctrl->lock);
+		list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
+			if (ccr->ctrl == ctrl) {
+				ccr->ctrl = NULL;
+				break;
+			}
+		}
+		mutex_unlock(&sctrl->lock);
+	}
+}
+
 static void nvmet_ctrl_free(struct kref *ref)
 {
 	struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
 	struct nvmet_subsys *subsys = ctrl->subsys;
 
 	mutex_lock(&subsys->lock);
+	nvmet_ctrl_complete_pending_ccr(ctrl);
 	nvmet_ctrl_destroy_pr(ctrl);
 	nvmet_release_p2p_ns_map(ctrl);
 	list_del(&ctrl->subsys_entry);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f5d9a01ec60c..93d6ac41cf85 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -269,6 +269,7 @@ struct nvmet_ctrl {
 	u32			kato;
 	u64			cirn;
 
+	struct list_head	ccr_list;
 	struct nvmet_port	*port;
 
 	u32			aen_enabled;
@@ -315,6 +316,13 @@ struct nvmet_ctrl {
 	struct nvmet_pr_log_mgr pr_log_mgr;
 };
 
+struct nvmet_ccr {
+	struct nvmet_ctrl	*ctrl;
+	struct list_head	entry;
+	u16			icid;
+	u8			ciu;
+};
+
 struct nvmet_subsys {
 	enum nvme_subsys_type	type;
 
@@ -578,6 +586,7 @@ void nvmet_req_free_sgls(struct nvmet_req *req);
 void nvmet_execute_set_features(struct nvmet_req *req);
 void nvmet_execute_get_features(struct nvmet_req *req);
 void nvmet_execute_keep_alive(struct nvmet_req *req);
+void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req);
 
 u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
 u16 nvmet_check_io_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
@@ -620,6 +629,10 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args);
 struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
 				       const char *hostnqn, u16 cntlid,
 				       struct nvmet_req *req);
+struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
+					   const char *hostnqn, u8 ciu,
+					   u16 cntlid, u64 cirn);
+void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all);
 void nvmet_ctrl_put(struct nvmet_ctrl *ctrl);
 u16 nvmet_check_ctrl_status(struct nvmet_req *req);
 ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl,
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 5135cdc3c120..0f305b317aa3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -23,6 +23,7 @@
 
 #define NVMF_CQT_MS		0
 #define NVMF_CCR_LIMIT		4
+#define NVMF_CCR_PER_PAGE	511
 
 #define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"
 
@@ -1225,6 +1226,22 @@ struct nvme_zone_mgmt_recv_cmd {
 	__le32			cdw14[2];
 };
 
+struct nvme_cross_ctrl_reset_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			rsvd10;
+	__u8			ciu;
+	__le16			icid;
+	__le32			cdw11;
+	__le64			cirn;
+	__le32			cdw14;
+	__le32			cdw15;
+};
+
 struct nvme_io_mgmt_recv_cmd {
 	__u8			opcode;
 	__u8			flags;
@@ -1323,6 +1340,7 @@ enum nvme_admin_opcode {
 	nvme_admin_virtual_mgmt		= 0x1c,
 	nvme_admin_nvme_mi_send		= 0x1d,
 	nvme_admin_nvme_mi_recv		= 0x1e,
+	nvme_admin_cross_ctrl_reset	= 0x38,
 	nvme_admin_dbbuf		= 0x7C,
 	nvme_admin_format_nvm		= 0x80,
 	nvme_admin_security_send	= 0x81,
@@ -1356,6 +1374,7 @@ enum nvme_admin_opcode {
 		nvme_admin_opcode_name(nvme_admin_virtual_mgmt),	\
 		nvme_admin_opcode_name(nvme_admin_nvme_mi_send),	\
 		nvme_admin_opcode_name(nvme_admin_nvme_mi_recv),	\
+		nvme_admin_opcode_name(nvme_admin_cross_ctrl_reset),	\
 		nvme_admin_opcode_name(nvme_admin_dbbuf),		\
 		nvme_admin_opcode_name(nvme_admin_format_nvm),		\
 		nvme_admin_opcode_name(nvme_admin_security_send),	\
@@ -2009,6 +2028,7 @@ struct nvme_command {
 		struct nvme_dbbuf dbbuf;
 		struct nvme_directive_cmd directive;
 		struct nvme_io_mgmt_recv_cmd imr;
+		struct nvme_cross_ctrl_reset_cmd ccr;
 	};
 };
 
@@ -2173,6 +2193,9 @@ enum {
 	NVME_SC_PMR_SAN_PROHIBITED	= 0x123,
 	NVME_SC_ANA_GROUP_ID_INVALID	= 0x124,
 	NVME_SC_ANA_ATTACH_FAILED	= 0x125,
+	NVME_SC_CCR_IN_PROGRESS		= 0x13f,
+	NVME_SC_CCR_LOGPAGE_FULL	= 0x140,
+	NVME_SC_CCR_LIMIT_EXCEEDED	= 0x141,
 
 	/*
 	 * I/O Command Set Specific - NVM commands:
-- 
2.52.0




* [PATCH v2 04/14] nvmet: Implement CCR logpage
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (2 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:21   ` Hannes Reinecke
                     ` (2 more replies)
  2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

As defined by TP8028 Rapid Path Failure Recovery, the CCR
(Cross-Controller Reset) log page contains an entry for each CCR
request submitted to a source controller. Implement the CCR logpage
for the linux nvme target.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/target/admin-cmd.c | 44 +++++++++++++++++++++++++++++++++
 include/linux/nvme.h            | 29 ++++++++++++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index c0fd8eca2e44..fdeb79ec348a 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -220,6 +220,7 @@ static void nvmet_execute_get_supported_log_pages(struct nvmet_req *req)
 	logs->lids[NVME_LOG_FEATURES] = cpu_to_le32(NVME_LIDS_LSUPP);
 	logs->lids[NVME_LOG_RMI] = cpu_to_le32(NVME_LIDS_LSUPP);
 	logs->lids[NVME_LOG_RESERVATION] = cpu_to_le32(NVME_LIDS_LSUPP);
+	logs->lids[NVME_LOG_CCR] = cpu_to_le32(NVME_LIDS_LSUPP);
 
 	status = nvmet_copy_to_sgl(req, 0, logs, sizeof(*logs));
 	kfree(logs);
@@ -608,6 +609,47 @@ static void nvmet_execute_get_log_page_features(struct nvmet_req *req)
 	nvmet_req_complete(req, status);
 }
 
+static void nvmet_execute_get_log_page_ccr(struct nvmet_req *req)
+{
+	struct nvmet_ctrl *ctrl = req->sq->ctrl;
+	struct nvmet_ccr *ccr;
+	struct nvme_ccr_log *log;
+	int index = 0;
+	u16 status;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log) {
+		status = NVME_SC_INTERNAL;
+		goto out;
+	}
+
+	mutex_lock(&ctrl->lock);
+	list_for_each_entry(ccr, &ctrl->ccr_list, entry) {
+		u8 flags = NVME_CCR_FLAGS_VALIDATED | NVME_CCR_FLAGS_INITIATED;
+		u8 status = ccr->ctrl ? NVME_CCR_STATUS_IN_PROGRESS :
+					NVME_CCR_STATUS_SUCCESS;
+
+		log->entries[index].icid = cpu_to_le16(ccr->icid);
+		log->entries[index].ciu = ccr->ciu;
+		log->entries[index].acid = cpu_to_le16(0xffff);
+		log->entries[index].ccrs = status;
+		log->entries[index].ccrf = flags;
+		index++;
+	}
+
+	/* Cleanup completed CCRs if requested */
+	if (req->cmd->get_log_page.lsp & 0x1)
+		nvmet_ctrl_cleanup_ccrs(ctrl, false);
+	mutex_unlock(&ctrl->lock);
+
+	log->ne = cpu_to_le16(index);
+	nvmet_clear_aen_bit(req, NVME_AEN_BIT_CCR_COMPLETE);
+	status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log));
+	kfree(log);
+out:
+	nvmet_req_complete(req, status);
+}
+
 static void nvmet_execute_get_log_page(struct nvmet_req *req)
 {
 	if (!nvmet_check_transfer_len(req, nvmet_get_log_page_len(req->cmd)))
@@ -641,6 +683,8 @@ static void nvmet_execute_get_log_page(struct nvmet_req *req)
 		return nvmet_execute_get_log_page_rmi(req);
 	case NVME_LOG_RESERVATION:
 		return nvmet_execute_get_log_page_resv(req);
+	case NVME_LOG_CCR:
+		return nvmet_execute_get_log_page_ccr(req);
 	}
 	pr_debug("unhandled lid %d on qid %d\n",
 	       req->cmd->get_log_page.lid, req->sq->qid);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 0f305b317aa3..3e189674b69e 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -1435,6 +1435,7 @@ enum {
+	NVME_LOG_CCR		= 0x1E,
 	NVME_LOG_FDP_CONFIGS	= 0x20,
 	NVME_LOG_DISC		= 0x70,
 	NVME_LOG_RESERVATION	= 0x80,
 	NVME_FWACT_REPL		= (0 << 3),
 	NVME_FWACT_REPL_ACTV	= (1 << 3),
 	NVME_FWACT_ACTV		= (2 << 3),
@@ -1458,6 +1459,34 @@ enum {
 	NVME_FIS_CSCPE	= 1 << 21,
 };
 
+/* NVMe Cross-Controller Reset Status */
+enum {
+	NVME_CCR_STATUS_IN_PROGRESS,
+	NVME_CCR_STATUS_SUCCESS,
+	NVME_CCR_STATUS_FAILED,
+};
+
+/* NVMe Cross-Controller Reset Flags */
+enum {
+	NVME_CCR_FLAGS_VALIDATED	= 0x01,
+	NVME_CCR_FLAGS_INITIATED	= 0x02,
+};
+
+struct nvme_ccr_log_entry {
+	__le16			icid;
+	__u8			ciu;
+	__u8			rsvd3;
+	__le16			acid;
+	__u8			ccrs;
+	__u8			ccrf;
+};
+
+struct nvme_ccr_log {
+	__le16				ne;
+	__u8				rsvd2[6];
+	struct nvme_ccr_log_entry	entries[NVMF_CCR_PER_PAGE];
+};
+
 /* NVMe Namespace Write Protect State */
 enum {
 	NVME_NS_NO_WRITE_PROTECT = 0,
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (3 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:27   ` Hannes Reinecke
                     ` (2 more replies)
  2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
                   ` (8 subsequent siblings)
  13 siblings, 3 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

Send an AEN to the initiator when a pending CCR on the impacted
controller completes. The notification points to the CCR log page that
the initiator can read to check which CCR operation completed.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/target/core.c  | 25 ++++++++++++++++++++++---
 drivers/nvme/target/nvmet.h |  3 ++-
 include/linux/nvme.h        |  3 +++
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 54dd0dcfa12b..ae2fe9f90bcd 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
 	nvmet_async_events_process(ctrl);
 }
 
-void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
 		u8 event_info, u8 log_page)
 {
 	struct nvmet_async_event *aen;
@@ -215,13 +215,19 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 	aen->event_info = event_info;
 	aen->log_page = log_page;
 
-	mutex_lock(&ctrl->lock);
 	list_add_tail(&aen->entry, &ctrl->async_events);
-	mutex_unlock(&ctrl->lock);
 
 	queue_work(nvmet_wq, &ctrl->async_event_work);
 }
 
+void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
+		u8 event_info, u8 log_page)
+{
+	mutex_lock(&ctrl->lock);
+	nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
+	mutex_unlock(&ctrl->lock);
+}
+
 static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
 {
 	u32 i;
@@ -1788,6 +1794,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
 }
 EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
 
+static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
+{
+	lockdep_assert_held(&ctrl->lock);
+
+	if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
+		return;
+
+	nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
+				     NVME_AER_NOTICE_CCR_COMPLETED,
+				     NVME_LOG_CCR);
+}
+
 static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
 {
 	struct nvmet_subsys *subsys = ctrl->subsys;
@@ -1803,6 +1821,7 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
 		list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
 			if (ccr->ctrl == ctrl) {
 				ccr->ctrl = NULL;
+				nvmet_ctrl_notify_ccr(sctrl);
 				break;
 			}
 		}
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 93d6ac41cf85..00528feeb3cd 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -44,7 +44,8 @@
  * Supported optional AENs:
  */
 #define NVMET_AEN_CFG_OPTIONAL \
-	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE)
+	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE | \
+	 NVME_AEN_CFG_CCR_COMPLETE)
 #define NVMET_DISC_AEN_CFG_OPTIONAL \
 	(NVME_AEN_CFG_DISC_CHANGE)
 
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 3e189674b69e..f6d66dadc5b1 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -863,12 +863,14 @@ enum {
 	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 	NVME_AER_NOTICE_ANA		= 0x03,
 	NVME_AER_NOTICE_DISC_CHANGED	= 0xf0,
+	NVME_AER_NOTICE_CCR_COMPLETED	= 0xf4,
 };
 
 enum {
 	NVME_AEN_BIT_NS_ATTR		= 8,
 	NVME_AEN_BIT_FW_ACT		= 9,
 	NVME_AEN_BIT_ANA_CHANGE		= 11,
+	NVME_AEN_BIT_CCR_COMPLETE	= 20,
 	NVME_AEN_BIT_DISC_CHANGE	= 31,
 };
 
@@ -876,6 +878,7 @@ enum {
 	NVME_AEN_CFG_NS_ATTR		= 1 << NVME_AEN_BIT_NS_ATTR,
 	NVME_AEN_CFG_FW_ACT		= 1 << NVME_AEN_BIT_FW_ACT,
 	NVME_AEN_CFG_ANA_CHANGE		= 1 << NVME_AEN_BIT_ANA_CHANGE,
+	NVME_AEN_CFG_CCR_COMPLETE	= 1 << NVME_AEN_BIT_CCR_COMPLETE,
 	NVME_AEN_CFG_DISC_CHANGE	= 1 << NVME_AEN_BIT_DISC_CHANGE,
 };
 
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (4 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  3:28   ` Hannes Reinecke
                     ` (2 more replies)
  2026-01-30 22:34 ` [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

TP8028 Rapid Path Failure Recovery added new fields to the controller
identify response. Read CIU (Controller Instance Uniquifier), CIRN
(Controller Instance Random Number), and CCRL (Cross-Controller Reset
Limit) from the identify response. Expose CIU and CIRN as sysfs
attributes so the values can be used directly by the user if needed.

TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
Time) which is used along with KATO (Keep Alive Timeout) to set an upper
limit for attempting Cross-Controller Recovery.
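
The KATO/CQT upper bound can be sketched as a small user-space helper
mirroring nvme_fence_timeout_ms() from this patch (kato is in seconds,
cqt in milliseconds; exact semantics are per TP4129):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the fence timeout computation: an upper bound, in
 * milliseconds, for attempting Cross-Controller Recovery. With
 * Traffic-Based Keep Alive (TBKAS) one extra KATO interval is
 * allowed for the target to notice the host has gone quiet. */
static unsigned long fence_timeout_ms(unsigned int kato_sec,
				      unsigned int cqt_ms, bool tbkas)
{
	return (tbkas ? 3UL : 2UL) * kato_sec * 1000 + cqt_ms;
}
```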

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/core.c  |  5 +++++
 drivers/nvme/host/nvme.h  | 11 +++++++++++
 drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
 3 files changed, 39 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7bf228df6001..8961d612ccb0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3571,6 +3571,11 @@ static int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->crdt[1] = le16_to_cpu(id->crdt2);
 	ctrl->crdt[2] = le16_to_cpu(id->crdt3);
 
+	ctrl->ciu = id->ciu;
+	ctrl->cirn = le64_to_cpu(id->cirn);
+	atomic_set(&ctrl->ccr_limit, id->ccrl);
+	ctrl->cqt = le16_to_cpu(id->cqt);
+
 	ctrl->oacs = le16_to_cpu(id->oacs);
 	ctrl->oncs = le16_to_cpu(id->oncs);
 	ctrl->mtfa = le16_to_cpu(id->mtfa);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9a5f28c5103c..9dd9f179ad88 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -326,13 +326,17 @@ struct nvme_ctrl {
 	u32 max_zone_append;
 #endif
 	u16 crdt[3];
+	u16 cqt;
 	u16 oncs;
 	u8 dmrl;
+	u8 ciu;
 	u32 dmrsl;
+	u64 cirn;
 	u16 oacs;
 	u16 sqsize;
 	u32 max_namespaces;
 	atomic_t abort_limit;
+	atomic_t ccr_limit;
 	u8 vwc;
 	u32 vs;
 	u32 sgls;
@@ -1225,4 +1229,11 @@ static inline bool nvme_multi_css(struct nvme_ctrl *ctrl)
 	return (ctrl->ctrl_config & NVME_CC_CSS_MASK) == NVME_CC_CSS_CSI;
 }
 
+static inline unsigned long nvme_fence_timeout_ms(struct nvme_ctrl *ctrl)
+{
+	if (ctrl->ctratt & NVME_CTRL_ATTR_TBKAS)
+		return 3 * ctrl->kato * 1000 + ctrl->cqt;
+	return 2 * ctrl->kato * 1000 + ctrl->cqt;
+}
+
 #endif /* _NVME_H */
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..f81bbb6ec768 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -388,6 +388,27 @@ nvme_show_int_function(queue_count);
 nvme_show_int_function(sqsize);
 nvme_show_int_function(kato);
 
+static ssize_t nvme_sysfs_ciu_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%02x\n", ctrl->ciu);
+}
+static DEVICE_ATTR(ciu, S_IRUGO, nvme_sysfs_ciu_show, NULL);
+
+static ssize_t nvme_sysfs_cirn_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%016llx\n", ctrl->cirn);
+}
+static DEVICE_ATTR(cirn, S_IRUGO, nvme_sysfs_cirn_show, NULL);
+
+
 static ssize_t nvme_sysfs_delete(struct device *dev,
 				struct device_attribute *attr, const char *buf,
 				size_t count)
@@ -734,6 +755,8 @@ static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_numa_node.attr,
 	&dev_attr_queue_count.attr,
 	&dev_attr_sqsize.attr,
+	&dev_attr_ciu.attr,
+	&dev_attr_cirn.attr,
 	&dev_attr_hostnqn.attr,
 	&dev_attr_hostid.attr,
 	&dev_attr_ctrl_loss_tmo.attr,
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (5 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:07   ` Hannes Reinecke
  2026-01-30 22:34 ` [PATCH v2 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

FENCING is a new controller state that a LIVE controller enters when an
error is encountered. While in the FENCING state, inflight IOs that time
out are not canceled because they must be held until either CCR succeeds
or time-based recovery completes. Although the queues remain alive, no
requests may be sent in this state and the controller cannot be reset or
deleted. This is intentional because resetting or deleting the
controller would cancel inflight IOs.

FENCED is a short-lived state the controller enters before it is reset.
It exists only to prevent manual resets from happening while the
controller is in the FENCING state.
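
The new transitions can be sketched as a standalone predicate (a
simplified model of the nvme_change_ctrl_state() cases touched by this
patch; it covers only the states involved here):

```c
#include <assert.h>
#include <stdbool.h>

enum ctrl_state { ST_NEW, ST_LIVE, ST_FENCING, ST_FENCED, ST_RESETTING };

/* Only LIVE may enter FENCING, only FENCING may enter FENCED, and
 * FENCED joins NEW and LIVE as a valid predecessor of RESETTING. */
static bool transition_ok(enum ctrl_state old, enum ctrl_state new)
{
	switch (new) {
	case ST_FENCING:
		return old == ST_LIVE;
	case ST_FENCED:
		return old == ST_FENCING;
	case ST_RESETTING:
		return old == ST_NEW || old == ST_LIVE || old == ST_FENCED;
	default:
		return false;
	}
}
```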

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/core.c  | 25 +++++++++++++++++++++++--
 drivers/nvme/host/nvme.h  |  4 ++++
 drivers/nvme/host/sysfs.c |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8961d612ccb0..3e1e02822dd4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -574,10 +574,29 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 			break;
 		}
 		break;
+	case NVME_CTRL_FENCING:
+		switch (old_state) {
+		case NVME_CTRL_LIVE:
+			changed = true;
+			fallthrough;
+		default:
+			break;
+		}
+		break;
+	case NVME_CTRL_FENCED:
+		switch (old_state) {
+		case NVME_CTRL_FENCING:
+			changed = true;
+			fallthrough;
+		default:
+			break;
+		}
+		break;
 	case NVME_CTRL_RESETTING:
 		switch (old_state) {
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_LIVE:
+		case NVME_CTRL_FENCED:
 			changed = true;
 			fallthrough;
 		default:
@@ -760,6 +779,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
 
 	if (state != NVME_CTRL_DELETING_NOIO &&
 	    state != NVME_CTRL_DELETING &&
+	    state != NVME_CTRL_FENCING &&
 	    state != NVME_CTRL_DEAD &&
 	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
 	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
@@ -802,10 +822,11 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
 			     req->cmd->fabrics.fctype == nvme_fabrics_type_auth_receive))
 				return true;
 			break;
-		default:
-			break;
+		case NVME_CTRL_FENCING:
 		case NVME_CTRL_DEAD:
 			return false;
+		default:
+			break;
 		}
 	}
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9dd9f179ad88..00866bbc66f3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -251,6 +251,8 @@ static inline u16 nvme_req_qid(struct request *req)
 enum nvme_ctrl_state {
 	NVME_CTRL_NEW,
 	NVME_CTRL_LIVE,
+	NVME_CTRL_FENCING,
+	NVME_CTRL_FENCED,
 	NVME_CTRL_RESETTING,
 	NVME_CTRL_CONNECTING,
 	NVME_CTRL_DELETING,
@@ -777,6 +779,8 @@ static inline bool nvme_state_terminal(struct nvme_ctrl *ctrl)
 	switch (nvme_ctrl_state(ctrl)) {
 	case NVME_CTRL_NEW:
 	case NVME_CTRL_LIVE:
+	case NVME_CTRL_FENCING:
+	case NVME_CTRL_FENCED:
 	case NVME_CTRL_RESETTING:
 	case NVME_CTRL_CONNECTING:
 		return false;
diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index f81bbb6ec768..4ec9dfeb736e 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -443,6 +443,8 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
 	static const char *const state_name[] = {
 		[NVME_CTRL_NEW]		= "new",
 		[NVME_CTRL_LIVE]	= "live",
+		[NVME_CTRL_FENCING]	= "fencing",
+		[NVME_CTRL_FENCED]	= "fenced",
 		[NVME_CTRL_RESETTING]	= "resetting",
 		[NVME_CTRL_CONNECTING]	= "connecting",
 		[NVME_CTRL_DELETING]	= "deleting",
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (6 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:19   ` Hannes Reinecke
  2026-02-10 22:09   ` James Smart
  2026-01-30 22:34 ` [PATCH v2 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

A host that has more than one path connecting to an nvme subsystem
typically has an nvme controller associated with every path. This is
mostly applicable to nvmeof. If one path goes down, inflight IOs on that
path should not be retried immediately on another path because this
could lead to data corruption as described in TP4129. TP8028 defines a
cross-controller reset mechanism that the host can use to terminate IOs
on the failed path using one of the remaining healthy paths. Only after
the IOs are terminated, or after enough time has passed as defined by
TP4129, may inflight IOs be retried on another path. Implement the core
cross-controller reset logic to be shared by the transports.
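
The path-selection policy can be sketched in user space. Here
pick_and_fence(), try_ccr(), and the sample data are hypothetical
stand-ins for the nvme_fence_ctrl() loop:

```c
#include <assert.h>
#include <stddef.h>

/* Candidate source controllers are tried in ascending cntlid order;
 * after a failed attempt the search resumes past the failed controller
 * so the same path is not retried. Returns the cntlid that fenced the
 * impacted controller, or -1 when no path worked (the caller then
 * falls back to time-based recovery). */
static int pick_and_fence(const unsigned short *cntlids, size_t n,
			  int (*try_ccr)(unsigned short cntlid))
{
	unsigned short min_cntlid = 0;
	size_t i;

retry:
	for (i = 0; i < n; i++) {
		if (cntlids[i] < min_cntlid)
			continue;
		if (try_ccr(cntlids[i]) == 0)
			return cntlids[i];	/* CCR succeeded */
		min_cntlid = cntlids[i] + 1;	/* skip this path */
		goto retry;
	}
	return -1;
}

/* Sample failure patterns for demonstration only. */
static int sample_try(unsigned short cntlid) { return cntlid == 1 ? -1 : 0; }
static int sample_down(unsigned short cntlid) { (void)cntlid; return -1; }
static const unsigned short sample_ids[] = { 1, 2, 3 };
```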

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/constants.c |   1 +
 drivers/nvme/host/core.c      | 129 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |   9 +++
 3 files changed, 139 insertions(+)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index dc90df9e13a2..f679efd5110e 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
 	[nvme_admin_virtual_mgmt] = "Virtual Management",
 	[nvme_admin_nvme_mi_send] = "NVMe Send MI",
 	[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
+	[nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
 	[nvme_admin_dbbuf] = "Doorbell Buffer Config",
 	[nvme_admin_format_nvm] = "Format NVM",
 	[nvme_admin_security_send] = "Security Send",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3e1e02822dd4..13e0775d56b4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -554,6 +554,134 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
 
+static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
+					    u32 min_cntlid)
+{
+	struct nvme_subsystem *subsys = ictrl->subsys;
+	struct nvme_ctrl *sctrl;
+	unsigned long flags;
+
+	mutex_lock(&nvme_subsystems_lock);
+	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
+		if (sctrl->cntlid < min_cntlid)
+			continue;
+
+		if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
+			continue;
+
+		spin_lock_irqsave(&sctrl->lock, flags);
+		if (sctrl->state != NVME_CTRL_LIVE) {
+			spin_unlock_irqrestore(&sctrl->lock, flags);
+			atomic_inc(&sctrl->ccr_limit);
+			continue;
+		}
+
+		/*
+		 * We got a good candidate source controller that is locked and
+		 * LIVE. However, no guarantee sctrl will not be deleted after
+		 * sctrl->lock is released. Get a ref of both sctrl and admin_q
+		 * so they do not disappear until we are done with them.
+		 */
+		WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
+		nvme_get_ctrl(sctrl);
+		spin_unlock_irqrestore(&sctrl->lock, flags);
+		goto found;
+	}
+	sctrl = NULL;
+found:
+	mutex_unlock(&nvme_subsystems_lock);
+	return sctrl;
+}
+
+static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
+{
+	atomic_inc(&sctrl->ccr_limit);
+	blk_put_queue(sctrl->admin_q);
+	nvme_put_ctrl(sctrl);
+}
+
+static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
+{
+	struct nvme_ccr_entry ccr = { };
+	union nvme_result res = { 0 };
+	struct nvme_command c = { };
+	unsigned long flags, tmo;
+	int ret = 0;
+	u32 result;
+
+	init_completion(&ccr.complete);
+	ccr.ictrl = ictrl;
+
+	spin_lock_irqsave(&sctrl->lock, flags);
+	list_add_tail(&ccr.list, &sctrl->ccr_list);
+	spin_unlock_irqrestore(&sctrl->lock, flags);
+
+	c.ccr.opcode = nvme_admin_cross_ctrl_reset;
+	c.ccr.ciu = ictrl->ciu;
+	c.ccr.icid = cpu_to_le16(ictrl->cntlid);
+	c.ccr.cirn = cpu_to_le64(ictrl->cirn);
+	ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
+				     NULL, 0, NVME_QID_ANY, 0);
+	if (ret)
+		goto out;
+
+	result = le32_to_cpu(res.u32);
+	if (result & 0x01) /* Immediate Reset Successful */
+		goto out;
+
+	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
+	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
+		ret = -ETIMEDOUT;
+		goto out;
+	}
+
+	if (ccr.ccrs != NVME_CCR_STATUS_SUCCESS)
+		ret = -EREMOTEIO;
+out:
+	spin_lock_irqsave(&sctrl->lock, flags);
+	list_del(&ccr.list);
+	spin_unlock_irqrestore(&sctrl->lock, flags);
+	return ret;
+}
+
+unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
+{
+	unsigned long deadline, now, timeout;
+	struct nvme_ctrl *sctrl;
+	u32 min_cntlid = 0;
+	int ret;
+
+	timeout = nvme_fence_timeout_ms(ictrl);
+	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
+
+	now = jiffies;
+	deadline = now + msecs_to_jiffies(timeout);
+	while (time_before(now, deadline)) {
+		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
+		if (!sctrl) {
+			/* CCR failed, switch to time-based recovery */
+			return deadline - now;
+		}
+
+		ret = nvme_issue_wait_ccr(sctrl, ictrl);
+		if (!ret) {
+			dev_info(ictrl->device, "CCR succeeded using %s\n",
+				 dev_name(sctrl->device));
+			nvme_put_ctrl_ccr(sctrl);
+			return 0;
+		}
+
+		/* CCR failed, try another path */
+		min_cntlid = sctrl->cntlid + 1;
+		nvme_put_ctrl_ccr(sctrl);
+		now = jiffies;
+	}
+
+	dev_info(ictrl->device, "CCR reached timeout, call it done\n");
+	return 0;
+}
+EXPORT_SYMBOL_GPL(nvme_fence_ctrl);
+
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state)
 {
@@ -5119,6 +5247,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 
 	mutex_init(&ctrl->scan_lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
+	INIT_LIST_HEAD(&ctrl->ccr_list);
 	xa_init(&ctrl->cels);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 00866bbc66f3..fa18f580d76a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -279,6 +279,13 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_FROZEN		= 6,
 };
 
+struct nvme_ccr_entry {
+	struct list_head list;
+	struct completion complete;
+	struct nvme_ctrl *ictrl;
+	u8 ccrs;
+};
+
 struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
@@ -296,6 +303,7 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
+	struct list_head ccr_list;
 	struct mutex namespaces_lock;
 	struct srcu_struct srcu;
 	struct device ctrl_device;
@@ -814,6 +822,7 @@ blk_status_t nvme_host_path_error(struct request *req);
 bool nvme_cancel_request(struct request *req, void *data);
 void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
 void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
+unsigned long nvme_fence_ctrl(struct nvme_ctrl *ctrl);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 09/14] nvme: Implement cross-controller reset completion
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (7 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:22   ` Hannes Reinecke
  2026-01-30 22:34 ` [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

An nvme source controller that issues a CCR command expects to receive
an NVME_AER_NOTICE_CCR_COMPLETED notice when the pending CCR succeeds or
fails. Add sctrl->ccr_work to read the NVME_LOG_CCR log page and wake up
any thread waiting on CCR completion.
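
A log entry wakes a waiter only when the impacted controller matches on
both fields of the identifying tuple; a hypothetical model of the
matching done in nvme_ccr_work():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A completed CCR log entry wakes only waiters whose impacted
 * controller matches on both ICID (the 16-bit controller ID) and CIU
 * (the instance uniquifier that disambiguates reuse of the same
 * CNTLID). */
struct ccr_id {
	uint16_t icid;
	uint8_t  ciu;
};

static bool ccr_matches(struct ccr_id entry, struct ccr_id waiter)
{
	return entry.icid == waiter.icid && entry.ciu == waiter.ciu;
}
```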

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 13e0775d56b4..0f90feb46369 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1901,7 +1901,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
 
 #define NVME_AEN_SUPPORTED \
 	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
-	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
+	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
+	 NVME_AEN_CFG_DISC_CHANGE)
 
 static void nvme_enable_aen(struct nvme_ctrl *ctrl)
 {
@@ -4866,6 +4867,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
 	kfree(log);
 }
 
+static void nvme_ccr_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
+	struct nvme_ccr_entry *ccr;
+	struct nvme_ccr_log_entry *entry;
+	struct nvme_ccr_log *log;
+	unsigned long flags;
+	int ret, i;
+
+	log = kmalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return;
+
+	ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
+			   0x00, log, sizeof(*log), 0);
+	if (ret)
+		goto out;
+
+	spin_lock_irqsave(&ctrl->lock, flags);
+	for (i = 0; i < le16_to_cpu(log->ne); i++) {
+		entry = &log->entries[i];
+		if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
+			continue;
+
+		list_for_each_entry(ccr, &ctrl->ccr_list, list) {
+			struct nvme_ctrl *ictrl = ccr->ictrl;
+
+			if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
+			    ictrl->ciu != entry->ciu)
+				continue;
+
+			/* Complete matching entry */
+			ccr->ccrs = entry->ccrs;
+			complete(&ccr->complete);
+		}
+	}
+	spin_unlock_irqrestore(&ctrl->lock, flags);
+out:
+	kfree(log);
+}
+
 static void nvme_fw_act_work(struct work_struct *work)
 {
 	struct nvme_ctrl *ctrl = container_of(work,
@@ -4942,6 +4984,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 	case NVME_AER_NOTICE_DISC_CHANGED:
 		ctrl->aen_result = result;
 		break;
+	case NVME_AER_NOTICE_CCR_COMPLETED:
+		queue_work(nvme_wq, &ctrl->ccr_work);
+		break;
 	default:
 		dev_warn(ctrl->device, "async event result %08x\n", result);
 	}
@@ -5131,6 +5176,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 	nvme_stop_failfast_work(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
+	cancel_work_sync(&ctrl->ccr_work);
 	if (ctrl->ops->stop_ctrl)
 		ctrl->ops->stop_ctrl(ctrl);
 }
@@ -5254,6 +5300,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->quirks = quirks;
 	ctrl->numa_node = NUMA_NO_NODE;
 	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
+	INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
 	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
 	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
 	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fa18f580d76a..a7f382e35821 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -366,6 +366,7 @@ struct nvme_ctrl {
 	struct nvme_effects_log *effects;
 	struct xarray cels;
 	struct work_struct scan_work;
+	struct work_struct ccr_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
 	struct delayed_work failfast_work;
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (8 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:34   ` Hannes Reinecke
  2026-01-30 22:34 ` [PATCH v2 11/14] nvme-rdma: " Mohamed Khalfella
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

An alive nvme controller that hits an error now moves to the FENCING
state instead of the RESETTING state. ctrl->fencing_work attempts CCR to
terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
and continue error recovery as usual. If CCR fails, the behavior depends
on whether the subsystem supports CQT. If CQT is not supported, reset
the controller immediately as if CCR had succeeded in order to maintain
the current behavior. If CQT is supported, switch to time-based recovery
and schedule ctrl->fenced_work to reset the controller when time-based
recovery finishes.

Either ctrl->err_work or ctrl->reset_work can run after a controller is
fenced. Flush the fencing work when either of them runs.
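
The post-CCR decision can be modeled as a small pure function;
fence_decision() is a hypothetical stand-in for the branch in
nvme_tcp_fencing_work():

```c
#include <assert.h>

enum fence_next { FENCE_RESET_NOW, FENCE_TIME_BASED };

/* rem_jiffies is the remaining time returned by nvme_fence_ctrl()
 * (zero means CCR already terminated the IOs); cqt_ms is the
 * controller's Command Quiesce Time (zero means CQT is not
 * supported, so the old immediate-reset behavior is kept). */
static enum fence_next fence_decision(unsigned long rem_jiffies,
				      unsigned int cqt_ms)
{
	if (!rem_jiffies || !cqt_ms)
		return FENCE_RESET_NOW;		/* reset immediately */
	return FENCE_TIME_BASED;		/* wait out the window */
}
```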

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69cb04406b47..af8d3b36a4bb 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
 	struct sockaddr_storage src_addr;
 	struct nvme_ctrl	ctrl;
 
+	struct work_struct	fencing_work;
+	struct delayed_work	fenced_work;
 	struct work_struct	err_work;
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
@@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
 
 static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 {
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
+		dev_warn(ctrl->device, "starting controller fencing\n");
+		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
+		return;
+	}
+
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return;
 
@@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
 	nvme_tcp_reconnect_or_remove(ctrl, ret);
 }
 
+static void nvme_tcp_fenced_work(struct work_struct *work)
+{
+	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
+					struct nvme_tcp_ctrl, fenced_work);
+	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
+}
+
+static void nvme_tcp_fencing_work(struct work_struct *work)
+{
+	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
+			struct nvme_tcp_ctrl, fencing_work);
+	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+	unsigned long rem;
+
+	rem = nvme_fence_ctrl(ctrl);
+	if (!rem)
+		goto done;
+
+	if (!ctrl->cqt) {
+		dev_info(ctrl->device,
+			 "CCR failed, CQT not supported, skip time-based recovery\n");
+		goto done;
+	}
+
+	dev_info(ctrl->device,
+		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
+		 jiffies_to_msecs(rem));
+	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
+	return;
+
+done:
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
+}
+
+static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
+{
+	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
+	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
+}
+
 static void nvme_tcp_error_recovery_work(struct work_struct *work)
 {
 	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
 				struct nvme_tcp_ctrl, err_work);
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
 
+	nvme_tcp_flush_fencing_work(ctrl);
 	if (nvme_tcp_key_revoke_needed(ctrl))
 		nvme_auth_revoke_tls_key(ctrl);
 	nvme_stop_keep_alive(ctrl);
@@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 		container_of(work, struct nvme_ctrl, reset_work);
 	int ret;
 
+	nvme_tcp_flush_fencing_work(ctrl);
 	if (nvme_tcp_key_revoke_needed(ctrl))
 		nvme_auth_revoke_tls_key(ctrl);
 	nvme_stop_ctrl(ctrl);
@@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
 	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
 	struct nvme_command *cmd = &pdu->cmd;
 	int qid = nvme_tcp_queue_id(req->queue);
+	enum nvme_ctrl_state state;
 
 	dev_warn(ctrl->device,
 		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
 		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
 		 nvme_fabrics_opcode_str(qid, cmd), qid);
 
-	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
+	state = nvme_ctrl_state(ctrl);
+	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
 		/*
 		 * If we are resetting, connecting or deleting we should
 		 * complete immediately because we may block controller
@@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
 
 	INIT_DELAYED_WORK(&ctrl->connect_work,
 			nvme_tcp_reconnect_ctrl_work);
+	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
+	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
 	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
 
-- 
2.52.0



^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v2 11/14] nvme-rdma: Use CCR to recover controller that hits an error
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (9 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:35   ` Hannes Reinecke
  2026-01-30 22:34 ` [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

A live nvme controller that hits an error now moves to FENCING state
instead of RESETTING state. ctrl->fencing_work attempts CCR to
terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
and continue error recovery as usual. If CCR fails, the behavior
depends on whether the subsystem supports CQT. If CQT is not supported,
reset the controller immediately as if CCR succeeded, in order to
maintain the current behavior. If CQT is supported, switch to
time-based recovery; the scheduled ctrl->fenced_work resets the
controller when time-based recovery finishes.

Either ctrl->err_work or ctrl->reset_work can run after a controller is
fenced. Flush the fencing work when either of them runs.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/rdma.c | 62 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)
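Not part of the patch: the recovery policy described in the commit
message can be sketched as a small userspace decision function. The
names here (fence_next_action, fence_action) are made up for
illustration; in the driver the inputs are the value returned by
nvme_fence_ctrl() (0 on CCR success, otherwise the remaining hold
time) and ctrl->cqt.

```c
#include <assert.h>

/*
 * Hypothetical model of the fencing_work decision above:
 * rem models the nvme_fence_ctrl() return value, cqt is non-zero
 * when the subsystem reported a Command Quiesce Time.
 */
enum fence_action {
	FENCE_RESET_NOW,	/* CCR worked, or no CQT: reset immediately */
	FENCE_WAIT_THEN_RESET,	/* CCR failed, CQT known: time-based recovery */
};

static enum fence_action fence_next_action(unsigned long rem, unsigned int cqt)
{
	if (rem == 0)		/* CCR terminated the inflight IOs */
		return FENCE_RESET_NOW;
	if (cqt == 0)		/* no CQT: keep the pre-TP8028 behavior */
		return FENCE_RESET_NOW;
	return FENCE_WAIT_THEN_RESET;	/* hold IOs for rem, then reset */
}
```

Only the third case schedules fenced_work; the other two transition
straight to FENCED and then RESETTING.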

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 35c0822edb2d..da45c9ea4f32 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -106,6 +106,8 @@ struct nvme_rdma_ctrl {
 
 	/* other member variables */
 	struct blk_mq_tag_set	tag_set;
+	struct work_struct	fencing_work;
+	struct delayed_work	fenced_work;
 	struct work_struct	err_work;
 
 	struct nvme_rdma_qe	async_event_sqe;
@@ -1120,11 +1122,58 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	nvme_rdma_reconnect_or_remove(ctrl, ret);
 }
 
+static void nvme_rdma_fenced_work(struct work_struct *work)
+{
+	struct nvme_rdma_ctrl *rdma_ctrl = container_of(to_delayed_work(work),
+					struct nvme_rdma_ctrl, fenced_work);
+	struct nvme_ctrl *ctrl = &rdma_ctrl->ctrl;
+
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
+}
+
+static void nvme_rdma_fencing_work(struct work_struct *work)
+{
+	struct nvme_rdma_ctrl *rdma_ctrl = container_of(work,
+			struct nvme_rdma_ctrl, fencing_work);
+	struct nvme_ctrl *ctrl = &rdma_ctrl->ctrl;
+	unsigned long rem;
+
+	rem = nvme_fence_ctrl(ctrl);
+	if (!rem)
+		goto done;
+
+	if (!ctrl->cqt) {
+		dev_info(ctrl->device,
+			 "CCR failed, CQT not supported, skip time-based recovery\n");
+		goto done;
+	}
+
+	dev_info(ctrl->device,
+		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
+		 jiffies_to_msecs(rem));
+	queue_delayed_work(nvme_wq, &rdma_ctrl->fenced_work, rem);
+	return;
+
+done:
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &rdma_ctrl->err_work);
+}
+
+static void nvme_rdma_flush_fencing_work(struct nvme_rdma_ctrl *ctrl)
+{
+	flush_work(&ctrl->fencing_work);
+	flush_delayed_work(&ctrl->fenced_work);
+}
+
 static void nvme_rdma_error_recovery_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);
 
+	nvme_rdma_flush_fencing_work(ctrl);
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	flush_work(&ctrl->ctrl.async_event_work);
 	nvme_rdma_teardown_io_queues(ctrl, false);
@@ -1147,6 +1196,12 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 
 static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
 {
+	if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
+		dev_warn(ctrl->ctrl.device, "starting controller fencing\n");
+		queue_work(nvme_wq, &ctrl->fencing_work);
+		return;
+	}
+
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
@@ -1957,13 +2012,15 @@ static enum blk_eh_timer_return nvme_rdma_timeout(struct request *rq)
 	struct nvme_rdma_ctrl *ctrl = queue->ctrl;
 	struct nvme_command *cmd = req->req.cmd;
 	int qid = nvme_rdma_queue_idx(queue);
+	enum nvme_ctrl_state state;
 
 	dev_warn(ctrl->ctrl.device,
 		 "I/O tag %d (%04x) opcode %#x (%s) QID %d timeout\n",
 		 rq->tag, nvme_cid(rq), cmd->common.opcode,
 		 nvme_fabrics_opcode_str(qid, cmd), qid);
 
-	if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_LIVE) {
+	state = nvme_ctrl_state(&ctrl->ctrl);
+	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
 		/*
 		 * If we are resetting, connecting or deleting we should
 		 * complete immediately because we may block controller
@@ -2169,6 +2226,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
 	int ret;
 
+	nvme_rdma_flush_fencing_work(ctrl);
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
 
@@ -2281,6 +2339,8 @@ static struct nvme_rdma_ctrl *nvme_rdma_alloc_ctrl(struct device *dev,
 
 	INIT_DELAYED_WORK(&ctrl->reconnect_work,
 			nvme_rdma_reconnect_ctrl_work);
+	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_rdma_fenced_work);
+	INIT_WORK(&ctrl->fencing_work, nvme_rdma_fencing_work);
 	INIT_WORK(&ctrl->err_work, nvme_rdma_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_rdma_reset_ctrl_work);
 
-- 
2.52.0




* [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (10 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 11/14] nvme-rdma: " Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:40   ` Hannes Reinecke
  2026-02-03 19:19   ` James Smart
  2026-01-30 22:34 ` [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
  2026-01-30 22:34 ` [PATCH v2 14/14] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
  13 siblings, 2 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

nvme_fc_error_recovery() called from nvme_fc_timeout() while the
controller is in CONNECTING state results in the deadlock reported in
the link below. Update nvme_fc_timeout() to schedule error recovery
instead, avoiding the deadlock.

Prior to this change, error recovery reset the controller if it was
LIVE, which does not match nvme-tcp and nvme-rdma. Decouple error
recovery from controller reset to match the other fabric transports.

Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
 1 file changed, 41 insertions(+), 53 deletions(-)
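Not part of the patch: a userspace sketch of the guard that makes this
safe. The type and helper names (model_ctrl, start_ioerr_recovery) are
made up for illustration; the point is that the timeout handler only
performs a state transition and queues asynchronous work, so only the
first error path wins and nothing recovers synchronously from timeout
context.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of nvme_fc_start_ioerr_recovery(): mimic
 * nvme_change_ctrl_state() by allowing RESETTING only from LIVE in
 * this simplified model, and stand in for queue_work() with a counter.
 */
enum ctrl_state { CTRL_LIVE, CTRL_CONNECTING, CTRL_RESETTING };

struct model_ctrl {
	enum ctrl_state state;
	int ioerr_work_queued;	/* stands in for queue_work(nvme_reset_wq, ...) */
};

static bool start_ioerr_recovery(struct model_ctrl *ctrl)
{
	if (ctrl->state != CTRL_LIVE)
		return false;	/* recovery already in progress: no-op */
	ctrl->state = CTRL_RESETTING;
	ctrl->ioerr_work_queued++;	/* actual teardown runs later, async */
	return true;
}
```

A second error (e.g. another timeout firing) sees the controller
already in RESETTING and does nothing, so the work item is queued at
most once.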

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 6948de3f438a..f8f6071b78ed 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
 static struct device *fc_udev_device;
 
 static void nvme_fc_complete_rq(struct request *rq);
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+					 char *errmsg);
 
 /* *********************** FC-NVME Port Management ************************ */
 
@@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 		"Reconnect", ctrl->cnum);
 
 	set_bit(ASSOC_FAILED, &ctrl->flags);
-	nvme_reset_ctrl(&ctrl->ctrl);
+	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
 }
 
 /**
@@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
 static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
 
-static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
+static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
 
 static void
 __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
@@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
 	 * for the association have been ABTS'd by
 	 * nvme_fc_delete_association().
 	 */
-
-	/* fail the association */
-	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
+	nvme_fc_start_ioerr_recovery(ctrl,
+				     "Disconnect Association LS received");
 
 	/* release the reference taken by nvme_fc_match_disconn_ls() */
 	nvme_fc_ctrl_put(ctrl);
@@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
 	struct nvme_fc_ctrl *ctrl =
 			container_of(work, struct nvme_fc_ctrl, ioerr_work);
 
-	nvme_fc_error_recovery(ctrl, "transport detected io error");
+	nvme_fc_error_recovery(ctrl);
 }
 
 /*
@@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
 }
 EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
 
+static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
+					 char *errmsg)
+{
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+		return;
+
+	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
+		 ctrl->cnum, errmsg);
+	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+}
+
 static void
 nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 {
@@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 		nvme_fc_complete_rq(rq);
 
 check_error:
-	if (terminate_assoc &&
-	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
-		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
+	if (terminate_assoc)
+		nvme_fc_start_ioerr_recovery(ctrl, "io error");
 }
 
 static int
@@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
 		nvme_unquiesce_admin_queue(&ctrl->ctrl);
 }
 
-static void
-nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
-{
-	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
-
-	/*
-	 * if an error (io timeout, etc) while (re)connecting, the remote
-	 * port requested terminating of the association (disconnect_ls)
-	 * or an error (timeout or abort) occurred on an io while creating
-	 * the controller.  Abort any ios on the association and let the
-	 * create_association error path resolve things.
-	 */
-	if (state == NVME_CTRL_CONNECTING) {
-		__nvme_fc_abort_outstanding_ios(ctrl, true);
-		dev_warn(ctrl->ctrl.device,
-			"NVME-FC{%d}: transport error during (re)connect\n",
-			ctrl->cnum);
-		return;
-	}
-
-	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
-	if (state != NVME_CTRL_LIVE)
-		return;
-
-	dev_warn(ctrl->ctrl.device,
-		"NVME-FC{%d}: transport association event: %s\n",
-		ctrl->cnum, errmsg);
-	dev_warn(ctrl->ctrl.device,
-		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
-
-	nvme_reset_ctrl(&ctrl->ctrl);
-}
-
 static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
 {
 	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
@@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
 	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
 	struct nvme_command *sqe = &cmdiu->sqe;
 
-	/*
-	 * Attempt to abort the offending command. Command completion
-	 * will detect the aborted io and will fail the connection.
-	 */
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
 		"x%08x/x%08x\n",
 		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
 		nvme_fabrics_opcode_str(qnum, sqe),
 		sqe->common.cdw10, sqe->common.cdw11);
-	if (__nvme_fc_abort_op(ctrl, op))
-		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
 
-	/*
-	 * the io abort has been initiated. Have the reset timer
-	 * restarted and the abort completion will complete the io
-	 * shortly. Avoids a synchronous wait while the abort finishes.
-	 */
+	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
 	return BLK_EH_RESET_TIMER;
 }
 
@@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
 	}
 }
 
+static void
+nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
+{
+	nvme_stop_keep_alive(&ctrl->ctrl);
+	nvme_stop_ctrl(&ctrl->ctrl);
+
+	/* will block while waiting for io to terminate */
+	nvme_fc_delete_association(ctrl);
+
+	/* Do not reconnect if controller is being deleted */
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
+		return;
+
+	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
+		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
+		return;
+	}
+
+	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
+}
 
 static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
 	.name			= "fc",
-- 
2.52.0




* [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (11 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  2026-02-03  5:43   ` Hannes Reinecke
  2026-02-10 22:12   ` James Smart
  2026-01-30 22:34 ` [PATCH v2 14/14] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella
  13 siblings, 2 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

A live nvme controller that hits an error now moves to FENCING state
instead of RESETTING state. ctrl->fencing_work attempts CCR to
terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
and continue error recovery as usual. If CCR fails, the behavior
depends on whether the subsystem supports CQT. If CQT is not supported,
reset the controller immediately as if CCR succeeded, in order to
maintain the current behavior. If CQT is supported, switch to
time-based recovery; the scheduled ctrl->fenced_work resets the
controller when time-based recovery finishes.

Either ctrl->err_work or ctrl->reset_work can run after a controller is
fenced. Flush the fencing work when either of them runs.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/fc.c | 60 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index f8f6071b78ed..3a01aeb39081 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -166,6 +166,8 @@ struct nvme_fc_ctrl {
 	struct blk_mq_tag_set	admin_tag_set;
 	struct blk_mq_tag_set	tag_set;
 
+	struct work_struct	fencing_work;
+	struct delayed_work	fenced_work;
 	struct work_struct	ioerr_work;
 	struct delayed_work	connect_work;
 
@@ -1866,12 +1868,59 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
 	}
 }
 
+static void nvme_fc_fenced_work(struct work_struct *work)
+{
+	struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
+			struct nvme_fc_ctrl, fenced_work);
+	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
+
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
+}
+
+static void nvme_fc_fencing_work(struct work_struct *work)
+{
+	struct nvme_fc_ctrl *fc_ctrl =
+			container_of(work, struct nvme_fc_ctrl, fencing_work);
+	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
+	unsigned long rem;
+
+	rem = nvme_fence_ctrl(ctrl);
+	if (!rem)
+		goto done;
+
+	if (!ctrl->cqt) {
+		dev_info(ctrl->device,
+			 "CCR failed, CQT not supported, skip time-based recovery\n");
+		goto done;
+	}
+
+	dev_info(ctrl->device,
+		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
+		 jiffies_to_msecs(rem));
+	queue_delayed_work(nvme_wq, &fc_ctrl->fenced_work, rem);
+	return;
+
+done:
+	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
+	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
+}
+
+static void nvme_fc_flush_fencing_work(struct nvme_fc_ctrl *ctrl)
+{
+	flush_work(&ctrl->fencing_work);
+	flush_delayed_work(&ctrl->fenced_work);
+}
+
 static void
 nvme_fc_ctrl_ioerr_work(struct work_struct *work)
 {
 	struct nvme_fc_ctrl *ctrl =
 			container_of(work, struct nvme_fc_ctrl, ioerr_work);
 
+	nvme_fc_flush_fencing_work(ctrl);
 	nvme_fc_error_recovery(ctrl);
 }
 
@@ -1896,6 +1945,14 @@ EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
 static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
 					 char *errmsg)
 {
+	if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
+		dev_warn(ctrl->ctrl.device,
+			 "NVME-FC{%d}: starting controller fencing %s\n",
+			 ctrl->cnum, errmsg);
+		queue_work(nvme_wq, &ctrl->fencing_work);
+		return;
+	}
+
 	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
 		return;
 
@@ -3297,6 +3354,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
 	struct nvme_fc_ctrl *ctrl =
 		container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
 
+	nvme_fc_flush_fencing_work(ctrl);
 	nvme_stop_ctrl(&ctrl->ctrl);
 
 	/* will block will waiting for io to terminate */
@@ -3471,6 +3529,8 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
 	INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
+	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_fc_fenced_work);
+	INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
 	INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
 	spin_lock_init(&ctrl->lock);
 
-- 
2.52.0




* [PATCH v2 14/14] nvme-fc: Hold inflight requests while in FENCING state
  2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
                   ` (12 preceding siblings ...)
  2026-01-30 22:34 ` [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-01-30 22:34 ` Mohamed Khalfella
  13 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-01-30 22:34 UTC (permalink / raw)
  To: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, Mohamed Khalfella

While in FENCING state, aborted inflight IOs should be held until
fencing is done. Update nvme_fc_fcpio_done() to not complete aborted
requests or requests that hit transport errors. These held requests are
canceled in nvme_fc_delete_association() after fencing is done.
nvme_fc_fcpio_done() avoids racing with the cancellation of aborted
requests by making sure successful requests are completed before waking
up the waiting thread.

Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/nvme/host/fc.c | 61 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 10 deletions(-)
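Not part of the patch: a single-threaded userspace sketch of the
iocnt accounting this patch converts to atomics. The names (model_ctrl,
abort_op, op_done) are made up for illustration; the real driver uses
atomic_inc()/atomic_dec_return() on ctrl->iocnt and wake_up() on
ctrl->ioabort_wait.

```c
#include <assert.h>

/*
 * Hypothetical model: every aborted op bumps iocnt, and the completion
 * path wakes the waiter only when the final abort drops the count to
 * zero, so the waiter cannot start canceling requests early.
 */
struct model_ctrl {
	int iocnt;	/* atomic_t in the real driver */
	int woken;	/* stands in for wake_up(&ctrl->ioabort_wait) */
};

static void abort_op(struct model_ctrl *ctrl)
{
	ctrl->iocnt++;			/* atomic_inc(&ctrl->iocnt) */
}

static void op_done(struct model_ctrl *ctrl)
{
	/* atomic_dec_return(): wake only on the last completion */
	if (--ctrl->iocnt == 0)
		ctrl->woken = 1;
}
```

The wake-at-zero ordering is what lets nvme_fc_delete_association()
wait_event() on iocnt == 0 before calling nvme_cancel_tagset().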

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 3a01aeb39081..122ebd7085b1 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -173,7 +173,7 @@ struct nvme_fc_ctrl {
 
 	struct kref		ref;
 	unsigned long		flags;
-	u32			iocnt;
+	atomic_t		iocnt;
 	wait_queue_head_t	ioabort_wait;
 
 	struct nvme_fc_fcp_op	aen_ops[NVME_NR_AEN_COMMANDS];
@@ -1822,7 +1822,7 @@ __nvme_fc_abort_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_fcp_op *op)
 		atomic_set(&op->state, opstate);
 	else if (test_bit(FCCTRL_TERMIO, &ctrl->flags)) {
 		op->flags |= FCOP_FLAGS_TERMIO;
-		ctrl->iocnt++;
+		atomic_inc(&ctrl->iocnt);
 	}
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
@@ -1852,20 +1852,29 @@ nvme_fc_abort_aen_ops(struct nvme_fc_ctrl *ctrl)
 }
 
 static inline void
+__nvme_fc_fcpop_count_one_down(struct nvme_fc_ctrl *ctrl)
+{
+	if (atomic_dec_return(&ctrl->iocnt) == 0)
+		wake_up(&ctrl->ioabort_wait);
+}
+
+static inline bool
 __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
 		struct nvme_fc_fcp_op *op, int opstate)
 {
 	unsigned long flags;
+	bool ret = false;
 
 	if (opstate == FCPOP_STATE_ABORTED) {
 		spin_lock_irqsave(&ctrl->lock, flags);
 		if (test_bit(FCCTRL_TERMIO, &ctrl->flags) &&
 		    op->flags & FCOP_FLAGS_TERMIO) {
-			if (!--ctrl->iocnt)
-				wake_up(&ctrl->ioabort_wait);
+			ret = true;
 		}
 		spin_unlock_irqrestore(&ctrl->lock, flags);
 	}
+
+	return ret;
 }
 
 static void nvme_fc_fenced_work(struct work_struct *work)
@@ -1973,7 +1982,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 	struct nvme_command *sqe = &op->cmd_iu.sqe;
 	__le16 status = cpu_to_le16(NVME_SC_SUCCESS << 1);
 	union nvme_result result;
-	bool terminate_assoc = true;
+	bool op_term, terminate_assoc = true;
+	enum nvme_ctrl_state state;
 	int opstate;
 
 	/*
@@ -2106,16 +2116,38 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 done:
 	if (op->flags & FCOP_FLAGS_AEN) {
 		nvme_complete_async_event(&queue->ctrl->ctrl, status, &result);
-		__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+		if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+			__nvme_fc_fcpop_count_one_down(ctrl);
 		atomic_set(&op->state, FCPOP_STATE_IDLE);
 		op->flags = FCOP_FLAGS_AEN;	/* clear other flags */
 		nvme_fc_ctrl_put(ctrl);
 		goto check_error;
 	}
 
-	__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+	/*
+	 * We can not access op after the request is completed because it can
+	 * be reused immediately. At the same time we want to wakeup the thread
+	 * waiting for ongoing IOs _after_ requests are completed. This is
+	 * necessary because that thread will start canceling inflight IOs
+	 * and we want to avoid request completion racing with cancellation.
+	 */
+	op_term = __nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+
+	/*
+	 * If we are going to terminate associations and the controller is
+	 * LIVE or FENCING, then do not complete this request now. Let error
+	 * recovery cancel this request when it is safe to do so.
+	 */
+	state = nvme_ctrl_state(&ctrl->ctrl);
+	if (terminate_assoc &&
+	    (state == NVME_CTRL_LIVE || state == NVME_CTRL_FENCING))
+		goto check_op_term;
+
 	if (!nvme_try_complete_req(rq, status, result))
 		nvme_fc_complete_rq(rq);
+check_op_term:
+	if (op_term)
+		__nvme_fc_fcpop_count_one_down(ctrl);
 
 check_error:
 	if (terminate_assoc)
@@ -2754,7 +2786,8 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 		 * cmd with the csn was supposed to arrive.
 		 */
 		opstate = atomic_xchg(&op->state, FCPOP_STATE_COMPLETE);
-		__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate);
+		if (__nvme_fc_fcpop_chk_teardowns(ctrl, op, opstate))
+			__nvme_fc_fcpop_count_one_down(ctrl);
 
 		if (!(op->flags & FCOP_FLAGS_AEN)) {
 			nvme_fc_unmap_data(ctrl, op->rq, op);
@@ -3223,7 +3256,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 
 	spin_lock_irqsave(&ctrl->lock, flags);
 	set_bit(FCCTRL_TERMIO, &ctrl->flags);
-	ctrl->iocnt = 0;
+	atomic_set(&ctrl->iocnt, 0);
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 
 	__nvme_fc_abort_outstanding_ios(ctrl, false);
@@ -3232,11 +3265,19 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
 	nvme_fc_abort_aen_ops(ctrl);
 
 	/* wait for all io that had to be aborted */
+	wait_event(ctrl->ioabort_wait, atomic_read(&ctrl->iocnt) == 0);
 	spin_lock_irq(&ctrl->lock);
-	wait_event_lock_irq(ctrl->ioabort_wait, ctrl->iocnt == 0, ctrl->lock);
 	clear_bit(FCCTRL_TERMIO, &ctrl->flags);
 	spin_unlock_irq(&ctrl->lock);
 
+	/*
+	 * At this point all inflight requests have been successfully
+	 * aborted. Now it is safe to cancel all requests we decided
+	 * not to complete in nvme_fc_fcpio_done().
+	 */
+	nvme_cancel_tagset(&ctrl->ctrl);
+	nvme_cancel_admin_tagset(&ctrl->ctrl);
+
 	nvme_fc_term_aen_ops(ctrl);
 
 	/*
-- 
2.52.0




* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-01-30 22:34 ` [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
@ 2026-02-03  3:03   ` Hannes Reinecke
  2026-02-03 18:14     ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:03 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> TP8028 Rapid Path Failure Recovery defined new fields in controller
> identify response. The newly defined fields are:
> 
> - CIU (Controller Instance Uniquifier): an 8-bit non-zero value that
> is assigned a random value when the controller is first created. The
> value is expected to be incremented when the RDY bit in the CSTS
> register is asserted.
> - CIRN (Controller Instance Random Number): a 64-bit random value that
> is generated when the controller is created. CIRN is regenerated every
> time the RDY bit in the CSTS register is asserted.
> - CCRL (Cross-Controller Reset Limit) is an 8bit value that defines the
> maximum number of in-progress controller reset operations. CCRL is
> hardcoded to 4 as recommended by TP8028.
> 
> TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> Time), which is used along with KATO (Keep Alive Timeout) to set an
> upper time limit for attempting Cross-Controller Recovery. For an NVMe
> target subsystem, CQT is set to 0 by default to keep the current
> behavior. The value can be set from configfs if needed.
> 
> Make the new fields available for IO controllers only since TP8028 is
> not very useful for discovery controllers.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/target/admin-cmd.c |  6 ++++++
>   drivers/nvme/target/configfs.c  | 31 +++++++++++++++++++++++++++++++
>   drivers/nvme/target/core.c      | 12 ++++++++++++
>   drivers/nvme/target/nvmet.h     |  4 ++++
>   include/linux/nvme.h            | 15 ++++++++++++---
>   5 files changed, 65 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> index 3da31bb1183e..ade1145df72d 100644
> --- a/drivers/nvme/target/admin-cmd.c
> +++ b/drivers/nvme/target/admin-cmd.c
> @@ -696,6 +696,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
>   
>   	id->cntlid = cpu_to_le16(ctrl->cntlid);
>   	id->ver = cpu_to_le32(ctrl->subsys->ver);
> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> +		id->cqt = cpu_to_le16(ctrl->cqt);
> +		id->ciu = ctrl->ciu;
> +		id->cirn = cpu_to_le64(ctrl->cirn);
> +		id->ccrl = NVMF_CCR_LIMIT;
> +	}
>   
>   	/* XXX: figure out what to do about RTD3R/RTD3 */
>   	id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index e44ef69dffc2..035f6e75a818 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -1636,6 +1636,36 @@ static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
>   CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
>   #endif
>   
> +static ssize_t nvmet_subsys_attr_cqt_show(struct config_item *item,
> +					  char *page)
> +{
> +	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->cqt);
> +}
> +
> +static ssize_t nvmet_subsys_attr_cqt_store(struct config_item *item,
> +					   const char *page, size_t cnt)
> +{
> +	struct nvmet_subsys *subsys = to_subsys(item);
> +	struct nvmet_ctrl *ctrl;
> +	u16 cqt;
> +
> +	if (sscanf(page, "%hu\n", &cqt) != 1)
> +		return -EINVAL;
> +
> +	down_write(&nvmet_config_sem);
> +	if (subsys->cqt == cqt)
> +		goto out;
> +
> +	subsys->cqt = cqt;
> +	/* Force reconnect */
> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> +		ctrl->ops->delete_ctrl(ctrl);
> +out:
> +	up_write(&nvmet_config_sem);
> +	return cnt;
> +}
> +CONFIGFS_ATTR(nvmet_subsys_, attr_cqt);
> +
>   static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
>   					      char *page)
>   {
> @@ -1676,6 +1706,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = {
>   	&nvmet_subsys_attr_attr_vendor_id,
>   	&nvmet_subsys_attr_attr_subsys_vendor_id,
>   	&nvmet_subsys_attr_attr_model,
> +	&nvmet_subsys_attr_attr_cqt,
>   	&nvmet_subsys_attr_attr_qid_max,
>   	&nvmet_subsys_attr_attr_ieee_oui,
>   	&nvmet_subsys_attr_attr_firmware,

I do think that the CQT definitions (which come from TP4129) are
somewhat independent of CCR. So I'm not sure they should be integrated
in this patchset; personally I would prefer to have them moved to a
separate patchset.

> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index cc88e5a28c8a..0d2a1206e08f 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -1393,6 +1393,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
>   		return;
>   	}
>   
> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> +		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
> +		ctrl->cirn = get_random_u64();
> +	}
>   	ctrl->csts = NVME_CSTS_RDY;
>   
>   	/*
> @@ -1661,6 +1665,12 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>   	}
>   	ctrl->cntlid = ret;
>   
> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> +		ctrl->cqt = subsys->cqt;
> +		ctrl->ciu = get_random_u8() ? : 1;
> +		ctrl->cirn = get_random_u64();
> +	}
> +
>   	/*
>   	 * Discovery controllers may use some arbitrary high value
>   	 * in order to cleanup stale discovery sessions
> @@ -1853,10 +1863,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn,
>   
>   	switch (type) {
>   	case NVME_NQN_NVME:
> +		subsys->cqt = NVMF_CQT_MS;
>   		subsys->max_qid = NVMET_NR_QUEUES;
>   		break;

And I would not set the CQT default here.
Thing is, implementing CQT to the letter would inflict a CQT delay
during failover for _every_ installation, thereby causing a regression
relative to previous implementations, where we would fail over with
_no_ delay.
So again, this should be a separate patchset.

>   	case NVME_NQN_DISC:
>   	case NVME_NQN_CURR:
> +		subsys->cqt = 0;
>   		subsys->max_qid = 0;
>   		break;
>   	default:
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index b664b584fdc8..f5d9a01ec60c 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -264,7 +264,10 @@ struct nvmet_ctrl {
>   
>   	uuid_t			hostid;
>   	u16			cntlid;
> +	u16			cqt;
> +	u8			ciu;
>   	u32			kato;
> +	u64			cirn;
>   
>   	struct nvmet_port	*port;
>   
> @@ -331,6 +334,7 @@ struct nvmet_subsys {
>   #ifdef CONFIG_NVME_TARGET_DEBUGFS
>   	struct dentry		*debugfs_dir;
>   #endif
> +	u16			cqt;
>   	u16			max_qid;
>   
>   	u64			ver;
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 655d194f8e72..5135cdc3c120 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -21,6 +21,9 @@
>   #define NVMF_TRADDR_SIZE	256
>   #define NVMF_TSAS_SIZE		256
>   
> +#define NVMF_CQT_MS		0
> +#define NVMF_CCR_LIMIT		4
> +
>   #define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"
>   
>   #define NVME_NSID_ALL		0xffffffff
> @@ -328,7 +331,10 @@ struct nvme_id_ctrl {
>   	__le16			crdt1;
>   	__le16			crdt2;
>   	__le16			crdt3;
> -	__u8			rsvd134[122];
> +	__u8			rsvd134[1];
> +	__u8			ciu;
> +	__le64			cirn;
> +	__u8			rsvd144[112];
>   	__le16			oacs;
>   	__u8			acl;
>   	__u8			aerl;
> @@ -362,7 +368,9 @@ struct nvme_id_ctrl {
>   	__u8			anacap;
>   	__le32			anagrpmax;
>   	__le32			nanagrpid;
> -	__u8			rsvd352[160];
> +	__u8			rsvd352[34];
> +	__le16			cqt;
> +	__u8			rsvd388[124];
>   	__u8			sqes;
>   	__u8			cqes;
>   	__le16			maxcmd;
> @@ -389,7 +397,8 @@ struct nvme_id_ctrl {
>   	__u8			msdbd;
>   	__u8			rsvd1804[2];
>   	__u8			dctype;
> -	__u8			rsvd1807[241];
> +	__u8			ccrl;
> +	__u8			rsvd1808[240];
>   	struct nvme_id_power_state	psd[32];
>   	__u8			vs[1024];
>   };

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
  2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
@ 2026-02-03  3:04   ` Hannes Reinecke
  2026-02-07 13:47   ` Sagi Grimberg
  2026-02-11  0:50   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:04 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> Export ctrl->random and ctrl->uniquifier as debugfs files under
> controller debugfs directory.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/target/debugfs.c | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-01-30 22:34 ` [PATCH v2 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
@ 2026-02-03  3:19   ` Hannes Reinecke
  2026-02-03 18:40     ` Mohamed Khalfella
  2026-02-07 14:11   ` Sagi Grimberg
  1 sibling, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:19 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> Defined by TP8028 Rapid Path Failure Recovery, the CCR
> (Cross-Controller Reset) command is an nvme command issued to the
> source controller by the initiator to reset the impacted controller.
> Implement the CCR command for the Linux nvme target.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/target/admin-cmd.c | 74 +++++++++++++++++++++++++++++++++
>   drivers/nvme/target/core.c      | 71 +++++++++++++++++++++++++++++++
>   drivers/nvme/target/nvmet.h     | 13 ++++++
>   include/linux/nvme.h            | 23 ++++++++++
>   4 files changed, 181 insertions(+)
> 
> diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> index ade1145df72d..c0fd8eca2e44 100644
> --- a/drivers/nvme/target/admin-cmd.c
> +++ b/drivers/nvme/target/admin-cmd.c
> @@ -376,7 +376,9 @@ static void nvmet_get_cmd_effects_admin(struct nvmet_ctrl *ctrl,
>   	log->acs[nvme_admin_get_features] =
>   	log->acs[nvme_admin_async_event] =
>   	log->acs[nvme_admin_keep_alive] =
> +	log->acs[nvme_admin_cross_ctrl_reset] =
>   		cpu_to_le32(NVME_CMD_EFFECTS_CSUPP);
> +
>   }
>   
>   static void nvmet_get_cmd_effects_nvm(struct nvme_effects_log *log)
> @@ -1615,6 +1617,75 @@ void nvmet_execute_keep_alive(struct nvmet_req *req)
>   	nvmet_req_complete(req, status);
>   }
>   
> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req)
> +{
> +	struct nvmet_ctrl *ictrl, *sctrl = req->sq->ctrl;
> +	struct nvme_command *cmd = req->cmd;
> +	struct nvmet_ccr *ccr, *new_ccr;
> +	int ccr_active, ccr_total;
> +	u16 cntlid, status = NVME_SC_SUCCESS;
> +
> +	cntlid = le16_to_cpu(cmd->ccr.icid);
> +	if (sctrl->cntlid == cntlid) {
> +		req->error_loc =
> +			offsetof(struct nvme_cross_ctrl_reset_cmd, icid);
> +		status = NVME_SC_INVALID_FIELD | NVME_STATUS_DNR;
> +		goto out;
> +	}
> +
> +	/* Find and get impacted controller */
> +	ictrl = nvmet_ctrl_find_get_ccr(sctrl->subsys, sctrl->hostnqn,
> +					cmd->ccr.ciu, cntlid,
> +					le64_to_cpu(cmd->ccr.cirn));
> +	if (!ictrl) {
> +		/* Immediate Reset Successful */
> +		nvmet_set_result(req, 1);
> +		status = NVME_SC_SUCCESS;
> +		goto out;
> +	}
> +
> +	ccr_total = ccr_active = 0;
> +	mutex_lock(&sctrl->lock);
> +	list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
> +		if (ccr->ctrl == ictrl) {
> +			status = NVME_SC_CCR_IN_PROGRESS | NVME_STATUS_DNR;
> +			goto out_unlock;
> +		}
> +
> +		ccr_total++;
> +		if (ccr->ctrl)
> +			ccr_active++;
> +	}
> +
> +	if (ccr_active >= NVMF_CCR_LIMIT) {
> +		status = NVME_SC_CCR_LIMIT_EXCEEDED;
> +		goto out_unlock;
> +	}
> +	if (ccr_total >= NVMF_CCR_PER_PAGE) {
> +		status = NVME_SC_CCR_LOGPAGE_FULL;
> +		goto out_unlock;
> +	}
> +
> +	new_ccr = kmalloc(sizeof(*new_ccr), GFP_KERNEL);
> +	if (!new_ccr) {
> +		status = NVME_SC_INTERNAL;
> +		goto out_unlock;
> +	}
> +
> +	new_ccr->ciu = cmd->ccr.ciu;
> +	new_ccr->icid = cntlid;
> +	new_ccr->ctrl = ictrl;
> +	list_add_tail(&new_ccr->entry, &sctrl->ccr_list);
> +
> +out_unlock:
> +	mutex_unlock(&sctrl->lock);
> +	if (status == NVME_SC_SUCCESS)
> +		nvmet_ctrl_fatal_error(ictrl);
> +	nvmet_ctrl_put(ictrl);
> +out:
> +	nvmet_req_complete(req, status);
> +}
> +
>   u32 nvmet_admin_cmd_data_len(struct nvmet_req *req)
>   {
>   	struct nvme_command *cmd = req->cmd;
> @@ -1692,6 +1763,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
>   	case nvme_admin_keep_alive:
>   		req->execute = nvmet_execute_keep_alive;
>   		return 0;
> +	case nvme_admin_cross_ctrl_reset:
> +		req->execute = nvmet_execute_cross_ctrl_reset;
> +		return 0;
>   	default:
>   		return nvmet_report_invalid_opcode(req);
>   	}
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 0d2a1206e08f..54dd0dcfa12b 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -114,6 +114,20 @@ u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
>   	return 0;
>   }
>   
> +void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all)
> +{
> +	struct nvmet_ccr *ccr, *tmp;
> +
> +	lockdep_assert_held(&ctrl->lock);
> +
> +	list_for_each_entry_safe(ccr, tmp, &ctrl->ccr_list, entry) {
> +		if (all || ccr->ctrl == NULL) {
> +			list_del(&ccr->entry);
> +			kfree(ccr);
> +		}
> +	}
> +}
> +
>   static u32 nvmet_max_nsid(struct nvmet_subsys *subsys)
>   {
>   	struct nvmet_ns *cur;
> @@ -1396,6 +1410,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
>   	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
>   		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
>   		ctrl->cirn = get_random_u64();
> +		nvmet_ctrl_cleanup_ccrs(ctrl, false);
>   	}
>   	ctrl->csts = NVME_CSTS_RDY;
>   
> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
>   	return ctrl;
>   }
>   
> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> +					   const char *hostnqn, u8 ciu,
> +					   u16 cntlid, u64 cirn)
> +{
> +	struct nvmet_ctrl *ctrl;
> +	bool found = false;
> +
> +	mutex_lock(&subsys->lock);
> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> +		if (ctrl->cntlid != cntlid)
> +			continue;
> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> +			continue;
> +
Why do we compare the hostnqn here, too? To my understanding, the host
NQN is tied to the controller, so the controller ID should be sufficient
here.

> +		/* Avoid racing with a controller that is becoming ready */
> +		mutex_lock(&ctrl->lock);
> +		if (ctrl->ciu == ciu && ctrl->cirn == cirn)
> +			found = true;
> +		mutex_unlock(&ctrl->lock);
> +
> +		if (found) {
> +			if (kref_get_unless_zero(&ctrl->ref))
> +				goto out;
> +			break;
> +		}
> +	};
> +	ctrl = NULL;
> +out:
> +	mutex_unlock(&subsys->lock);
> +	return ctrl;
> +}
> +
>   u16 nvmet_check_ctrl_status(struct nvmet_req *req)
>   {
>   	if (unlikely(!(req->sq->ctrl->cc & NVME_CC_ENABLE))) {
> @@ -1626,6 +1673,7 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>   		subsys->clear_ids = 1;
>   #endif
>   
> +	INIT_LIST_HEAD(&ctrl->ccr_list);
>   	INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work);
>   	INIT_LIST_HEAD(&ctrl->async_events);
>   	INIT_RADIX_TREE(&ctrl->p2p_ns_map, GFP_KERNEL);
> @@ -1740,12 +1788,35 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>   }
>   EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>   
> +static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> +{
> +	struct nvmet_subsys *subsys = ctrl->subsys;
> +	struct nvmet_ctrl *sctrl;
> +	struct nvmet_ccr *ccr;
> +
> +	mutex_lock(&ctrl->lock);
> +	nvmet_ctrl_cleanup_ccrs(ctrl, true);
> +	mutex_unlock(&ctrl->lock);
> +
> +	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> +		mutex_lock(&sctrl->lock);
> +		list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
> +			if (ccr->ctrl == ctrl) {
> +				ccr->ctrl = NULL;
> +				break;
> +			}
> +		}
> +		mutex_unlock(&sctrl->lock);
> +	}
> +}
> +

Maybe add documentation here that the first CCR cleanup is for clearing
CCRs issued from this controller, and the second is for CCRs issued _to_
this controller.
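
Something along these lines, perhaps (wording is only a suggestion,
not required text):

```c
/*
 * Two cleanup passes are needed here: nvmet_ctrl_cleanup_ccrs() drops
 * the CCR entries this controller issued as a *source* controller,
 * while the loop over subsys->ctrls marks as completed any CCR entries
 * on other controllers that *target* this (impacted) controller.
 */
```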

>   static void nvmet_ctrl_free(struct kref *ref)
>   {
>   	struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
>   	struct nvmet_subsys *subsys = ctrl->subsys;
>   
>   	mutex_lock(&subsys->lock);
> +	nvmet_ctrl_complete_pending_ccr(ctrl);
>   	nvmet_ctrl_destroy_pr(ctrl);
>   	nvmet_release_p2p_ns_map(ctrl);
>   	list_del(&ctrl->subsys_entry);
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index f5d9a01ec60c..93d6ac41cf85 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -269,6 +269,7 @@ struct nvmet_ctrl {
>   	u32			kato;
>   	u64			cirn;
>   
> +	struct list_head	ccr_list;
>   	struct nvmet_port	*port;
>   
>   	u32			aen_enabled;
> @@ -315,6 +316,13 @@ struct nvmet_ctrl {
>   	struct nvmet_pr_log_mgr pr_log_mgr;
>   };
>   
> +struct nvmet_ccr {
> +	struct nvmet_ctrl	*ctrl;
> +	struct list_head	entry;
> +	u16			icid;
> +	u8			ciu;
> +};
> +
>   struct nvmet_subsys {
>   	enum nvme_subsys_type	type;
>   
> @@ -578,6 +586,7 @@ void nvmet_req_free_sgls(struct nvmet_req *req);
>   void nvmet_execute_set_features(struct nvmet_req *req);
>   void nvmet_execute_get_features(struct nvmet_req *req);
>   void nvmet_execute_keep_alive(struct nvmet_req *req);
> +void nvmet_execute_cross_ctrl_reset(struct nvmet_req *req);
>   
>   u16 nvmet_check_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
>   u16 nvmet_check_io_cqid(struct nvmet_ctrl *ctrl, u16 cqid, bool create);
> @@ -620,6 +629,10 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args);
>   struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
>   				       const char *hostnqn, u16 cntlid,
>   				       struct nvmet_req *req);
> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> +					   const char *hostnqn, u8 ciu,
> +					   u16 cntlid, u64 cirn);
> +void nvmet_ctrl_cleanup_ccrs(struct nvmet_ctrl *ctrl, bool all);
>   void nvmet_ctrl_put(struct nvmet_ctrl *ctrl);
>   u16 nvmet_check_ctrl_status(struct nvmet_req *req);
>   ssize_t nvmet_ctrl_host_traddr(struct nvmet_ctrl *ctrl,
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 5135cdc3c120..0f305b317aa3 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -23,6 +23,7 @@
>   
>   #define NVMF_CQT_MS		0
>   #define NVMF_CCR_LIMIT		4
> +#define NVMF_CCR_PER_PAGE	511
>   
>   #define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"
>   
> @@ -1225,6 +1226,22 @@ struct nvme_zone_mgmt_recv_cmd {
>   	__le32			cdw14[2];
>   };
>   
> +struct nvme_cross_ctrl_reset_cmd {
> +	__u8			opcode;
> +	__u8			flags;
> +	__u16			command_id;
> +	__le32			nsid;
> +	__le64			rsvd2[2];
> +	union nvme_data_ptr	dptr;
> +	__u8			rsvd10;
> +	__u8			ciu;
> +	__le16			icid;
> +	__le32			cdw11;
> +	__le64			cirn;
> +	__le32			cdw14;
> +	__le32			cdw15;
> +};
> +
>   struct nvme_io_mgmt_recv_cmd {
>   	__u8			opcode;
>   	__u8			flags;

I would have expected these definitions in the
first patch. But probably not that important.

> @@ -1323,6 +1340,7 @@ enum nvme_admin_opcode {
>   	nvme_admin_virtual_mgmt		= 0x1c,
>   	nvme_admin_nvme_mi_send		= 0x1d,
>   	nvme_admin_nvme_mi_recv		= 0x1e,
> +	nvme_admin_cross_ctrl_reset	= 0x38,
>   	nvme_admin_dbbuf		= 0x7C,
>   	nvme_admin_format_nvm		= 0x80,
>   	nvme_admin_security_send	= 0x81,
> @@ -1356,6 +1374,7 @@ enum nvme_admin_opcode {
>   		nvme_admin_opcode_name(nvme_admin_virtual_mgmt),	\
>   		nvme_admin_opcode_name(nvme_admin_nvme_mi_send),	\
>   		nvme_admin_opcode_name(nvme_admin_nvme_mi_recv),	\
> +		nvme_admin_opcode_name(nvme_admin_cross_ctrl_reset),	\
>   		nvme_admin_opcode_name(nvme_admin_dbbuf),		\
>   		nvme_admin_opcode_name(nvme_admin_format_nvm),		\
>   		nvme_admin_opcode_name(nvme_admin_security_send),	\
> @@ -2009,6 +2028,7 @@ struct nvme_command {
>   		struct nvme_dbbuf dbbuf;
>   		struct nvme_directive_cmd directive;
>   		struct nvme_io_mgmt_recv_cmd imr;
> +		struct nvme_cross_ctrl_reset_cmd ccr;
>   	};
>   };
>   
> @@ -2173,6 +2193,9 @@ enum {
>   	NVME_SC_PMR_SAN_PROHIBITED	= 0x123,
>   	NVME_SC_ANA_GROUP_ID_INVALID	= 0x124,
>   	NVME_SC_ANA_ATTACH_FAILED	= 0x125,
> +	NVME_SC_CCR_IN_PROGRESS		= 0x13f,
> +	NVME_SC_CCR_LOGPAGE_FULL	= 0x140,
> +	NVME_SC_CCR_LIMIT_EXCEEDED	= 0x141,
>   
>   	/*
>   	 * I/O Command Set Specific - NVM commands:

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 04/14] nvmet: Implement CCR logpage
  2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
@ 2026-02-03  3:21   ` Hannes Reinecke
  2026-02-07 14:11   ` Sagi Grimberg
  2026-02-11  1:49   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:21 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> Defined by TP8028 Rapid Path Failure Recovery, the CCR
> (Cross-Controller Reset) log page contains an entry for each CCR
> request submitted to the source controller. Implement the CCR log page
> for the Linux nvme target.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/target/admin-cmd.c | 44 +++++++++++++++++++++++++++++++++
>   include/linux/nvme.h            | 29 ++++++++++++++++++++++
>   2 files changed, 73 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
@ 2026-02-03  3:27   ` Hannes Reinecke
  2026-02-03 18:48     ` Mohamed Khalfella
  2026-02-07 14:12   ` Sagi Grimberg
  2026-02-11  1:52   ` Randy Jennings
  2 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:27 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> Send an AEN to the initiator when an impacted controller exists. The
> notification points to the CCR log page that the initiator can read to
> check which CCR operation completed.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/target/core.c  | 25 ++++++++++++++++++++++---
>   drivers/nvme/target/nvmet.h |  3 ++-
>   include/linux/nvme.h        |  3 +++
>   3 files changed, 27 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 54dd0dcfa12b..ae2fe9f90bcd 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
>   	nvmet_async_events_process(ctrl);
>   }
>   
> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
>   		u8 event_info, u8 log_page)
>   {
>   	struct nvmet_async_event *aen;
> @@ -215,13 +215,19 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>   	aen->event_info = event_info;
>   	aen->log_page = log_page;
>   
> -	mutex_lock(&ctrl->lock);
>   	list_add_tail(&aen->entry, &ctrl->async_events);
> -	mutex_unlock(&ctrl->lock);
>   
>   	queue_work(nvmet_wq, &ctrl->async_event_work);
>   }
>   
> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> +		u8 event_info, u8 log_page)
> +{
> +	mutex_lock(&ctrl->lock);
> +	nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> +	mutex_unlock(&ctrl->lock);
> +}
> +
>   static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
>   {
>   	u32 i;
> @@ -1788,6 +1794,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>   }
>   EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>   
> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> +{
> +	lockdep_assert_held(&ctrl->lock);
> +
> +	if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> +		return;
> +
> +	nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> +				     NVME_AER_NOTICE_CCR_COMPLETED,
> +				     NVME_LOG_CCR);
> +}
> +
>   static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>   {
>   	struct nvmet_subsys *subsys = ctrl->subsys;

But what does the CCR command actually _do_?
At the very least I would have expected it to trigger a controller reset
(e.g. calling into nvmet_ctrl_fatal_error()), yet I don't see it doing
that anywhere ...

> @@ -1803,6 +1821,7 @@ static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>   		list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
>   			if (ccr->ctrl == ctrl) {
>   				ccr->ctrl = NULL;
> +				nvmet_ctrl_notify_ccr(sctrl);
>   				break;
>   			}
>   		}
> diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
> index 93d6ac41cf85..00528feeb3cd 100644
> --- a/drivers/nvme/target/nvmet.h
> +++ b/drivers/nvme/target/nvmet.h
> @@ -44,7 +44,8 @@
>    * Supported optional AENs:
>    */
>   #define NVMET_AEN_CFG_OPTIONAL \
> -	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE)
> +	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_ANA_CHANGE | \
> +	 NVME_AEN_CFG_CCR_COMPLETE)
>   #define NVMET_DISC_AEN_CFG_OPTIONAL \
>   	(NVME_AEN_CFG_DISC_CHANGE)
>   
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 3e189674b69e..f6d66dadc5b1 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -863,12 +863,14 @@ enum {
>   	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
>   	NVME_AER_NOTICE_ANA		= 0x03,
>   	NVME_AER_NOTICE_DISC_CHANGED	= 0xf0,
> +	NVME_AER_NOTICE_CCR_COMPLETED	= 0xf4,
>   };
>   
>   enum {
>   	NVME_AEN_BIT_NS_ATTR		= 8,
>   	NVME_AEN_BIT_FW_ACT		= 9,
>   	NVME_AEN_BIT_ANA_CHANGE		= 11,
> +	NVME_AEN_BIT_CCR_COMPLETE	= 20,
>   	NVME_AEN_BIT_DISC_CHANGE	= 31,
>   };
>   
> @@ -876,6 +878,7 @@ enum {
>   	NVME_AEN_CFG_NS_ATTR		= 1 << NVME_AEN_BIT_NS_ATTR,
>   	NVME_AEN_CFG_FW_ACT		= 1 << NVME_AEN_BIT_FW_ACT,
>   	NVME_AEN_CFG_ANA_CHANGE		= 1 << NVME_AEN_BIT_ANA_CHANGE,
> +	NVME_AEN_CFG_CCR_COMPLETE	= 1 << NVME_AEN_BIT_CCR_COMPLETE,
>   	NVME_AEN_CFG_DISC_CHANGE	= 1 << NVME_AEN_BIT_DISC_CHANGE,
>   };
>   
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
  2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
@ 2026-02-03  3:28   ` Hannes Reinecke
  2026-02-07 14:13   ` Sagi Grimberg
  2026-02-11  1:56   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  3:28 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> TP8028 Rapid Path Failure Recovery added new fields to the controller
> identify response. Read CIU (Controller Instance Uniquifier), CIRN
> (Controller Instance Random Number), and CCRL (Cross-Controller Reset
> Limit) from the controller identify response. Expose CIU and CIRN as
> sysfs attributes so the values can be used directly by the user if
> needed.
> 
> TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> Time) which is used along with KATO (Keep Alive Timeout) to set an upper
> limit for attempting Cross-Controller Recovery.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/core.c  |  5 +++++
>   drivers/nvme/host/nvme.h  | 11 +++++++++++
>   drivers/nvme/host/sysfs.c | 23 +++++++++++++++++++++++
>   3 files changed, 39 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states
  2026-01-30 22:34 ` [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
@ 2026-02-03  5:07   ` Hannes Reinecke
  2026-02-03 19:13     ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:07 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> FENCING is a new controller state that a LIVE controller enters when an
> error is encountered. While in the FENCING state, inflight IOs that
> time out are not canceled because they should be held until either CCR
> succeeds or time-based recovery completes. While the queues remain
> alive, requests are not allowed to be sent in this state and the
> controller cannot be reset or deleted. This is intentional because
> resetting or deleting the controller results in canceling inflight IOs.
> 
> FENCED is a short-term state the controller enters before it is reset.
> It exists only to prevent manual resets from happening while the
> controller is in the FENCING state.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/core.c  | 25 +++++++++++++++++++++++--
>   drivers/nvme/host/nvme.h  |  4 ++++
>   drivers/nvme/host/sysfs.c |  2 ++
>   3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8961d612ccb0..3e1e02822dd4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -574,10 +574,29 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   			break;
>   		}
>   		break;
> +	case NVME_CTRL_FENCING:
> +		switch (old_state) {
> +		case NVME_CTRL_LIVE:
> +			changed = true;
> +			fallthrough;
> +		default:
> +			break;
> +		}
> +		break;
> +	case NVME_CTRL_FENCED:
> +		switch (old_state) {
> +		case NVME_CTRL_FENCING:
> +			changed = true;
> +			fallthrough;
> +		default:
> +			break;
> +		}
> +		break;
>   	case NVME_CTRL_RESETTING:
>   		switch (old_state) {
>   		case NVME_CTRL_NEW:
>   		case NVME_CTRL_LIVE:
> +		case NVME_CTRL_FENCED:
>   			changed = true;
>   			fallthrough;
>   		default:
> @@ -760,6 +779,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
>   
>   	if (state != NVME_CTRL_DELETING_NOIO &&
>   	    state != NVME_CTRL_DELETING &&
> +	    state != NVME_CTRL_FENCING &&

Shouldn't 'FENCED' be in here, too?

>   	    state != NVME_CTRL_DEAD &&
>   	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
>   	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> @@ -802,10 +822,11 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>   			     req->cmd->fabrics.fctype == nvme_fabrics_type_auth_receive))
>   				return true;
>   			break;
> -		default:
> -			break;
> +		case NVME_CTRL_FENCING:

Similar here.

>   		case NVME_CTRL_DEAD:
>   			return false;
> +		default:
> +			break;
>   		}
>   	}
>   
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9dd9f179ad88..00866bbc66f3 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -251,6 +251,8 @@ static inline u16 nvme_req_qid(struct request *req)
>   enum nvme_ctrl_state {
>   	NVME_CTRL_NEW,
>   	NVME_CTRL_LIVE,
> +	NVME_CTRL_FENCING,
> +	NVME_CTRL_FENCED,
>   	NVME_CTRL_RESETTING,
>   	NVME_CTRL_CONNECTING,
>   	NVME_CTRL_DELETING,
> @@ -777,6 +779,8 @@ static inline bool nvme_state_terminal(struct nvme_ctrl *ctrl)
>   	switch (nvme_ctrl_state(ctrl)) {
>   	case NVME_CTRL_NEW:
>   	case NVME_CTRL_LIVE:
> +	case NVME_CTRL_FENCING:
> +	case NVME_CTRL_FENCED:
>   	case NVME_CTRL_RESETTING:
>   	case NVME_CTRL_CONNECTING:
>   		return false;
> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> index f81bbb6ec768..4ec9dfeb736e 100644
> --- a/drivers/nvme/host/sysfs.c
> +++ b/drivers/nvme/host/sysfs.c
> @@ -443,6 +443,8 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
>   	static const char *const state_name[] = {
>   		[NVME_CTRL_NEW]		= "new",
>   		[NVME_CTRL_LIVE]	= "live",
> +		[NVME_CTRL_FENCING]	= "fencing",
> +		[NVME_CTRL_FENCED]	= "fenced",
>   		[NVME_CTRL_RESETTING]	= "resetting",
>   		[NVME_CTRL_CONNECTING]	= "connecting",
>   		[NVME_CTRL_DELETING]	= "deleting",

You need to modify nvme-tcp.c:nvme_tcp_timeout() too, as this checks
'just' for 'LIVE' state and will abort/terminate commands when in
FENCING. Similar argument for nvme-rdma.c. And nvme-fc.c also needs an
audit to ensure it works correctly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-01-30 22:34 ` [PATCH v2 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
@ 2026-02-03  5:19   ` Hannes Reinecke
  2026-02-03 20:00     ` Mohamed Khalfella
  2026-02-10 22:09   ` James Smart
  1 sibling, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:19 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> A host that has more than one path connecting to an nvme subsystem
> typically has an nvme controller associated with every path. This is
> mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> path should not be retried immediately on another path because this
> could lead to data corruption as described in TP4129. TP8028 defines
> cross-controller reset mechanism that can be used by host to terminate
> IOs on the failed path using one of the remaining healthy paths. Only
> after IOs are terminated, or long enough time passes as defined by
> TP4129, inflight IOs should be retried on another path. Implement core
> cross-controller reset shared logic to be used by the transports.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/constants.c |   1 +
>   drivers/nvme/host/core.c      | 129 ++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h      |   9 +++
>   3 files changed, 139 insertions(+)
> 
> diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> index dc90df9e13a2..f679efd5110e 100644
> --- a/drivers/nvme/host/constants.c
> +++ b/drivers/nvme/host/constants.c
> @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
>   	[nvme_admin_virtual_mgmt] = "Virtual Management",
>   	[nvme_admin_nvme_mi_send] = "NVMe Send MI",
>   	[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> +	[nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
>   	[nvme_admin_dbbuf] = "Doorbell Buffer Config",
>   	[nvme_admin_format_nvm] = "Format NVM",
>   	[nvme_admin_security_send] = "Security Send",
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3e1e02822dd4..13e0775d56b4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -554,6 +554,134 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
>   }
>   EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
>   
> +static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
> +					    u32 min_cntlid)
> +{
> +	struct nvme_subsystem *subsys = ictrl->subsys;
> +	struct nvme_ctrl *sctrl;
> +	unsigned long flags;
> +
> +	mutex_lock(&nvme_subsystems_lock);
> +	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> +		if (sctrl->cntlid < min_cntlid)
> +			continue;
> +
> +		if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
> +			continue;
> +
> +		spin_lock_irqsave(&sctrl->lock, flags);
> +		if (sctrl->state != NVME_CTRL_LIVE) {
> +			spin_unlock_irqrestore(&sctrl->lock, flags);
> +			atomic_inc(&sctrl->ccr_limit);
> +			continue;
> +		}
> +
> +		/*
> +		 * We got a good candidate source controller that is locked and
> +		 * LIVE. However, no guarantee sctrl will not be deleted after
> +		 * sctrl->lock is released. Get a ref of both sctrl and admin_q
> +		 * so they do not disappear until we are done with them.
> +		 */
> +		WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
> +		nvme_get_ctrl(sctrl);
> +		spin_unlock_irqrestore(&sctrl->lock, flags);
> +		goto found;
> +	}
> +	sctrl = NULL;
> +found:

Normally one would use a temporary loop variable and assign 'sctrl' from
it when found. Then you can just call 'break' and drop the 'goto'.

> +	mutex_unlock(&nvme_subsystems_lock);
> +	return sctrl;
> +}
> +
> +static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
> +{
> +	atomic_inc(&sctrl->ccr_limit);
> +	blk_put_queue(sctrl->admin_q);
> +	nvme_put_ctrl(sctrl);
> +}
> +
> +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> +{
> +	struct nvme_ccr_entry ccr = { };
> +	union nvme_result res = { 0 };
> +	struct nvme_command c = { };
> +	unsigned long flags, tmo;
> +	int ret = 0;
> +	u32 result;
> +
> +	init_completion(&ccr.complete);
> +	ccr.ictrl = ictrl;
> +
> +	spin_lock_irqsave(&sctrl->lock, flags);
> +	list_add_tail(&ccr.list, &sctrl->ccr_list);
> +	spin_unlock_irqrestore(&sctrl->lock, flags);
> +
> +	c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> +	c.ccr.ciu = ictrl->ciu;
> +	c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> +	c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> +	ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> +				     NULL, 0, NVME_QID_ANY, 0);
> +	if (ret)
> +		goto out;
> +
> +	result = le32_to_cpu(res.u32);
> +	if (result & 0x01) /* Immediate Reset Successful */
> +		goto out;
> +
> +	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> +	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> +		ret = -ETIMEDOUT;
> +		goto out;
> +	}
> +
> +	if (ccr.ccrs != NVME_CCR_STATUS_SUCCESS)
> +		ret = -EREMOTEIO;
> +out:
> +	spin_lock_irqsave(&sctrl->lock, flags);
> +	list_del(&ccr.list);
> +	spin_unlock_irqrestore(&sctrl->lock, flags);
> +	return ret;
> +}
> +
> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> +{
> +	unsigned long deadline, now, timeout;
> +	struct nvme_ctrl *sctrl;
> +	u32 min_cntlid = 0;
> +	int ret;
> +
> +	timeout = nvme_fence_timeout_ms(ictrl);
> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> +
> +	now = jiffies;
> +	deadline = now + msecs_to_jiffies(timeout);
> +	while (time_before(now, deadline)) {
> +		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> +		if (!sctrl) {
> +			/* CCR failed, switch to time-based recovery */
> +			return deadline - now;
> +		}
> +
> +		ret = nvme_issue_wait_ccr(sctrl, ictrl);
> +		if (!ret) {
> +			dev_info(ictrl->device, "CCR succeeded using %s\n",
> +				 dev_name(sctrl->device));
> +			nvme_put_ctrl_ccr(sctrl);
> +			return 0;
> +		}
> +
> +		/* CCR failed, try another path */
> +		min_cntlid = sctrl->cntlid + 1;
> +		nvme_put_ctrl_ccr(sctrl);
> +		now = jiffies;
> +	}

That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()' 
returns an error. _And_ if the CCR itself runs into a timeout we would
never have tried another path (which could have succeeded).

I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
and only switch to the next controller if the submission of CCR failed.
Once that is done we can 'just' wait for completion, as a failure there
will be after KATO timeout anyway and any subsequent CCR would be pointless.

> +
> +	dev_info(ictrl->device, "CCR reached timeout, call it done\n");
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(nvme_fence_ctrl);
> +
>   bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   		enum nvme_ctrl_state new_state)
>   {
> @@ -5119,6 +5247,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>   
>   	mutex_init(&ctrl->scan_lock);
>   	INIT_LIST_HEAD(&ctrl->namespaces);
> +	INIT_LIST_HEAD(&ctrl->ccr_list);
>   	xa_init(&ctrl->cels);
>   	ctrl->dev = dev;
>   	ctrl->ops = ops;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 00866bbc66f3..fa18f580d76a 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -279,6 +279,13 @@ enum nvme_ctrl_flags {
>   	NVME_CTRL_FROZEN		= 6,
>   };
>   
> +struct nvme_ccr_entry {
> +	struct list_head list;
> +	struct completion complete;
> +	struct nvme_ctrl *ictrl;
> +	u8 ccrs;
> +};
> +
>   struct nvme_ctrl {
>   	bool comp_seen;
>   	bool identified;
> @@ -296,6 +303,7 @@ struct nvme_ctrl {
>   	struct blk_mq_tag_set *tagset;
>   	struct blk_mq_tag_set *admin_tagset;
>   	struct list_head namespaces;
> +	struct list_head ccr_list;
>   	struct mutex namespaces_lock;
>   	struct srcu_struct srcu;
>   	struct device ctrl_device;
> @@ -814,6 +822,7 @@ blk_status_t nvme_host_path_error(struct request *req);
>   bool nvme_cancel_request(struct request *req, void *data);
>   void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
>   void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ctrl);
>   bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   		enum nvme_ctrl_state new_state);
>   int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 09/14] nvme: Implement cross-controller reset completion
  2026-01-30 22:34 ` [PATCH v2 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
@ 2026-02-03  5:22   ` Hannes Reinecke
  2026-02-03 20:07     ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:22 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> An nvme source controller that issues CCR command expects to receive an
> NVME_AER_NOTICE_CCR_COMPLETED when pending CCR succeeds or fails. Add
> sctrl->ccr_work to read NVME_LOG_CCR logpage and wakeup any thread
> waiting on CCR completion.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
>   drivers/nvme/host/nvme.h |  1 +
>   2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 13e0775d56b4..0f90feb46369 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1901,7 +1901,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
>   
>   #define NVME_AEN_SUPPORTED \
>   	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
> -	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
> +	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
> +	 NVME_AEN_CFG_DISC_CHANGE)
>   
>   static void nvme_enable_aen(struct nvme_ctrl *ctrl)
>   {
> @@ -4866,6 +4867,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
>   	kfree(log);
>   }
>   
> +static void nvme_ccr_work(struct work_struct *work)
> +{
> +	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
> +	struct nvme_ccr_entry *ccr;
> +	struct nvme_ccr_log_entry *entry;
> +	struct nvme_ccr_log *log;
> +	unsigned long flags;
> +	int ret, i;
> +
> +	log = kmalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return;
> +
> +	ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> +			   0x00, log, sizeof(*log), 0);
> +	if (ret)
> +		goto out;
> +
> +	spin_lock_irqsave(&ctrl->lock, flags);
> +	for (i = 0; i < le16_to_cpu(log->ne); i++) {
> +		entry = &log->entries[i];
> +		if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
> +			continue;
> +
> +		list_for_each_entry(ccr, &ctrl->ccr_list, list) {
> +			struct nvme_ctrl *ictrl = ccr->ictrl;
> +
> +			if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
> +			    ictrl->ciu != entry->ciu)
> +				continue;
> +
> +			/* Complete matching entry */
> +			ccr->ccrs = entry->ccrs;
> +			complete(&ccr->complete);
> +		}
> +	}
> +	spin_unlock_irqrestore(&ctrl->lock, flags);
> +out:
> +	kfree(log);
> +}
> +
>   static void nvme_fw_act_work(struct work_struct *work)
>   {
>   	struct nvme_ctrl *ctrl = container_of(work,
> @@ -4942,6 +4984,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
>   	case NVME_AER_NOTICE_DISC_CHANGED:
>   		ctrl->aen_result = result;
>   		break;
> +	case NVME_AER_NOTICE_CCR_COMPLETED:
> +		queue_work(nvme_wq, &ctrl->ccr_work);
> +		break;
>   	default:
>   		dev_warn(ctrl->device, "async event result %08x\n", result);
>   	}
> @@ -5131,6 +5176,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>   	nvme_stop_failfast_work(ctrl);
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fw_act_work);
> +	cancel_work_sync(&ctrl->ccr_work);
>   	if (ctrl->ops->stop_ctrl)
>   		ctrl->ops->stop_ctrl(ctrl);
>   }
> @@ -5254,6 +5300,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
>   	ctrl->quirks = quirks;
>   	ctrl->numa_node = NUMA_NO_NODE;
>   	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
> +	INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
>   	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
>   	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
>   	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index fa18f580d76a..a7f382e35821 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -366,6 +366,7 @@ struct nvme_ctrl {
>   	struct nvme_effects_log *effects;
>   	struct xarray cels;
>   	struct work_struct scan_work;
> +	struct work_struct ccr_work;
>   	struct work_struct async_event_work;
>   	struct delayed_work ka_work;
>   	struct delayed_work failfast_work;

This confuses me. Why do we call 'complete()' but do not have a 
corresponding 'wait_for_completion()' call?

Please merge with the next patch to allow reviewers to have the full
picture.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-01-30 22:34 ` [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-03  5:34   ` Hannes Reinecke
  2026-02-03 21:24     ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:34 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> and continue error recovery as usual. If CCR fails, the behavior depends
> on whether the subsystem supports CQT or not. If CQT is not supported
> then reset the controller immediately as if CCR succeeded in order to
> maintain the current behavior. If CQT is supported switch to time-based
> recovery. Schedule ctrl->fenced_work resets the controller when time
> based recovery finishes.
> 
> Either ctrl->err_work or ctrl->reset_work can run after a controller is
> fenced. Flush the fencing work when either work runs.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 61 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 69cb04406b47..af8d3b36a4bb 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
>   	struct sockaddr_storage src_addr;
>   	struct nvme_ctrl	ctrl;
>   
> +	struct work_struct	fencing_work;
> +	struct delayed_work	fenced_work;
>   	struct work_struct	err_work;
>   	struct delayed_work	connect_work;
>   	struct nvme_tcp_request async_req;
> @@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>   
>   static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>   {
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> +		dev_warn(ctrl->device, "starting controller fencing\n");
> +		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> +		return;
> +	}
> +

Don't you need to flush any outstanding 'fenced_work' queue items here
before calling 'queue_work()'?

>   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>   		return;
>   
> @@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
>   	nvme_tcp_reconnect_or_remove(ctrl, ret);
>   }
>   
> +static void nvme_tcp_fenced_work(struct work_struct *work)
> +{
> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> +					struct nvme_tcp_ctrl, fenced_work);
> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> +
> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}
> +
> +static void nvme_tcp_fencing_work(struct work_struct *work)
> +{
> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> +			struct nvme_tcp_ctrl, fencing_work);
> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> +	unsigned long rem;
> +
> +	rem = nvme_fence_ctrl(ctrl);
> +	if (!rem)
> +		goto done;
> +
> +	if (!ctrl->cqt) {
> +		dev_info(ctrl->device,
> +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> +		goto done;
> +	}
> +

As mentioned, cqt handling should be part of another patchset.

> +	dev_info(ctrl->device,
> +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> +		 jiffies_to_msecs(rem));
> +	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
> +	return;
> +

Why do you need the 'fenced' workqueue at all? All it does is queue yet 
another workqueue item, which certainly can be done from the 'fencing' 
workqueue directly, no?

> +done:
> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}
> +
> +static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
> +{
> +	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> +	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
> +}
> +
>   static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   {
>   	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
>   				struct nvme_tcp_ctrl, err_work);
>   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>   
> +	nvme_tcp_flush_fencing_work(ctrl);

Why not 'fenced_work' ?

>   	if (nvme_tcp_key_revoke_needed(ctrl))
>   		nvme_auth_revoke_tls_key(ctrl);
>   	nvme_stop_keep_alive(ctrl);
> @@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>   		container_of(work, struct nvme_ctrl, reset_work);
>   	int ret;
>   
> +	nvme_tcp_flush_fencing_work(ctrl);

Same.

>   	if (nvme_tcp_key_revoke_needed(ctrl))
>   		nvme_auth_revoke_tls_key(ctrl);
>   	nvme_stop_ctrl(ctrl);
> @@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>   	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
>   	struct nvme_command *cmd = &pdu->cmd;
>   	int qid = nvme_tcp_queue_id(req->queue);
> +	enum nvme_ctrl_state state;
>   
>   	dev_warn(ctrl->device,
>   		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
>   		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
>   		 nvme_fabrics_opcode_str(qid, cmd), qid);
>   
> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> +	state = nvme_ctrl_state(ctrl);
> +	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {

'FENCED' too, presumably?

>   		/*
>   		 * If we are resetting, connecting or deleting we should
>   		 * complete immediately because we may block controller
> @@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
>   
>   	INIT_DELAYED_WORK(&ctrl->connect_work,
>   			nvme_tcp_reconnect_ctrl_work);
> +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
> +	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
>   	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
>   	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
>   

Here you are calling CCR whenever error recovery is triggered.
This will cause CCR to be sent from a command timeout, which is
technically wrong (CCR should be sent when the KATO timeout expires,
not when a command timeout expires). Both could be vastly different.

So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
leave the current command timeout mechanism in place.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 11/14] nvme-rdma: Use CCR to recover controller that hits an error
  2026-01-30 22:34 ` [PATCH v2 11/14] nvme-rdma: " Mohamed Khalfella
@ 2026-02-03  5:35   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:35 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> and continue error recovery as usual. If CCR fails, the behavior depends
> on whether the subsystem supports CQT or not. If CQT is not supported
> then reset the controller immediately as if CCR succeeded in order to
> maintain the current behavior. If CQT is supported switch to time-based
> recovery. Scheduled ctrl->fenced_work resets the controller when
> time-based recovery finishes.
> 
> Either ctrl->err_work or ctrl->reset_work can run after a controller is
> fenced. Flush the fencing work when either work runs.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/rdma.c | 62 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 61 insertions(+), 1 deletion(-)
> 
The rdma driver is largely similar to the tcp one, so comments there
apply to both, one would guess.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-01-30 22:34 ` [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
@ 2026-02-03  5:40   ` Hannes Reinecke
  2026-02-03 21:29     ` Mohamed Khalfella
  2026-02-03 19:19   ` James Smart
  1 sibling, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:40 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> in CONNECTING state results in deadlock reported in link below. Update
> nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
> 
> Prior to this change, if the controller was LIVE, error recovery reset
> the controller, which does not match nvme-tcp and nvme-rdma. Decouple
> error recovery from controller reset to match other fabric transports.
> 
> Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
>   1 file changed, 41 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 6948de3f438a..f8f6071b78ed 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
>   static struct device *fc_udev_device;
>   
>   static void nvme_fc_complete_rq(struct request *rq);
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> +					 char *errmsg);
>   
>   /* *********************** FC-NVME Port Management ************************ */
>   
> @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>   		"Reconnect", ctrl->cnum);
>   
>   	set_bit(ASSOC_FAILED, &ctrl->flags);
> -	nvme_reset_ctrl(&ctrl->ctrl);
> +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
>   }
>   
>   /**
> @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>   static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
>   static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
>   
> -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
>   
>   static void
>   __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
>   	 * for the association have been ABTS'd by
>   	 * nvme_fc_delete_association().
>   	 */
> -
> -	/* fail the association */
> -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> +	nvme_fc_start_ioerr_recovery(ctrl,
> +				     "Disconnect Association LS received");
>   
>   	/* release the reference taken by nvme_fc_match_disconn_ls() */
>   	nvme_fc_ctrl_put(ctrl);
> @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
>   	struct nvme_fc_ctrl *ctrl =
>   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
>   
> -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> +	nvme_fc_error_recovery(ctrl);
>   }
>   
>   /*
> @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
>   }
>   EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
>   
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> +					 char *errmsg)
> +{
> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> +		return;
> +
> +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> +		 ctrl->cnum, errmsg);
> +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> +}
> +
>   static void
>   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>   {
> @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>   		nvme_fc_complete_rq(rq);
>   
>   check_error:
> -	if (terminate_assoc &&
> -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> +	if (terminate_assoc)
> +		nvme_fc_start_ioerr_recovery(ctrl, "io error");
>   }
>   
>   static int
> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
>   		nvme_unquiesce_admin_queue(&ctrl->ctrl);
>   }
>   
> -static void
> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> -{
> -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> -
> -	/*
> -	 * if an error (io timeout, etc) while (re)connecting, the remote
> -	 * port requested terminating of the association (disconnect_ls)
> -	 * or an error (timeout or abort) occurred on an io while creating
> -	 * the controller.  Abort any ios on the association and let the
> -	 * create_association error path resolve things.
> -	 */
> -	if (state == NVME_CTRL_CONNECTING) {
> -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> -		dev_warn(ctrl->ctrl.device,
> -			"NVME-FC{%d}: transport error during (re)connect\n",
> -			ctrl->cnum);
> -		return;
> -	}
> -
> -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> -	if (state != NVME_CTRL_LIVE)
> -		return;
> -
> -	dev_warn(ctrl->ctrl.device,
> -		"NVME-FC{%d}: transport association event: %s\n",
> -		ctrl->cnum, errmsg);
> -	dev_warn(ctrl->ctrl.device,
> -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> -
> -	nvme_reset_ctrl(&ctrl->ctrl);
> -}
> -
>   static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>   {
>   	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>   	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
>   	struct nvme_command *sqe = &cmdiu->sqe;
>   
> -	/*
> -	 * Attempt to abort the offending command. Command completion
> -	 * will detect the aborted io and will fail the connection.
> -	 */
>   	dev_info(ctrl->ctrl.device,
>   		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
>   		"x%08x/x%08x\n",
>   		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
>   		nvme_fabrics_opcode_str(qnum, sqe),
>   		sqe->common.cdw10, sqe->common.cdw11);
> -	if (__nvme_fc_abort_op(ctrl, op))
> -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>   
> -	/*
> -	 * the io abort has been initiated. Have the reset timer
> -	 * restarted and the abort completion will complete the io
> -	 * shortly. Avoids a synchronous wait while the abort finishes.
> -	 */
> +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
>   	return BLK_EH_RESET_TIMER;
>   }
>   
> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
>   	}
>   }
>   
> +static void
> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> +{
> +	nvme_stop_keep_alive(&ctrl->ctrl);
> +	nvme_stop_ctrl(&ctrl->ctrl);
> +
> +	/* will block while waiting for io to terminate */
> +	nvme_fc_delete_association(ctrl);
> +
> +	/* Do not reconnect if controller is being deleted */
> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> +		return;
> +
> +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> +		return;
> +	}
> +
> +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> +}
>   
>   static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
>   	.name			= "fc",

I really don't get it. Why do you need to do additional steps here, when
all you do is split an existing function in half?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error
  2026-01-30 22:34 ` [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
@ 2026-02-03  5:43   ` Hannes Reinecke
  2026-02-10 22:12   ` James Smart
  1 sibling, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-03  5:43 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 1/30/26 23:34, Mohamed Khalfella wrote:
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> and continue error recovery as usual. If CCR fails, the behavior depends
> on whether the subsystem supports CQT or not. If CQT is not supported
> then reset the controller immediately as if CCR succeeded in order to
> maintain the current behavior. If CQT is supported switch to time-based
> recovery. Scheduled ctrl->fenced_work resets the controller when
> time-based recovery finishes.
> 
> Either ctrl->err_work or ctrl->reset_work can run after a controller is
> fenced. Flush the fencing work when either work runs.
> 
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/fc.c | 60 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 60 insertions(+)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index f8f6071b78ed..3a01aeb39081 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -166,6 +166,8 @@ struct nvme_fc_ctrl {
>   	struct blk_mq_tag_set	admin_tag_set;
>   	struct blk_mq_tag_set	tag_set;
>   
> +	struct work_struct	fencing_work;
> +	struct delayed_work	fenced_work;
>   	struct work_struct	ioerr_work;
>   	struct delayed_work	connect_work;
>   
> @@ -1866,12 +1868,59 @@ __nvme_fc_fcpop_chk_teardowns(struct nvme_fc_ctrl *ctrl,
>   	}
>   }
>   
> +static void nvme_fc_fenced_work(struct work_struct *work)
> +{
> +	struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> +			struct nvme_fc_ctrl, fenced_work);
> +	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> +
> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> +		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> +}
> +
> +static void nvme_fc_fencing_work(struct work_struct *work)
> +{
> +	struct nvme_fc_ctrl *fc_ctrl =
> +			container_of(work, struct nvme_fc_ctrl, fencing_work);
> +	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> +	unsigned long rem;
> +
> +	rem = nvme_fence_ctrl(ctrl);
> +	if (!rem)
> +		goto done;
> +
> +	if (!ctrl->cqt) {
> +		dev_info(ctrl->device,
> +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> +		goto done;
> +	}
> +
> +	dev_info(ctrl->device,
> +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> +		 jiffies_to_msecs(rem));
> +	queue_delayed_work(nvme_wq, &fc_ctrl->fenced_work, rem);
> +	return;
> +

Same comments re fenced workqueue apply here.

> +done:
> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> +		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> +}
> +
> +static void nvme_fc_flush_fencing_work(struct nvme_fc_ctrl *ctrl)
> +{
> +	flush_work(&ctrl->fencing_work);
> +	flush_delayed_work(&ctrl->fenced_work);
> +}
> +
>   static void
>   nvme_fc_ctrl_ioerr_work(struct work_struct *work)
>   {
>   	struct nvme_fc_ctrl *ctrl =
>   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
>   
> +	nvme_fc_flush_fencing_work(ctrl);
>   	nvme_fc_error_recovery(ctrl);
>   }
>   
> @@ -1896,6 +1945,14 @@ EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
>   static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
>   					 char *errmsg)
>   {
> +	if (nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_FENCING)) {
> +		dev_warn(ctrl->ctrl.device,
> +			 "NVME-FC{%d}: starting controller fencing %s\n",
> +			 ctrl->cnum, errmsg);
> +		queue_work(nvme_wq, &ctrl->fencing_work);
> +		return;
> +	}
> +
>   	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>   		return;
>   
> @@ -3297,6 +3354,7 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
>   	struct nvme_fc_ctrl *ctrl =
>   		container_of(work, struct nvme_fc_ctrl, ctrl.reset_work);
>   
> +	nvme_fc_flush_fencing_work(ctrl);
>   	nvme_stop_ctrl(&ctrl->ctrl);
>   
>   	/* will block will waiting for io to terminate */
> @@ -3471,6 +3529,8 @@ nvme_fc_alloc_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
>   
>   	INIT_WORK(&ctrl->ctrl.reset_work, nvme_fc_reset_ctrl_work);
>   	INIT_DELAYED_WORK(&ctrl->connect_work, nvme_fc_connect_ctrl_work);
> +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_fc_fenced_work);
> +	INIT_WORK(&ctrl->fencing_work, nvme_fc_fencing_work);
>   	INIT_WORK(&ctrl->ioerr_work, nvme_fc_ctrl_ioerr_work);
>   	spin_lock_init(&ctrl->lock);
>   

And, of course, the same comments about sending CCR in response to a
KATO timeout and not a command timeout apply.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-02-03  3:03   ` Hannes Reinecke
@ 2026-02-03 18:14     ` Mohamed Khalfella
  2026-02-04  0:34       ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 18:14 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 04:03:22 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > TP8028 Rapid Path Failure Recovery defines new fields in the
> > controller identify response. The newly defined fields are:
> > 
> > - CIU (Controller Instance Uniquifier): an 8-bit non-zero value that
> > is assigned a random value when the controller is first created. The
> > value is incremented each time the RDY bit in the CSTS register is
> > asserted.
> > - CIRN (Controller Instance Random Number): a 64-bit random value that
> > is generated when the controller is created. CIRN is regenerated every
> > time the RDY bit in the CSTS register is asserted.
> > - CCRL (Cross-Controller Reset Limit): an 8-bit value that defines the
> > maximum number of in-progress cross-controller reset operations. CCRL
> > is hardcoded to 4 as recommended by TP8028.
> > 
> > TP4129 KATO Corrections and Clarifications defines CQT (Command Quiesce
> > Time), which is used along with KATO (Keep Alive Timeout) to set an
> > upper time limit for attempting Cross-Controller Recovery. For an NVMe
> > subsystem, CQT is set to 0 by default to keep the current behavior. The
> > value can be set from configfs if needed.
> > 
> > Make the new fields available for IO controllers only since TP8028 is
> > not very useful for discovery controllers.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/target/admin-cmd.c |  6 ++++++
> >   drivers/nvme/target/configfs.c  | 31 +++++++++++++++++++++++++++++++
> >   drivers/nvme/target/core.c      | 12 ++++++++++++
> >   drivers/nvme/target/nvmet.h     |  4 ++++
> >   include/linux/nvme.h            | 15 ++++++++++++---
> >   5 files changed, 65 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
> > index 3da31bb1183e..ade1145df72d 100644
> > --- a/drivers/nvme/target/admin-cmd.c
> > +++ b/drivers/nvme/target/admin-cmd.c
> > @@ -696,6 +696,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
> >   
> >   	id->cntlid = cpu_to_le16(ctrl->cntlid);
> >   	id->ver = cpu_to_le32(ctrl->subsys->ver);
> > +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> > +		id->cqt = cpu_to_le16(ctrl->cqt);
> > +		id->ciu = ctrl->ciu;
> > +		id->cirn = cpu_to_le64(ctrl->cirn);
> > +		id->ccrl = NVMF_CCR_LIMIT;
> > +	}
> >   
> >   	/* XXX: figure out what to do about RTD3R/RTD3 */
> >   	id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
> > diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> > index e44ef69dffc2..035f6e75a818 100644
> > --- a/drivers/nvme/target/configfs.c
> > +++ b/drivers/nvme/target/configfs.c
> > @@ -1636,6 +1636,36 @@ static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
> >   CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
> >   #endif
> >   
> > +static ssize_t nvmet_subsys_attr_cqt_show(struct config_item *item,
> > +					  char *page)
> > +{
> > +	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->cqt);
> > +}
> > +
> > +static ssize_t nvmet_subsys_attr_cqt_store(struct config_item *item,
> > +					   const char *page, size_t cnt)
> > +{
> > +	struct nvmet_subsys *subsys = to_subsys(item);
> > +	struct nvmet_ctrl *ctrl;
> > +	u16 cqt;
> > +
> > +	if (sscanf(page, "%hu\n", &cqt) != 1)
> > +		return -EINVAL;
> > +
> > +	down_write(&nvmet_config_sem);
> > +	if (subsys->cqt == cqt)
> > +		goto out;
> > +
> > +	subsys->cqt = cqt;
> > +	/* Force reconnect */
> > +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
> > +		ctrl->ops->delete_ctrl(ctrl);
> > +out:
> > +	up_write(&nvmet_config_sem);
> > +	return cnt;
> > +}
> > +CONFIGFS_ATTR(nvmet_subsys_, attr_cqt);
> > +
> >   static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
> >   					      char *page)
> >   {
> > @@ -1676,6 +1706,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = {
> >   	&nvmet_subsys_attr_attr_vendor_id,
> >   	&nvmet_subsys_attr_attr_subsys_vendor_id,
> >   	&nvmet_subsys_attr_attr_model,
> > +	&nvmet_subsys_attr_attr_cqt,
> >   	&nvmet_subsys_attr_attr_qid_max,
> >   	&nvmet_subsys_attr_attr_ieee_oui,
> >   	&nvmet_subsys_attr_attr_firmware,
> 
> I do think that TP8028 (ie the CQT defintions) are somewhat independent
> on CCR. So I'm not sure if they should be integrated in this patchset;
> personally I would prefer to have it moved to another patchset.

Agreed that CQT is not directly related to CCR from the target
perspective. But there is a relationship in how the initiator uses CQT
to calculate the time budget for CCR. As you know, on the host side, if
CCR fails and CQT is supported, the requests need to be held for a
certain amount of time before they are retried. So the CQT value is
needed, and that is why I included it in this patchset.

> 
> > diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> > index cc88e5a28c8a..0d2a1206e08f 100644
> > --- a/drivers/nvme/target/core.c
> > +++ b/drivers/nvme/target/core.c
> > @@ -1393,6 +1393,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
> >   		return;
> >   	}
> >   
> > +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> > +		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
> > +		ctrl->cirn = get_random_u64();
> > +	}
> >   	ctrl->csts = NVME_CSTS_RDY;
> >   
> >   	/*
> > @@ -1661,6 +1665,12 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >   	}
> >   	ctrl->cntlid = ret;
> >   
> > +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
> > +		ctrl->cqt = subsys->cqt;
> > +		ctrl->ciu = get_random_u8() ? : 1;
> > +		ctrl->cirn = get_random_u64();
> > +	}
> > +
> >   	/*
> >   	 * Discovery controllers may use some arbitrary high value
> >   	 * in order to cleanup stale discovery sessions
> > @@ -1853,10 +1863,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn,
> >   
> >   	switch (type) {
> >   	case NVME_NQN_NVME:
> > +		subsys->cqt = NVMF_CQT_MS;
> >   		subsys->max_qid = NVMET_NR_QUEUES;
> >   		break;
> 
> And I would not set the CQT default here.
> Thing is, implementing CQT to the letter would inflict a CQT delay
> during failover for _every_ installation, thereby resulting in a
> regression to previous implementations where we would fail over
> with _no_ delay.
> So again, we should make it a different patchset.

CQT defaults to 0 to avoid introducing a surprise delay. The initiator
will skip holding requests if it sees CQT set to 0.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-03  3:19   ` Hannes Reinecke
@ 2026-02-03 18:40     ` Mohamed Khalfella
  2026-02-04  0:38       ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 18:40 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> >   	return ctrl;
> >   }
> >   
> > +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> > +					   const char *hostnqn, u8 ciu,
> > +					   u16 cntlid, u64 cirn)
> > +{
> > +	struct nvmet_ctrl *ctrl;
> > +	bool found = false;
> > +
> > +	mutex_lock(&subsys->lock);
> > +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > +		if (ctrl->cntlid != cntlid)
> > +			continue;
> > +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> > +			continue;
> > +
> Why do we compare the hostnqn here, too? To my understanding the host 
> NQN is tied to the controller, so the controller ID should be sufficient
> here.

We get the cntlid from the CCR nvme command and we do not trust the value
sent by the host. We check hostnqn to confirm that the host is actually
connected to the impacted controller. A host should not be allowed to
reset a controller connected to another host.

> > @@ -1740,12 +1788,35 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >   }
> >   EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >   
> > +static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> > +{
> > +	struct nvmet_subsys *subsys = ctrl->subsys;
> > +	struct nvmet_ctrl *sctrl;
> > +	struct nvmet_ccr *ccr;
> > +
> > +	mutex_lock(&ctrl->lock);
> > +	nvmet_ctrl_cleanup_ccrs(ctrl, true);
> > +	mutex_unlock(&ctrl->lock);
> > +
> > +	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> > +		mutex_lock(&sctrl->lock);
> > +		list_for_each_entry(ccr, &sctrl->ccr_list, entry) {
> > +			if (ccr->ctrl == ctrl) {
> > +				ccr->ctrl = NULL;
> > +				break;
> > +			}
> > +		}
> > +		mutex_unlock(&sctrl->lock);
> > +	}
> > +}
> > +
> 
> Maybe add documentation here that the first CCR cleanup is for clearing
> CCRs issued from this controller, and the second is for CCRs issued _to_
> this controller.
> 

Good point. Will do that.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-02-03  3:27   ` Hannes Reinecke
@ 2026-02-03 18:48     ` Mohamed Khalfella
  2026-02-04  0:43       ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 18:48 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 04:27:39 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > Send an AEN to the initiator when the impacted controller exits. The
> > notification points to the CCR log page that the initiator can read
> > to check which CCR operation completed.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/target/core.c  | 25 ++++++++++++++++++++++---
> >   drivers/nvme/target/nvmet.h |  3 ++-
> >   include/linux/nvme.h        |  3 +++
> >   3 files changed, 27 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> > index 54dd0dcfa12b..ae2fe9f90bcd 100644
> > --- a/drivers/nvme/target/core.c
> > +++ b/drivers/nvme/target/core.c
> > @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
> >   	nvmet_async_events_process(ctrl);
> >   }
> >   
> > -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> > +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
> >   		u8 event_info, u8 log_page)
> >   {
> >   	struct nvmet_async_event *aen;
> > @@ -215,13 +215,19 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> >   	aen->event_info = event_info;
> >   	aen->log_page = log_page;
> >   
> > -	mutex_lock(&ctrl->lock);
> >   	list_add_tail(&aen->entry, &ctrl->async_events);
> > -	mutex_unlock(&ctrl->lock);
> >   
> >   	queue_work(nvmet_wq, &ctrl->async_event_work);
> >   }
> >   
> > +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
> > +		u8 event_info, u8 log_page)
> > +{
> > +	mutex_lock(&ctrl->lock);
> > +	nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
> > +	mutex_unlock(&ctrl->lock);
> > +}
> > +
> >   static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> >   {
> >   	u32 i;
> > @@ -1788,6 +1794,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
> >   }
> >   EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
> >   
> > +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
> > +{
> > +	lockdep_assert_held(&ctrl->lock);
> > +
> > +	if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
> > +		return;
> > +
> > +	nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
> > +				     NVME_AER_NOTICE_CCR_COMPLETED,
> > +				     NVME_LOG_CCR);
> > +}
> > +
> >   static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
> >   {
> >   	struct nvmet_subsys *subsys = ctrl->subsys;
> 
> But what does the CCR command actually _do_?
> At the very least I would have expected it to trigger a controller reset
> (eg calling into nvmet_ctrl_fatal_error()), yet I don't see it doing
> that anywhere ...

[PATCH v2 03/14] nvmet: Implement CCR nvme command
is where the impacted controller is told to fail. It does exactly what
you mentioned above.

+out_unlock:
+       mutex_unlock(&sctrl->lock);
+       if (status == NVME_SC_SUCCESS)
+               nvmet_ctrl_fatal_error(ictrl);
+       nvmet_ctrl_put(ictrl);
+out:
+       nvmet_req_complete(req, status);

I folded the error-handling codepath into the success codepath, which is
why I think it is somewhat hidden. If this is not obvious I can separate
the two codepaths. What do you think?


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states
  2026-02-03  5:07   ` Hannes Reinecke
@ 2026-02-03 19:13     ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 19:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 06:07:35 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > FENCING is a new controller state that a LIVE controller enters when an
> > error is encountered. While in the FENCING state, inflight IOs that
> > time out are not canceled because they should be held until either CCR
> > succeeds or time-based recovery completes. While the queues remain
> > alive, requests are not allowed to be sent in this state and the
> > controller cannot be reset or deleted. This is intentional because
> > resetting or deleting the controller results in canceling inflight IOs.
> > 
> > FENCED is a short-term state the controller enters before it is reset.
> > It exists only to prevent manual resets from happening while the
> > controller is in the FENCING state.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/core.c  | 25 +++++++++++++++++++++++--
> >   drivers/nvme/host/nvme.h  |  4 ++++
> >   drivers/nvme/host/sysfs.c |  2 ++
> >   3 files changed, 29 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 8961d612ccb0..3e1e02822dd4 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -574,10 +574,29 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >   			break;
> >   		}
> >   		break;
> > +	case NVME_CTRL_FENCING:
> > +		switch (old_state) {
> > +		case NVME_CTRL_LIVE:
> > +			changed = true;
> > +			fallthrough;
> > +		default:
> > +			break;
> > +		}
> > +		break;
> > +	case NVME_CTRL_FENCED:
> > +		switch (old_state) {
> > +		case NVME_CTRL_FENCING:
> > +			changed = true;
> > +			fallthrough;
> > +		default:
> > +			break;
> > +		}
> > +		break;
> >   	case NVME_CTRL_RESETTING:
> >   		switch (old_state) {
> >   		case NVME_CTRL_NEW:
> >   		case NVME_CTRL_LIVE:
> > +		case NVME_CTRL_FENCED:
> >   			changed = true;
> >   			fallthrough;
> >   		default:
> > @@ -760,6 +779,7 @@ blk_status_t nvme_fail_nonready_command(struct nvme_ctrl *ctrl,
> >   
> >   	if (state != NVME_CTRL_DELETING_NOIO &&
> >   	    state != NVME_CTRL_DELETING &&
> > +	    state != NVME_CTRL_FENCING &&
> 
> Shouldn't 'FENCED' be in here, too?

Agreed. Will add FENCED to the two places.

> 
> >   	    state != NVME_CTRL_DEAD &&
> >   	    !test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags) &&
> >   	    !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
> > @@ -802,10 +822,11 @@ bool __nvme_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
> >   			     req->cmd->fabrics.fctype == nvme_fabrics_type_auth_receive))
> >   				return true;
> >   			break;
> > -		default:
> > -			break;
> > +		case NVME_CTRL_FENCING:
> 
> Similar here.
> 
> >   		case NVME_CTRL_DEAD:
> >   			return false;
> > +		default:
> > +			break;
> >   		}
> >   	}
> >   
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index 9dd9f179ad88..00866bbc66f3 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -251,6 +251,8 @@ static inline u16 nvme_req_qid(struct request *req)
> >   enum nvme_ctrl_state {
> >   	NVME_CTRL_NEW,
> >   	NVME_CTRL_LIVE,
> > +	NVME_CTRL_FENCING,
> > +	NVME_CTRL_FENCED,
> >   	NVME_CTRL_RESETTING,
> >   	NVME_CTRL_CONNECTING,
> >   	NVME_CTRL_DELETING,
> > @@ -777,6 +779,8 @@ static inline bool nvme_state_terminal(struct nvme_ctrl *ctrl)
> >   	switch (nvme_ctrl_state(ctrl)) {
> >   	case NVME_CTRL_NEW:
> >   	case NVME_CTRL_LIVE:
> > +	case NVME_CTRL_FENCING:
> > +	case NVME_CTRL_FENCED:
> >   	case NVME_CTRL_RESETTING:
> >   	case NVME_CTRL_CONNECTING:
> >   		return false;
> > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > index f81bbb6ec768..4ec9dfeb736e 100644
> > --- a/drivers/nvme/host/sysfs.c
> > +++ b/drivers/nvme/host/sysfs.c
> > @@ -443,6 +443,8 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
> >   	static const char *const state_name[] = {
> >   		[NVME_CTRL_NEW]		= "new",
> >   		[NVME_CTRL_LIVE]	= "live",
> > +		[NVME_CTRL_FENCING]	= "fencing",
> > +		[NVME_CTRL_FENCED]	= "fenced",
> >   		[NVME_CTRL_RESETTING]	= "resetting",
> >   		[NVME_CTRL_CONNECTING]	= "connecting",
> >   		[NVME_CTRL_DELETING]	= "deleting",
> 
> You need to modify nvme-tcp.c:nvme_tcp_timeout() too, as this checks
> 'just' for 'LIVE' state and will abort/terminate commands when in
> FENCING. Similar argument for nvme-rdma.c. And nvme-fc.c also needs an
> audit to ensure it works correctly.

Exactly. The changes to nvme-tcp, nvme-rdma, and nvme-fc are in
transport-specific patches. For tcp and rdma the timeout callback
handler has been modified to do what you mentioned.

For nvme-fc, nvme_fc_start_ioerr_recovery() does nothing if the
controller is in the FENCING state.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-01-30 22:34 ` [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
  2026-02-03  5:40   ` Hannes Reinecke
@ 2026-02-03 19:19   ` James Smart
  2026-02-03 22:49     ` James Smart
  2026-02-04  0:11     ` Mohamed Khalfella
  1 sibling, 2 replies; 82+ messages in thread
From: James Smart @ 2026-02-03 19:19 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, jsmart833426

On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> in CONNECTING state results in deadlock reported in link below. Update
> nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
> 
> Previous to this change if controller was LIVE error recovery resets
> the controller and this does not match nvme-tcp and nvme-rdma.

It is not intended to match tcp/rdma. Using the reset path was done to 
avoid code duplication of paths to teardown the association.  FC, given 
we interact with an HBA for device and io state and have a lot of async 
io completions, requires a lot more work than straight data structure 
teardown in rdma/tcp.

I agree with wanting to changeup the execution thread for the deadlock.


> Decouple error recovery from controller reset to match other fabric
> transports.
> Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>   drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
>   1 file changed, 41 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index 6948de3f438a..f8f6071b78ed 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
>   static struct device *fc_udev_device;
>   
>   static void nvme_fc_complete_rq(struct request *rq);
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> +					 char *errmsg);
>   
>   /* *********************** FC-NVME Port Management ************************ */
>   
> @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>   		"Reconnect", ctrl->cnum);
>   
>   	set_bit(ASSOC_FAILED, &ctrl->flags);
> -	nvme_reset_ctrl(&ctrl->ctrl);
> +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
>   }
>   
>   /**
> @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
>   static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
>   static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
>   
> -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
>   
>   static void
>   __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
>   	 * for the association have been ABTS'd by
>   	 * nvme_fc_delete_association().
>   	 */
> -
> -	/* fail the association */
> -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> +	nvme_fc_start_ioerr_recovery(ctrl,
> +				     "Disconnect Association LS received");
>   
>   	/* release the reference taken by nvme_fc_match_disconn_ls() */
>   	nvme_fc_ctrl_put(ctrl);
> @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
>   	struct nvme_fc_ctrl *ctrl =
>   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
>   
> -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> +	nvme_fc_error_recovery(ctrl);

hmm.. not sure how I feel about this. There is at least a break in reset 
processing that is no longer present - e.g. prior queued ioerr_work, 
which would then queue reset_work. This effectively calls the reset_work 
handler directly. I assume it should be ok.

>   }
>   
>   /*
> @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
>   }
>   EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
>   
> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> +					 char *errmsg)
> +{
> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> +		return;
> +
> +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> +		 ctrl->cnum, errmsg);
> +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> +}
> +

Disagree with this.

The clause in error_recovery around the CONNECTING state is pretty 
important to terminate io occurring during connect/reconnect where the 
ctrl state should not change. we don't want start_ioerr making it RESETTING.

This should be reworked.

>   static void
>   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>   {
> @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>   		nvme_fc_complete_rq(rq);
>   
>   check_error:
> -	if (terminate_assoc &&
> -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> +	if (terminate_assoc)
> +		nvme_fc_start_ioerr_recovery(ctrl, "io error");

this is ok. the ioerr_recovery will bounce the RESETTING state if it's 
already in the state. So this is a little cleaner.

>   }
>   
>   static int
> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
>   		nvme_unquiesce_admin_queue(&ctrl->ctrl);
>   }
>   
> -static void
> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> -{
> -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> -
> -	/*
> -	 * if an error (io timeout, etc) while (re)connecting, the remote
> -	 * port requested terminating of the association (disconnect_ls)
> -	 * or an error (timeout or abort) occurred on an io while creating
> -	 * the controller.  Abort any ios on the association and let the
> -	 * create_association error path resolve things.
> -	 */
> -	if (state == NVME_CTRL_CONNECTING) {
> -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> -		dev_warn(ctrl->ctrl.device,
> -			"NVME-FC{%d}: transport error during (re)connect\n",
> -			ctrl->cnum);
> -		return;
> -	}

This logic needs to be preserved. It's no longer part of
nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
"fenced". They should fail immediately.

> -
> -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> -	if (state != NVME_CTRL_LIVE)
> -		return;

This was to filter out multiple requests of the reset. I guess that is
what happens now in start_ioerr when attempting to set the state to
RESETTING while it is already RESETTING.

There is a small difference here in that the existing code avoids doing
the ctrl reset if the controller is NEW. start_ioerr will change the
ctrl to RESETTING. I'm not sure how much of an impact that has.


> -
> -	dev_warn(ctrl->ctrl.device,
> -		"NVME-FC{%d}: transport association event: %s\n",
> -		ctrl->cnum, errmsg);
> -	dev_warn(ctrl->ctrl.device,
> -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);

I haven't paid much attention, but keeping the transport messages for 
these cases is very very useful for diagnosis.

> -
> -	nvme_reset_ctrl(&ctrl->ctrl);
> -}
> -
>   static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>   {
>   	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>   	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
>   	struct nvme_command *sqe = &cmdiu->sqe;
>   
> -	/*
> -	 * Attempt to abort the offending command. Command completion
> -	 * will detect the aborted io and will fail the connection.
> -	 */
>   	dev_info(ctrl->ctrl.device,
>   		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
>   		"x%08x/x%08x\n",
>   		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
>   		nvme_fabrics_opcode_str(qnum, sqe),
>   		sqe->common.cdw10, sqe->common.cdw11);
> -	if (__nvme_fc_abort_op(ctrl, op))
> -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>   
> -	/*
> -	 * the io abort has been initiated. Have the reset timer
> -	 * restarted and the abort completion will complete the io
> -	 * shortly. Avoids a synchronous wait while the abort finishes.
> -	 */
> +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");

Why get rid of the abort logic?
Note: the error recovery/controller reset is only called when the abort
failed.

I believe you should continue to abort the op.  The fence logic will 
kick in when the op completes later (along with other io completions). 
If nothing else, it allows a hw resource to be freed up.


>   	return BLK_EH_RESET_TIMER;
>   }
>   
> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
>   	}
>   }
>   
> +static void
> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> +{
> +	nvme_stop_keep_alive(&ctrl->ctrl);

Curious, why did the stop_keep_alive() call get added to this ?
Doesn't hurt.

I assume it was due to other transports having it as they originally 
were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't
this be followed by flush_work(&ctrl->ctrl.async_event_work)?

> +	nvme_stop_ctrl(&ctrl->ctrl);
> +
> +	/* will block while waiting for io to terminate */
> +	nvme_fc_delete_association(ctrl);
> +
> +	/* Do not reconnect if controller is being deleted */
> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> +		return;
> +
> +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> +		return;
> +	}
> +
> +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> +}

This code and that in nvme_fc_reset_ctrl_work() need to be collapsed
into a common helper function invoked by the two routines. That also
addresses the missing flush_delayed_work() in this routine.

>   
>   static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
>   	.name			= "fc",


-- James

(new email address. can always reach me at james.smart@broadcom.com as well)



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-03  5:19   ` Hannes Reinecke
@ 2026-02-03 20:00     ` Mohamed Khalfella
  2026-02-04  1:10       ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 20:00 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > A host that has more than one path connecting to an nvme subsystem
> > typically has an nvme controller associated with every path. This is
> > mostly applicable to nvmeof. If one path goes down, inflight IOs on that
> > path should not be retried immediately on another path because this
> > could lead to data corruption as described in TP4129. TP8028 defines
> > cross-controller reset mechanism that can be used by host to terminate
> > IOs on the failed path using one of the remaining healthy paths. Only
> > after IOs are terminated, or long enough time passes as defined by
> > TP4129, inflight IOs should be retried on another path. Implement core
> > cross-controller reset shared logic to be used by the transports.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/constants.c |   1 +
> >   drivers/nvme/host/core.c      | 129 ++++++++++++++++++++++++++++++++++
> >   drivers/nvme/host/nvme.h      |   9 +++
> >   3 files changed, 139 insertions(+)
> > 
> > diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
> > index dc90df9e13a2..f679efd5110e 100644
> > --- a/drivers/nvme/host/constants.c
> > +++ b/drivers/nvme/host/constants.c
> > @@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
> >   	[nvme_admin_virtual_mgmt] = "Virtual Management",
> >   	[nvme_admin_nvme_mi_send] = "NVMe Send MI",
> >   	[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
> > +	[nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
> >   	[nvme_admin_dbbuf] = "Doorbell Buffer Config",
> >   	[nvme_admin_format_nvm] = "Format NVM",
> >   	[nvme_admin_security_send] = "Security Send",
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 3e1e02822dd4..13e0775d56b4 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -554,6 +554,134 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
> >   }
> >   EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
> >   
> > +static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
> > +					    u32 min_cntlid)
> > +{
> > +	struct nvme_subsystem *subsys = ictrl->subsys;
> > +	struct nvme_ctrl *sctrl;
> > +	unsigned long flags;
> > +
> > +	mutex_lock(&nvme_subsystems_lock);
> > +	list_for_each_entry(sctrl, &subsys->ctrls, subsys_entry) {
> > +		if (sctrl->cntlid < min_cntlid)
> > +			continue;
> > +
> > +		if (atomic_dec_if_positive(&sctrl->ccr_limit) < 0)
> > +			continue;
> > +
> > +		spin_lock_irqsave(&sctrl->lock, flags);
> > +		if (sctrl->state != NVME_CTRL_LIVE) {
> > +			spin_unlock_irqrestore(&sctrl->lock, flags);
> > +			atomic_inc(&sctrl->ccr_limit);
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * We got a good candidate source controller that is locked and
> > +		 * LIVE. However, no guarantee sctrl will not be deleted after
> > +		 * sctrl->lock is released. Get a ref of both sctrl and admin_q
> > +		 * so they do not disappear until we are done with them.
> > +		 */
> > +		WARN_ON_ONCE(!blk_get_queue(sctrl->admin_q));
> > +		nvme_get_ctrl(sctrl);
> > +		spin_unlock_irqrestore(&sctrl->lock, flags);
> > +		goto found;
> > +	}
> > +	sctrl = NULL;
> > +found:
> 
> Normally one would be using a temporary loop variable and assign 'sctrl' 
> to that one if found. Then you can just call 'break' and drop the 'goto'.

Got it. I did as you suggested. It looks cleaner this way.
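The pattern in question, reduced to a standalone sketch (plain C,
hypothetical names, not the kernel code): scan with a temporary loop
variable, assign the result on a match, and break, rather than jumping
to a found: label.

```c
#include <stddef.h>

struct ctrl {
	int cntlid;
	int live;	/* stands in for the NVME_CTRL_LIVE check */
};

/* Return the first live controller with cntlid >= min_cntlid,
 * or NULL if none qualifies. */
static struct ctrl *find_ctrl(struct ctrl *ctrls, int n, int min_cntlid)
{
	struct ctrl *found = NULL;

	for (int i = 0; i < n; i++) {
		struct ctrl *c = &ctrls[i];

		if (c->cntlid < min_cntlid || !c->live)
			continue;

		found = c;	/* candidate accepted: no goto needed */
		break;
	}
	return found;
}
```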

> 
> > +	mutex_unlock(&nvme_subsystems_lock);
> > +	return sctrl;
> > +}
> > +
> > +static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
> > +{
> > +	atomic_inc(&sctrl->ccr_limit);
> > +	blk_put_queue(sctrl->admin_q);
> > +	nvme_put_ctrl(sctrl);
> > +}
> > +
> > +static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > +{
> > +	struct nvme_ccr_entry ccr = { };
> > +	union nvme_result res = { 0 };
> > +	struct nvme_command c = { };
> > +	unsigned long flags, tmo;
> > +	int ret = 0;
> > +	u32 result;
> > +
> > +	init_completion(&ccr.complete);
> > +	ccr.ictrl = ictrl;
> > +
> > +	spin_lock_irqsave(&sctrl->lock, flags);
> > +	list_add_tail(&ccr.list, &sctrl->ccr_list);
> > +	spin_unlock_irqrestore(&sctrl->lock, flags);
> > +
> > +	c.ccr.opcode = nvme_admin_cross_ctrl_reset;
> > +	c.ccr.ciu = ictrl->ciu;
> > +	c.ccr.icid = cpu_to_le16(ictrl->cntlid);
> > +	c.ccr.cirn = cpu_to_le64(ictrl->cirn);
> > +	ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
> > +				     NULL, 0, NVME_QID_ANY, 0);
> > +	if (ret)
> > +		goto out;
> > +
> > +	result = le32_to_cpu(res.u32);
> > +	if (result & 0x01) /* Immediate Reset Successful */
> > +		goto out;
> > +
> > +	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> > +	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > +		ret = -ETIMEDOUT;
> > +		goto out;
> > +	}
> > +
> > +	if (ccr.ccrs != NVME_CCR_STATUS_SUCCESS)
> > +		ret = -EREMOTEIO;
> > +out:
> > +	spin_lock_irqsave(&sctrl->lock, flags);
> > +	list_del(&ccr.list);
> > +	spin_unlock_irqrestore(&sctrl->lock, flags);
> > +	return ret;
> > +}
> > +
> > +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > +{
> > +	unsigned long deadline, now, timeout;
> > +	struct nvme_ctrl *sctrl;
> > +	u32 min_cntlid = 0;
> > +	int ret;
> > +
> > +	timeout = nvme_fence_timeout_ms(ictrl);
> > +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > +
> > +	now = jiffies;
> > +	deadline = now + msecs_to_jiffies(timeout);
> > +	while (time_before(now, deadline)) {
> > +		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> > +		if (!sctrl) {
> > +			/* CCR failed, switch to time-based recovery */
> > +			return deadline - now;
> > +		}
> > +
> > +		ret = nvme_issue_wait_ccr(sctrl, ictrl);
> > +		if (!ret) {
> > +			dev_info(ictrl->device, "CCR succeeded using %s\n",
> > +				 dev_name(sctrl->device));
> > +			nvme_put_ctrl_ccr(sctrl);
> > +			return 0;
> > +		}
> > +
> > +		/* CCR failed, try another path */
> > +		min_cntlid = sctrl->cntlid + 1;
> > +		nvme_put_ctrl_ccr(sctrl);
> > +		now = jiffies;
> > +	}
> 
> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()' 
> returns an error. _And_ if the CCR itself runs into a timeout we would
> never have tried another path (which could have succeeded).

True. We can do only one thing at a time within the CCR time budget:
either wait for the CCR to succeed, or give up early and try another
path. It is a trade-off.

> 
> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
> and only switch to the next controller if the submission of CCR failed.
> Once that is done we can 'just' wait for completion, as a failure there
> will be after KATO timeout anyway and any subsequent CCR would be pointless.

If I understood this correctly, we will stick with the first sctrl that
accepts the CCR command, wait for the CCR to complete, and give up on
fencing ictrl if the CCR operation fails or times out. Did I get this
right?

If so, why is this better than the current logic?

Currently nvme_issue_wait_ccr() waits max(cqt, kato) for CCR to
complete. If we change this logic, should it wait for "deadline - now"
for CCR to complete? Or keep it as it is?

The spec does not say how much time to wait for the CCR operation to
complete. My impression is that max(cqt, kato) is a reasonable amount of
time to wait. If we do not hear from sctrl by then, we should switch to
the next path.
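For reference, the wait budget in nvme_issue_wait_ccr() boils down to a
unit-aware maximum. A standalone sketch, assuming (as the quoted code
does) that the quiesce time is already in milliseconds and kato is in
seconds:

```c
/* Time to wait for a CCR to complete: the larger of the Command
 * Quiesce Time (ms) and the keep-alive timeout (s), in milliseconds. */
static unsigned int ccr_wait_ms(unsigned int cqt_ms, unsigned int kato_s)
{
	unsigned int kato_ms = kato_s * 1000;

	return cqt_ms > kato_ms ? cqt_ms : kato_ms;
}
```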

> 
> > +
> > +	dev_info(ictrl->device, "CCR reached timeout, call it done\n");
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_fence_ctrl);
> > +
> >   bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >   		enum nvme_ctrl_state new_state)
> >   {
> > @@ -5119,6 +5247,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> >   
> >   	mutex_init(&ctrl->scan_lock);
> >   	INIT_LIST_HEAD(&ctrl->namespaces);
> > +	INIT_LIST_HEAD(&ctrl->ccr_list);
> >   	xa_init(&ctrl->cels);
> >   	ctrl->dev = dev;
> >   	ctrl->ops = ops;
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index 00866bbc66f3..fa18f580d76a 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -279,6 +279,13 @@ enum nvme_ctrl_flags {
> >   	NVME_CTRL_FROZEN		= 6,
> >   };
> >   
> > +struct nvme_ccr_entry {
> > +	struct list_head list;
> > +	struct completion complete;
> > +	struct nvme_ctrl *ictrl;
> > +	u8 ccrs;
> > +};
> > +
> >   struct nvme_ctrl {
> >   	bool comp_seen;
> >   	bool identified;
> > @@ -296,6 +303,7 @@ struct nvme_ctrl {
> >   	struct blk_mq_tag_set *tagset;
> >   	struct blk_mq_tag_set *admin_tagset;
> >   	struct list_head namespaces;
> > +	struct list_head ccr_list;
> >   	struct mutex namespaces_lock;
> >   	struct srcu_struct srcu;
> >   	struct device ctrl_device;
> > @@ -814,6 +822,7 @@ blk_status_t nvme_host_path_error(struct request *req);
> >   bool nvme_cancel_request(struct request *req, void *data);
> >   void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
> >   void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
> > +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ctrl);
> >   bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> >   		enum nvme_ctrl_state new_state);
> >   int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2 09/14] nvme: Implement cross-controller reset completion
  2026-02-03  5:22   ` Hannes Reinecke
@ 2026-02-03 20:07     ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 20:07 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 06:22:38 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > An nvme source controller that issues CCR command expects to receive an
> > NVME_AER_NOTICE_CCR_COMPLETED when pending CCR succeeds or fails. Add
> > sctrl->ccr_work to read NVME_LOG_CCR logpage and wakeup any thread
> > waiting on CCR completion.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/core.c | 49 +++++++++++++++++++++++++++++++++++++++-
> >   drivers/nvme/host/nvme.h |  1 +
> >   2 files changed, 49 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 13e0775d56b4..0f90feb46369 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1901,7 +1901,8 @@ EXPORT_SYMBOL_GPL(nvme_set_queue_count);
> >   
> >   #define NVME_AEN_SUPPORTED \
> >   	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT | \
> > -	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_DISC_CHANGE)
> > +	 NVME_AEN_CFG_ANA_CHANGE | NVME_AEN_CFG_CCR_COMPLETE | \
> > +	 NVME_AEN_CFG_DISC_CHANGE)
> >   
> >   static void nvme_enable_aen(struct nvme_ctrl *ctrl)
> >   {
> > @@ -4866,6 +4867,47 @@ static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
> >   	kfree(log);
> >   }
> >   
> > +static void nvme_ccr_work(struct work_struct *work)
> > +{
> > +	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ccr_work);
> > +	struct nvme_ccr_entry *ccr;
> > +	struct nvme_ccr_log_entry *entry;
> > +	struct nvme_ccr_log *log;
> > +	unsigned long flags;
> > +	int ret, i;
> > +
> > +	log = kmalloc(sizeof(*log), GFP_KERNEL);
> > +	if (!log)
> > +		return;
> > +
> > +	ret = nvme_get_log(ctrl, 0, NVME_LOG_CCR, 0x01,
> > +			   0x00, log, sizeof(*log), 0);
> > +	if (ret)
> > +		goto out;
> > +
> > +	spin_lock_irqsave(&ctrl->lock, flags);
> > +	for (i = 0; i < le16_to_cpu(log->ne); i++) {
> > +		entry = &log->entries[i];
> > +		if (entry->ccrs == NVME_CCR_STATUS_IN_PROGRESS)
> > +			continue;
> > +
> > +		list_for_each_entry(ccr, &ctrl->ccr_list, list) {
> > +			struct nvme_ctrl *ictrl = ccr->ictrl;
> > +
> > +			if (ictrl->cntlid != le16_to_cpu(entry->icid) ||
> > +			    ictrl->ciu != entry->ciu)
> > +				continue;
> > +
> > +			/* Complete matching entry */
> > +			ccr->ccrs = entry->ccrs;
> > +			complete(&ccr->complete);
> > +		}
> > +	}
> > +	spin_unlock_irqrestore(&ctrl->lock, flags);
> > +out:
> > +	kfree(log);
> > +}
> > +
> >   static void nvme_fw_act_work(struct work_struct *work)
> >   {
> >   	struct nvme_ctrl *ctrl = container_of(work,
> > @@ -4942,6 +4984,9 @@ static bool nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
> >   	case NVME_AER_NOTICE_DISC_CHANGED:
> >   		ctrl->aen_result = result;
> >   		break;
> > +	case NVME_AER_NOTICE_CCR_COMPLETED:
> > +		queue_work(nvme_wq, &ctrl->ccr_work);
> > +		break;
> >   	default:
> >   		dev_warn(ctrl->device, "async event result %08x\n", result);
> >   	}
> > @@ -5131,6 +5176,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
> >   	nvme_stop_failfast_work(ctrl);
> >   	flush_work(&ctrl->async_event_work);
> >   	cancel_work_sync(&ctrl->fw_act_work);
> > +	cancel_work_sync(&ctrl->ccr_work);
> >   	if (ctrl->ops->stop_ctrl)
> >   		ctrl->ops->stop_ctrl(ctrl);
> >   }
> > @@ -5254,6 +5300,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> >   	ctrl->quirks = quirks;
> >   	ctrl->numa_node = NUMA_NO_NODE;
> >   	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
> > +	INIT_WORK(&ctrl->ccr_work, nvme_ccr_work);
> >   	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
> >   	INIT_WORK(&ctrl->fw_act_work, nvme_fw_act_work);
> >   	INIT_WORK(&ctrl->delete_work, nvme_delete_ctrl_work);
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index fa18f580d76a..a7f382e35821 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -366,6 +366,7 @@ struct nvme_ctrl {
> >   	struct nvme_effects_log *effects;
> >   	struct xarray cels;
> >   	struct work_struct scan_work;
> > +	struct work_struct ccr_work;
> >   	struct work_struct async_event_work;
> >   	struct delayed_work ka_work;
> >   	struct delayed_work failfast_work;
> 
> This confuses me. Why do we call 'complete()' but do not have a 
> corresponding 'wait_for_completion()' call?

I think because the two commits were developed separately.

> 
> Please merge with the next patch to allow reviewers to have the full
> picture.

I thought it was easier to review this way because this change adds
ctrl->ccr_work. If you feel strongly about it, I will merge it into the
previous patch "[PATCH v2 08/14] nvme: Implement cross-controller reset
recovery".

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-02-03  5:34   ` Hannes Reinecke
@ 2026-02-03 21:24     ` Mohamed Khalfella
  2026-02-04  0:48       ` Randy Jennings
  2026-02-04  2:57       ` Hannes Reinecke
  0 siblings, 2 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 21:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > An alive nvme controller that hits an error now will move to FENCING
> > state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> > terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> > and continue error recovery as usual. If CCR fails, the behavior depends
> > on whether the subsystem supports CQT or not. If CQT is not supported
> > then reset the controller immediately as if CCR succeeded in order to
> > maintain the current behavior. If CQT is supported switch to time-based
> > recovery. Schedule ctrl->fenced_work resets the controller when time
> > based recovery finishes.
> > 
> > Either ctrl->err_work or ctrl->reset_work can run after a controller is
> > fenced. Flush fencing work when either work run.
> > 
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 61 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 69cb04406b47..af8d3b36a4bb 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
> >   	struct sockaddr_storage src_addr;
> >   	struct nvme_ctrl	ctrl;
> >   
> > +	struct work_struct	fencing_work;
> > +	struct delayed_work	fenced_work;
> >   	struct work_struct	err_work;
> >   	struct delayed_work	connect_work;
> >   	struct nvme_tcp_request async_req;
> > @@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >   
> >   static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> >   {
> > +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> > +		dev_warn(ctrl->device, "starting controller fencing\n");
> > +		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> > +		return;
> > +	}
> > +
> 
> Don't you need to flush any outstanding 'fenced_work' queue items here
> before calling 'queue_work()'?

I do not think we need to flush ctrl->fencing_work. It cannot be running
at this time.

> 
> >   	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >   		return;
> >   
> > @@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> >   	nvme_tcp_reconnect_or_remove(ctrl, ret);
> >   }
> >   
> > +static void nvme_tcp_fenced_work(struct work_struct *work)
> > +{
> > +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> > +					struct nvme_tcp_ctrl, fenced_work);
> > +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> > +
> > +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> > +}
> > +
> > +static void nvme_tcp_fencing_work(struct work_struct *work)
> > +{
> > +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> > +			struct nvme_tcp_ctrl, fencing_work);
> > +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> > +	unsigned long rem;
> > +
> > +	rem = nvme_fence_ctrl(ctrl);
> > +	if (!rem)
> > +		goto done;
> > +
> > +	if (!ctrl->cqt) {
> > +		dev_info(ctrl->device,
> > +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> > +		goto done;
> > +	}
> > +
> 
> As mentioned, cqt handling should be part of another patchset.

Let us suppose we drop cqt from this patchset:

- How will we be able to calculate the CCR time budget?
  Currently it is calculated by nvme_fence_timeout_ms().

- What should we do if CCR fails? Retry requests immediately?

> > +	dev_info(ctrl->device,
> > +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> > +		 jiffies_to_msecs(rem));
> > +	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
> > +	return;
> > +
> 
> Why do you need the 'fenced' workqueue at all? All it does is queing yet 
> another workqueue item, which certainly can be done from the 'fencing' 
> workqueue directly, no?

It is possible to drop ctrl->fenced_work and requeue ctrl->fencing_work
as delayed work to implement the request hold time. If we do that, we
need to modify nvme_tcp_fencing_work() to tell whether it is being
called for 'fencing' or 'fenced'. The first version of this patch used a
controller flag, RECOVERED, for that, and it was suggested to use a
separate work item to simplify the logic and drop the controller flag.
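As a standalone toy model (plain C, hypothetical names, not the driver
code) of the split described above: the fencing work runs CCR and either
fences the controller immediately or arms the delayed fenced work with
the remaining hold time.

```c
enum fence_next {
	FENCE_NOW,	/* CCR done (or no CQT): reset immediately */
	FENCE_DELAYED,	/* hold requests; fenced work runs later */
};

struct fence_step {
	enum fence_next next;
	unsigned int delay_ms;	/* only meaningful for FENCE_DELAYED */
};

/* Outcome of the fencing work item: rem_ms is the hold time left after
 * CCR (0 on success), has_cqt says whether the subsystem reports a
 * Command Quiesce Time. */
static struct fence_step fencing_work(unsigned int rem_ms, int has_cqt)
{
	struct fence_step s = { FENCE_NOW, 0 };

	if (rem_ms && has_cqt) {
		s.next = FENCE_DELAYED;	/* time-based recovery */
		s.delay_ms = rem_ms;
	}
	return s;
}
```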

> 
> > +done:
> > +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> > +}
> > +
> > +static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
> > +{
> > +	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> > +	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
> > +}
> > +
> >   static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   {
> >   	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> >   				struct nvme_tcp_ctrl, err_work);
> >   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >   
> > +	nvme_tcp_flush_fencing_work(ctrl);
> 
> Why not 'fenced_work' ?

You mean rename nvme_tcp_flush_fencing_work() to
nvme_tcp_flush_fenced_work()?

If yes, then I can do that if you think it makes more sense.

> 
> >   	if (nvme_tcp_key_revoke_needed(ctrl))
> >   		nvme_auth_revoke_tls_key(ctrl);
> >   	nvme_stop_keep_alive(ctrl);
> > @@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> >   		container_of(work, struct nvme_ctrl, reset_work);
> >   	int ret;
> >   
> > +	nvme_tcp_flush_fencing_work(ctrl);
> 
> Same.
> 
> >   	if (nvme_tcp_key_revoke_needed(ctrl))
> >   		nvme_auth_revoke_tls_key(ctrl);
> >   	nvme_stop_ctrl(ctrl);
> > @@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
> >   	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
> >   	struct nvme_command *cmd = &pdu->cmd;
> >   	int qid = nvme_tcp_queue_id(req->queue);
> > +	enum nvme_ctrl_state state;
> >   
> >   	dev_warn(ctrl->device,
> >   		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
> >   		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
> >   		 nvme_fabrics_opcode_str(qid, cmd), qid);
> >   
> > -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> > +	state = nvme_ctrl_state(ctrl);
> > +	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
> 
> 'FENCED' too, presumably?

I do not think it makes a difference here. FENCED and RESETTING are
almost the same states.

> 
> >   		/*
> >   		 * If we are resetting, connecting or deleting we should
> >   		 * complete immediately because we may block controller
> > @@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
> >   
> >   	INIT_DELAYED_WORK(&ctrl->connect_work,
> >   			nvme_tcp_reconnect_ctrl_work);
> > +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
> > +	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
> >   	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
> >   	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
> >   
> 
> Here you are calling CCR whenever error recovery is triggered.
> This will cause CCR to be send from a command timeout, which is
> technically wrong (CCR should be send when the KATO timeout expires,
> not when a command timout expires). Both could be vastly different.

KATO is driven by the host. What does "KATO expires" mean here?
I think KATO expiry is more applicable to the target, no?

Isn't a KATO timeout a signal that the target is not reachable or that
something is wrong with the target?

> 
> So I'd prefer to have CCR send whenever KATO timeout triggers, and
> lease to current command timeout mechanism in place.

Assuming we use CCR only when a KATO request times out, what should we
do when we hit other errors?

nvme_tcp_error_recovery() is called from many places to handle errors,
and it effectively resets the controller. What should this function do,
if not trigger CCR?
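To make the question concrete, a toy model (plain C, hypothetical names,
not the driver and not the exact core state machine) of how the patch
gates recovery on controller state: any error source funnels into error
recovery, which starts fencing only from LIVE.

```c
enum ctrl_state { ST_LIVE, ST_FENCING, ST_RESETTING, ST_DELETING };

/* Toy gate: an error on a LIVE controller starts fencing (which issues
 * CCR); in every other state a recovery or teardown is already in
 * flight, so the error is absorbed and the state left alone. */
static enum ctrl_state error_recovery(enum ctrl_state cur)
{
	return cur == ST_LIVE ? ST_FENCING : cur;
}
```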

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-03  5:40   ` Hannes Reinecke
@ 2026-02-03 21:29     ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-03 21:29 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue 2026-02-03 06:40:28 +0100, Hannes Reinecke wrote:
> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > nvme_fc_error_recovery() called from nvme_fc_timeout() while controller
> > in CONNECTING state results in deadlock reported in link below. Update
> > nvme_fc_timeout() to schedule error recovery to avoid the deadlock.
> > 
> > Previous to this change if controller was LIVE error recovery resets
> > the controller and this does not match nvme-tcp and nvme-rdma. Decouple
> > error recovery from controller reset to match other fabric transports.
> > 
> > Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
> >   1 file changed, 41 insertions(+), 53 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index 6948de3f438a..f8f6071b78ed 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
> >   static struct device *fc_udev_device;
> >   
> >   static void nvme_fc_complete_rq(struct request *rq);
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg);
> >   
> >   /* *********************** FC-NVME Port Management ************************ */
> >   
> > @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> >   		"Reconnect", ctrl->cnum);
> >   
> >   	set_bit(ASSOC_FAILED, &ctrl->flags);
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
> >   }
> >   
> >   /**
> > @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> >   static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> >   static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> >   
> > -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> > +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
> >   
> >   static void
> >   __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> > @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
> >   	 * for the association have been ABTS'd by
> >   	 * nvme_fc_delete_association().
> >   	 */
> > -
> > -	/* fail the association */
> > -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> > +	nvme_fc_start_ioerr_recovery(ctrl,
> > +				     "Disconnect Association LS received");
> >   
> >   	/* release the reference taken by nvme_fc_match_disconn_ls() */
> >   	nvme_fc_ctrl_put(ctrl);
> > @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> >   	struct nvme_fc_ctrl *ctrl =
> >   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
> >   
> > -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> > +	nvme_fc_error_recovery(ctrl);
> >   }
> >   
> >   /*
> > @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
> >   }
> >   EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
> >   
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg)
> > +{
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> > +		return;
> > +
> > +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > +		 ctrl->cnum, errmsg);
> > +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +}
> > +
> >   static void
> >   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   {
> > @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   		nvme_fc_complete_rq(rq);
> >   
> >   check_error:
> > -	if (terminate_assoc &&
> > -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> > -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +	if (terminate_assoc)
> > +		nvme_fc_start_ioerr_recovery(ctrl, "io error");
> >   }
> >   
> >   static int
> > @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >   		nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >   }
> >   
> > -static void
> > -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> > -{
> > -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> > -
> > -	/*
> > -	 * if an error (io timeout, etc) while (re)connecting, the remote
> > -	 * port requested terminating of the association (disconnect_ls)
> > -	 * or an error (timeout or abort) occurred on an io while creating
> > -	 * the controller.  Abort any ios on the association and let the
> > -	 * create_association error path resolve things.
> > -	 */
> > -	if (state == NVME_CTRL_CONNECTING) {
> > -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> > -		dev_warn(ctrl->ctrl.device,
> > -			"NVME-FC{%d}: transport error during (re)connect\n",
> > -			ctrl->cnum);
> > -		return;
> > -	}
> > -
> > -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> > -	if (state != NVME_CTRL_LIVE)
> > -		return;
> > -
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: transport association event: %s\n",
> > -		ctrl->cnum, errmsg);
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> > -
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > -}
> > -
> >   static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   {
> >   	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> > @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >   	struct nvme_command *sqe = &cmdiu->sqe;
> >   
> > -	/*
> > -	 * Attempt to abort the offending command. Command completion
> > -	 * will detect the aborted io and will fail the connection.
> > -	 */
> >   	dev_info(ctrl->ctrl.device,
> >   		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >   		"x%08x/x%08x\n",
> >   		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >   		nvme_fabrics_opcode_str(qnum, sqe),
> >   		sqe->common.cdw10, sqe->common.cdw11);
> > -	if (__nvme_fc_abort_op(ctrl, op))
> > -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >   
> > -	/*
> > -	 * the io abort has been initiated. Have the reset timer
> > -	 * restarted and the abort completion will complete the io
> > -	 * shortly. Avoids a synchronous wait while the abort finishes.
> > -	 */
> > +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> >   	return BLK_EH_RESET_TIMER;
> >   }
> >   
> > @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >   	}
> >   }
> >   
> > +static void
> > +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> > +{
> > +	nvme_stop_keep_alive(&ctrl->ctrl);
> > +	nvme_stop_ctrl(&ctrl->ctrl);
> > +
> > +	/* will block while waiting for io to terminate */
> > +	nvme_fc_delete_association(ctrl);
> > +
> > +	/* Do not reconnect if controller is being deleted */
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> > +		return;
> > +
> > +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> > +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> > +		return;
> > +	}
> > +
> > +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> > +}
> >   
> >   static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> >   	.name			= "fc",
> 
> I really don't get it. Why do you need to do additional steps here, when
> all you do is split an existing function in half?
> 

Can you help me and point out the part that is additional?

Is it nvme_stop_keep_alive()? If yes, then it matches other transports.
I am okay with removing it, assuming the work will stop when it hits an
error, like it does today.

> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-03 19:19   ` James Smart
@ 2026-02-03 22:49     ` James Smart
  2026-02-04  0:15       ` Mohamed Khalfella
  2026-02-04  0:11     ` Mohamed Khalfella
  1 sibling, 1 reply; 82+ messages in thread
From: James Smart @ 2026-02-03 22:49 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, jsmart833426

On 2/3/2026 11:19 AM, James Smart wrote:
> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
...
>>   static void
>>   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>>   {
>> @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
>>           nvme_fc_complete_rq(rq);
>>   check_error:
>> -    if (terminate_assoc &&
>> -        nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
>> -        queue_work(nvme_reset_wq, &ctrl->ioerr_work);
>> +    if (terminate_assoc)
>> +        nvme_fc_start_ioerr_recovery(ctrl, "io error");
> 
> this is ok. the ioerr_recovery will bounce the RESETTING state if it's 
> already in the state. So this is a little cleaner.

What is problematic here is - if the start_ioerr path includes the 
CONNECTING logic that terminates i/o's, it's running in the LLDD's 
context that called this iodone routine. Not good. In existing code, the 
LLDD context was swapped to the work queue where error_recovery was called.

> 
>>   }
>>   static int
>> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct 
>> nvme_fc_ctrl *ctrl, bool start_queues)
>>           nvme_unquiesce_admin_queue(&ctrl->ctrl);
>>   }
>> -static void
>> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
>> -{
>> -    enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
>> -
>> -    /*
>> -     * if an error (io timeout, etc) while (re)connecting, the remote
>> -     * port requested terminating of the association (disconnect_ls)
>> -     * or an error (timeout or abort) occurred on an io while creating
>> -     * the controller.  Abort any ios on the association and let the
>> -     * create_association error path resolve things.
>> -     */
>> -    if (state == NVME_CTRL_CONNECTING) {
>> -        __nvme_fc_abort_outstanding_ios(ctrl, true);
>> -        dev_warn(ctrl->ctrl.device,
>> -            "NVME-FC{%d}: transport error during (re)connect\n",
>> -            ctrl->cnum);
>> -        return;
>> -    }
> 
> This logic needs to be preserved. Its no longer part of 
> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be 
> "fenced". They should fail immediately.

this logic, if left in start_ioerr_recovery


-- james


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-03 19:19   ` James Smart
  2026-02-03 22:49     ` James Smart
@ 2026-02-04  0:11     ` Mohamed Khalfella
  2026-02-05  0:08       ` James Smart
  1 sibling, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-04  0:11 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > nvme_fc_error_recovery() called from nvme_fc_timeout() while the
> > controller is in CONNECTING state results in the deadlock reported in
> > the link below. Update nvme_fc_timeout() to schedule error recovery to
> > avoid the deadlock.
> > 
> > Prior to this change, if the controller was LIVE, error recovery reset
> > the controller; this does not match nvme-tcp and nvme-rdma.
> 
> It is not intended to match tcp/rdma. Using the reset path was done to 
> avoid code duplication of paths to tear down the association.  FC, given 
> we interact with an HBA for device and io state and have a lot of async 
> io completions, requires a lot more work than straight data structure 
> teardown in rdma/tcp.
> 
> I agree with wanting to changeup the execution thread for the deadlock.
> 
> 
> > Decouple error recovery from controller reset to match other fabric transports.
> > Link: https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> > ---
> >   drivers/nvme/host/fc.c | 94 ++++++++++++++++++------------------------
> >   1 file changed, 41 insertions(+), 53 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> > index 6948de3f438a..f8f6071b78ed 100644
> > --- a/drivers/nvme/host/fc.c
> > +++ b/drivers/nvme/host/fc.c
> > @@ -227,6 +227,8 @@ static DEFINE_IDA(nvme_fc_ctrl_cnt);
> >   static struct device *fc_udev_device;
> >   
> >   static void nvme_fc_complete_rq(struct request *rq);
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg);
> >   
> >   /* *********************** FC-NVME Port Management ************************ */
> >   
> > @@ -788,7 +790,7 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
> >   		"Reconnect", ctrl->cnum);
> >   
> >   	set_bit(ASSOC_FAILED, &ctrl->flags);
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > +	nvme_fc_start_ioerr_recovery(ctrl, "Connectivity Loss");
> >   }
> >   
> >   /**
> > @@ -985,7 +987,7 @@ fc_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
> >   static void nvme_fc_ctrl_put(struct nvme_fc_ctrl *);
> >   static int nvme_fc_ctrl_get(struct nvme_fc_ctrl *);
> >   
> > -static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg);
> > +static void nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl);
> >   
> >   static void
> >   __nvme_fc_finish_ls_req(struct nvmefc_ls_req_op *lsop)
> > @@ -1567,9 +1569,8 @@ nvme_fc_ls_disconnect_assoc(struct nvmefc_ls_rcv_op *lsop)
> >   	 * for the association have been ABTS'd by
> >   	 * nvme_fc_delete_association().
> >   	 */
> > -
> > -	/* fail the association */
> > -	nvme_fc_error_recovery(ctrl, "Disconnect Association LS received");
> > +	nvme_fc_start_ioerr_recovery(ctrl,
> > +				     "Disconnect Association LS received");
> >   
> >   	/* release the reference taken by nvme_fc_match_disconn_ls() */
> >   	nvme_fc_ctrl_put(ctrl);
> > @@ -1871,7 +1872,7 @@ nvme_fc_ctrl_ioerr_work(struct work_struct *work)
> >   	struct nvme_fc_ctrl *ctrl =
> >   			container_of(work, struct nvme_fc_ctrl, ioerr_work);
> >   
> > -	nvme_fc_error_recovery(ctrl, "transport detected io error");
> > +	nvme_fc_error_recovery(ctrl);
> 
> hmm.. not sure how I feel about this. There is at least a break in reset 
> processing that is no longer present - e.g. prior queued ioerr_work, 
> which would then queue reset_work. This effectively calls the reset_work 
> handler directly. I assume it should be ok.
> 
> >   }
> >   
> >   /*
> > @@ -1892,6 +1893,17 @@ char *nvme_fc_io_getuuid(struct nvmefc_fcp_req *req)
> >   }
> >   EXPORT_SYMBOL_GPL(nvme_fc_io_getuuid);
> >   
> > +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> > +					 char *errmsg)
> > +{
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> > +		return;
> > +
> > +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> > +		 ctrl->cnum, errmsg);
> > +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +}
> > +
> 
> Disagree with this.
> 
> The clause in error_recovery around the CONNECTING state is pretty 
> important to terminate io occurring during connect/reconnect where the 
> ctrl state should not change. we don't want start_ioerr making it RESETTING.
> 
> This should be reworked.

Like you pointed out, this changes the current behavior for the
CONNECTING state.

Before this change the controller state stays in CONNECTING while all
IOs are aborted. Aborting the IOs causes nvme_fc_create_association()
to fail, and reconnect might be attempted again.

The new behavior switches to RESETTING and queues ctrl->ioerr_work.
ioerr_work will abort outstanding IOs, switch back to CONNECTING, and
attempt reconnect.

nvme_fc_error_recovery() ->
  nvme_stop_keep_alive() /* should not make a difference */
  nvme_stop_ctrl()       /* should be okay to run */
  nvme_fc_delete_association() ->
    __nvme_fc_abort_outstanding_ios(ctrl, false)
    nvme_unquiesce_admin_queue()
    nvme_unquiesce_io_queues()
    nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
    if (port_state == ONLINE)
      queue_work(ctrl->connect)
    else
      nvme_fc_reconnect_or_delete();

Yes, this is a different behavior. IMO it is simpler to follow and
closer to what other transports do, keeping in mind the async abort
nature of FC.

Aside from being different, what is wrong with it?

> 
> >   static void
> >   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   {
> > @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >   		nvme_fc_complete_rq(rq);
> >   
> >   check_error:
> > -	if (terminate_assoc &&
> > -	    nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> > -		queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> > +	if (terminate_assoc)
> > +		nvme_fc_start_ioerr_recovery(ctrl, "io error");
> 
> this is ok. the ioerr_recovery will bounce the RESETTING state if it's 
> already in the state. So this is a little cleaner.
> 
> >   }
> >   
> >   static int
> > @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >   		nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >   }
> >   
> > -static void
> > -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> > -{
> > -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> > -
> > -	/*
> > -	 * if an error (io timeout, etc) while (re)connecting, the remote
> > -	 * port requested terminating of the association (disconnect_ls)
> > -	 * or an error (timeout or abort) occurred on an io while creating
> > -	 * the controller.  Abort any ios on the association and let the
> > -	 * create_association error path resolve things.
> > -	 */
> > -	if (state == NVME_CTRL_CONNECTING) {
> > -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> > -		dev_warn(ctrl->ctrl.device,
> > -			"NVME-FC{%d}: transport error during (re)connect\n",
> > -			ctrl->cnum);
> > -		return;
> > -	}
> 
> This logic needs to be preserved. Its no longer part of 
> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be 
> "fenced". They should fail immediately.

I think this is similar to the point above.

> 
> > -
> > -	/* Otherwise, only proceed if in LIVE state - e.g. on first error */
> > -	if (state != NVME_CTRL_LIVE)
> > -		return;
> 
> This was to filter out multiple requests of the reset. I guess that is 
> what happens now in start_ioerr when attempting to set state to 
> RESETTING and already RESETTING.

Yes. In this case nvme_fc_start_ioerr_recovery() will do nothing.

> 
> There is a small difference here in that the existing code avoids doing 
> the ctrl reset if the controller is NEW. start_ioerr will change the 
> ctrl to RESETTING. I'm not sure how much of an impact that is.
> 

I think little is done while the controller is in the NEW state.
Let me know if I am missing something.

> 
> > -
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: transport association event: %s\n",
> > -		ctrl->cnum, errmsg);
> > -	dev_warn(ctrl->ctrl.device,
> > -		"NVME-FC{%d}: resetting controller\n", ctrl->cnum);
> 
> I haven't paid much attention, but keeping the transport messages for 
> these cases is very very useful for diagnosis.
> 
> > -
> > -	nvme_reset_ctrl(&ctrl->ctrl);
> > -}
> > -
> >   static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   {
> >   	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> > @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >   	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >   	struct nvme_command *sqe = &cmdiu->sqe;
> >   
> > -	/*
> > -	 * Attempt to abort the offending command. Command completion
> > -	 * will detect the aborted io and will fail the connection.
> > -	 */
> >   	dev_info(ctrl->ctrl.device,
> >   		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >   		"x%08x/x%08x\n",
> >   		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >   		nvme_fabrics_opcode_str(qnum, sqe),
> >   		sqe->common.cdw10, sqe->common.cdw11);
> > -	if (__nvme_fc_abort_op(ctrl, op))
> > -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >   
> > -	/*
> > -	 * the io abort has been initiated. Have the reset timer
> > -	 * restarted and the abort completion will complete the io
> > -	 * shortly. Avoids a synchronous wait while the abort finishes.
> > -	 */
> > +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> 
> Why get rid of the abort logic ?
> Note: the error recovery/controller reset is only called when the abort 
> failed.
> 
> I believe you should continue to abort the op.  The fence logic will 
> kick in when the op completes later (along with other io completions). 
> If nothing else, it allows a hw resource to be freed up.

The abort logic in nvme_fc_timeout() is problematic: it does not play
well with aborts initiated from ioerr_work or reset_work. The problem
is that an op aborted from nvme_fc_timeout() is not accounted for when
the controller is reset.

Here is an example scenario.

The first time a request times out it gets aborted; we see this codepath:

nvme_fc_timeout() ->
  __nvme_fc_abort_op() ->
    atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
    ops->abort()
    return 0;

nvme_fc_timeout() always returns BLK_EH_RESET_TIMER, so the same
request can time out again. If the same request hits a timeout again,
__nvme_fc_abort_op() returns -ECANCELED and nvme_fc_error_recovery()
gets called. Assuming the controller is LIVE, it will be reset.

nvme_fc_reset_ctrl_work() ->
  nvme_fc_delete_association() ->
    __nvme_fc_abort_outstanding_ios() ->
      nvme_fc_terminate_exchange() ->
        __nvme_fc_abort_op()

__nvme_fc_abort_op() finds that the op is already aborted. As a result,
ctrl->iocnt will not be incremented for this op. This means that
nvme_fc_delete_association() will not wait for this op to be aborted.

I do not think we want this behavior.

To continue the scenario above: the controller switches to CONNECTING
and the request times out again. This time we hit the deadlock
described in [1].

I think the first abort is the cause of the issue here. With this
change we should not hit the scenario described above.

1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
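To make the accounting gap concrete, here is a minimal userspace model
(hypothetical names, modeled loosely on the __nvme_fc_abort_op()
behavior described above; not the actual driver code) showing why the
reset path never waits for an op that the timeout path already aborted:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { FCPOP_STATE_ACTIVE, FCPOP_STATE_ABORTED };

struct op { _Atomic int state; };
struct ctrl { int iocnt; };	/* delete_association waits for iocnt to drain */

/*
 * Model of the abort: only the caller that wins the ACTIVE -> ABORTED
 * transition proceeds; a later caller sees ABORTED and bails out. The
 * timeout path runs without termio set, so it never bumps iocnt; the
 * reset path runs with termio set but loses the transition, so it does
 * not bump iocnt either.
 */
static int abort_op(struct ctrl *c, struct op *op, bool termio)
{
	if (atomic_exchange(&op->state, FCPOP_STATE_ABORTED) ==
	    FCPOP_STATE_ABORTED)
		return -1;	/* stands in for -ECANCELED in the driver */
	if (termio)
		c->iocnt++;	/* op accounted; reset waits for it */
	return 0;
}
```

In this model the op ends up aborted twice but never accounted, which is
the scenario where nvme_fc_delete_association() returns without waiting
for the op.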

> 
> 
> >   	return BLK_EH_RESET_TIMER;
> >   }
> >   
> > @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >   	}
> >   }
> >   
> > +static void
> > +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> > +{
> > +	nvme_stop_keep_alive(&ctrl->ctrl);
> 
> Curious, why did the stop_keep_alive() call get added to this ?
> Doesn't hurt.
> 
> I assume it was due to other transports having it as they originally 
> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't 
> this be followed by flush_work((&ctrl->ctrl.async_event_work) ?

Yes. I added it because it matches what other transports do.

nvme_fc_error_recovery() ->
  nvme_fc_delete_association() ->
    nvme_fc_abort_aen_ops() ->
      nvme_fc_term_aen_ops() ->
        cancel_work_sync(&ctrl->ctrl.async_event_work);

The above codepath takes care of async_event_work.

> 
> > +	nvme_stop_ctrl(&ctrl->ctrl);
> > +
> > +	/* will block while waiting for io to terminate */
> > +	nvme_fc_delete_association(ctrl);
> > +
> > +	/* Do not reconnect if controller is being deleted */
> > +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> > +		return;
> > +
> > +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> > +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> > +		return;
> > +	}
> > +
> > +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> > +}
> 
> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed 
> into a common helper function invoked by the 2 routines.  Also addresses 
> the missing flush_delayed work in this routine.
> 

Agree, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
common code that can be refactored. However, I do not plan to do that
as part of this change. I will take a look after I get the CCR work
done.

> >   
> >   static const struct nvme_ctrl_ops nvme_fc_ctrl_ops = {
> >   	.name			= "fc",
> 
> 
> -- James
> 
> (new email address. can always reach me at james.smart@broadcom.com as well)
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-03 22:49     ` James Smart
@ 2026-02-04  0:15       ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-04  0:15 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-03 14:49:01 -0800, James Smart wrote:
> On 2/3/2026 11:19 AM, James Smart wrote:
> > On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> ...
> >>   static void
> >>   nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >>   {
> >> @@ -2049,9 +2061,8 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
> >>           nvme_fc_complete_rq(rq);
> >>   check_error:
> >> -    if (terminate_assoc &&
> >> -        nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_RESETTING)
> >> -        queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> >> +    if (terminate_assoc)
> >> +        nvme_fc_start_ioerr_recovery(ctrl, "io error");
> > 
> > this is ok. the ioerr_recovery will bounce the RESETTING state if it's 
> > already in the state. So this is a little cleaner.
> 
> What is problematic here is - if the start_ioerr path includes the 
> CONNECTING logic that terminates i/o's, it's running in the LLDD's 
> context that called this iodone routine. Not good. In existing code, the 
> LLDD context was swapped to the work queue where error_recovery was called.

nvme_fc_start_ioerr_recovery() does not do the work in LLDD context; it
queues ctrl->ioerr_work. This is similar to the existing code. I
responded to the issue with the CONNECTING state in another email.

> 
> > 
> >>   }
> >>   static int
> >> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct 
> >> nvme_fc_ctrl *ctrl, bool start_queues)
> >>           nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >>   }
> >> -static void
> >> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> >> -{
> >> -    enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> >> -
> >> -    /*
> >> -     * if an error (io timeout, etc) while (re)connecting, the remote
> >> -     * port requested terminating of the association (disconnect_ls)
> >> -     * or an error (timeout or abort) occurred on an io while creating
> >> -     * the controller.  Abort any ios on the association and let the
> >> -     * create_association error path resolve things.
> >> -     */
> >> -    if (state == NVME_CTRL_CONNECTING) {
> >> -        __nvme_fc_abort_outstanding_ios(ctrl, true);
> >> -        dev_warn(ctrl->ctrl.device,
> >> -            "NVME-FC{%d}: transport error during (re)connect\n",
> >> -            ctrl->cnum);
> >> -        return;
> >> -    }
> > 
> > This logic needs to be preserved. Its no longer part of 
> > nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be 
> > "fenced". They should fail immediately.
> 
> this logic, if left in start_ioerr_recovery

I think it should be okay to rely on error recovery to handle this
situation.

> 
> 
> -- james


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-02-03 18:14     ` Mohamed Khalfella
@ 2026-02-04  0:34       ` Hannes Reinecke
  2026-02-07 13:41         ` Sagi Grimberg
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  0:34 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/3/26 19:14, Mohamed Khalfella wrote:
> On Tue 2026-02-03 04:03:22 +0100, Hannes Reinecke wrote:
>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> TP8028 Rapid Path Failure Recovery defined new fields in controller
>>> identify response. The newly defined fields are:
>>>
>>> - CIU (Controller Instance Uniquifier): an 8-bit non-zero value that
>>> is assigned a random value when the controller is first created. The
>>> value is expected to be incremented when the RDY bit in the CSTS
>>> register is asserted.
>>> - CIRN (Controller Instance Random Number): a 64-bit random value
>>> generated when the controller is created. CIRN is regenerated every
>>> time the RDY bit in the CSTS register is asserted.
>>> - CCRL (Cross-Controller Reset Limit): an 8-bit value that defines the
>>> maximum number of in-progress controller reset operations. CCRL is
>>> hardcoded to 4 as recommended by TP8028.
>>>
>>> TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
>>> Time) which is used along with KATO (Keep Alive Timeout) to set an upper
>>> time limit for attempting Cross-Controller Recovery. For NVME subsystem
>>> CQT is set to 0 by default to keep the current behavior. The value can
>>> be set from configfs if needed.
>>>
>>> Make the new fields available for IO controllers only since TP8028 is
>>> not very useful for discovery controllers.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>>    drivers/nvme/target/admin-cmd.c |  6 ++++++
>>>    drivers/nvme/target/configfs.c  | 31 +++++++++++++++++++++++++++++++
>>>    drivers/nvme/target/core.c      | 12 ++++++++++++
>>>    drivers/nvme/target/nvmet.h     |  4 ++++
>>>    include/linux/nvme.h            | 15 ++++++++++++---
>>>    5 files changed, 65 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
>>> index 3da31bb1183e..ade1145df72d 100644
>>> --- a/drivers/nvme/target/admin-cmd.c
>>> +++ b/drivers/nvme/target/admin-cmd.c
>>> @@ -696,6 +696,12 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
>>>    
>>>    	id->cntlid = cpu_to_le16(ctrl->cntlid);
>>>    	id->ver = cpu_to_le32(ctrl->subsys->ver);
>>> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
>>> +		id->cqt = cpu_to_le16(ctrl->cqt);
>>> +		id->ciu = ctrl->ciu;
>>> +		id->cirn = cpu_to_le64(ctrl->cirn);
>>> +		id->ccrl = NVMF_CCR_LIMIT;
>>> +	}
>>>    
>>>    	/* XXX: figure out what to do about RTD3R/RTD3 */
>>>    	id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
>>> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
>>> index e44ef69dffc2..035f6e75a818 100644
>>> --- a/drivers/nvme/target/configfs.c
>>> +++ b/drivers/nvme/target/configfs.c
>>> @@ -1636,6 +1636,36 @@ static ssize_t nvmet_subsys_attr_pi_enable_store(struct config_item *item,
>>>    CONFIGFS_ATTR(nvmet_subsys_, attr_pi_enable);
>>>    #endif
>>>    
>>> +static ssize_t nvmet_subsys_attr_cqt_show(struct config_item *item,
>>> +					  char *page)
>>> +{
>>> +	return snprintf(page, PAGE_SIZE, "%u\n", to_subsys(item)->cqt);
>>> +}
>>> +
>>> +static ssize_t nvmet_subsys_attr_cqt_store(struct config_item *item,
>>> +					   const char *page, size_t cnt)
>>> +{
>>> +	struct nvmet_subsys *subsys = to_subsys(item);
>>> +	struct nvmet_ctrl *ctrl;
>>> +	u16 cqt;
>>> +
>>> +	if (sscanf(page, "%hu\n", &cqt) != 1)
>>> +		return -EINVAL;
>>> +
>>> +	down_write(&nvmet_config_sem);
>>> +	if (subsys->cqt == cqt)
>>> +		goto out;
>>> +
>>> +	subsys->cqt = cqt;
>>> +	/* Force reconnect */
>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
>>> +		ctrl->ops->delete_ctrl(ctrl);
>>> +out:
>>> +	up_write(&nvmet_config_sem);
>>> +	return cnt;
>>> +}
>>> +CONFIGFS_ATTR(nvmet_subsys_, attr_cqt);
>>> +
>>>    static ssize_t nvmet_subsys_attr_qid_max_show(struct config_item *item,
>>>    					      char *page)
>>>    {
>>> @@ -1676,6 +1706,7 @@ static struct configfs_attribute *nvmet_subsys_attrs[] = {
>>>    	&nvmet_subsys_attr_attr_vendor_id,
>>>    	&nvmet_subsys_attr_attr_subsys_vendor_id,
>>>    	&nvmet_subsys_attr_attr_model,
>>> +	&nvmet_subsys_attr_attr_cqt,
>>>    	&nvmet_subsys_attr_attr_qid_max,
>>>    	&nvmet_subsys_attr_attr_ieee_oui,
>>>    	&nvmet_subsys_attr_attr_firmware,
>>
>> I do think that TP8028 (i.e. the CQT definitions) is somewhat
>> independent of CCR. So I'm not sure they should be integrated in this
>> patchset; personally I would prefer to have them moved to another
>> patchset.
> 
> Agreed that CQT is not directly related to CCR from the target
> perspective. But there is a relationship when it comes to how the
> initiator uses CQT to calculate the time budget for CCR. As you know,
> on the host side, if CCR fails and CQT is supported the requests need
> to be held for a certain amount of time before they are retried. So
> the CQT value is needed, and that is why I included it in this
> patchset.
> 
Oh, I don't disagree that CQT is needed to get full spec compliance.
But my point here is that there is no requirement that we need to
achieve full spec compliance _with this patchset_.
We are not compliant with the current implementation, and we don't
lose anything if we stay non-compliant after this patch.
 From my POV CCR is about resetting controllers and aborting commands,
and CQT is about inserting a delay before retrying commands.
And there is no dependency between those two; you can send CCR
without inserting a delay before retrying, and you can insert a
delay before retrying commands without having sent CCR.
So we should have two patchsets here. And of course, the
CQT patchset should be based on the CCR patchset.
But there is no requirement to have both in a single patchset.

>>
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index cc88e5a28c8a..0d2a1206e08f 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -1393,6 +1393,10 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
>>>    		return;
>>>    	}
>>>    
>>> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
>>> +		ctrl->ciu = ((u8)(ctrl->ciu + 1)) ? : 1;
>>> +		ctrl->cirn = get_random_u64();
>>> +	}
>>>    	ctrl->csts = NVME_CSTS_RDY;
>>>    
>>>    	/*
>>> @@ -1661,6 +1665,12 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>>>    	}
>>>    	ctrl->cntlid = ret;
>>>    
>>> +	if (!nvmet_is_disc_subsys(ctrl->subsys)) {
>>> +		ctrl->cqt = subsys->cqt;
>>> +		ctrl->ciu = get_random_u8() ? : 1;
>>> +		ctrl->cirn = get_random_u64();
>>> +	}
>>> +
>>>    	/*
>>>    	 * Discovery controllers may use some arbitrary high value
>>>    	 * in order to cleanup stale discovery sessions
>>> @@ -1853,10 +1863,12 @@ struct nvmet_subsys *nvmet_subsys_alloc(const char *subsysnqn,
>>>    
>>>    	switch (type) {
>>>    	case NVME_NQN_NVME:
>>> +		subsys->cqt = NVMF_CQT_MS;
>>>    		subsys->max_qid = NVMET_NR_QUEUES;
>>>    		break;
>>
>> And I would not set the CQT default here.
>> Thing is, implementing CQT to the letter would inflict a CQT delay
>> during failover for _every_ installation, thereby resulting in a
>> regression to previous implementations where we would fail over
>> with _no_ delay.
>> So again, we should make it a different patchset.
> 
> CQT defaults to 0 to avoid introducing surprise delay. The initiator will
> skip holding requests if it sees CQT set to 0.

Precisely my point. If CQT defaults to zero no delay will be inserted,
but we _still_ have CCR handling. Just proving that both really are
independent of each other. Hence I would prefer to have two patchsets.
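To make the independence concrete, here is a rough sketch (hypothetical
helper name, not driver code) of the host-side hold decision this
thread describes: CCR and the CQT delay are separate knobs, and a zero
CQT preserves today's immediate-failover behavior.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the host-side retry-hold decision discussed in
 * this thread. If CCR confirmed the impacted controller was reset, the
 * inflight IOs are known to be terminated and can be retried
 * immediately. If CCR was not possible, the host holds requests for
 * the target's advertised Command Quiesce Time before retrying on
 * another path; a CQT of 0 means no hold, which keeps the current
 * immediate-failover behavior.
 */
static unsigned int failover_hold_ms(bool ccr_succeeded, unsigned int cqt_ms)
{
	if (ccr_succeeded)
		return 0;	/* CCR terminated the inflight IOs */
	return cqt_ms;		/* 0 keeps today's behavior */
}
```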

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-03 18:40     ` Mohamed Khalfella
@ 2026-02-04  0:38       ` Hannes Reinecke
  2026-02-04  0:44         ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  0:38 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/3/26 19:40, Mohamed Khalfella wrote:
> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
>>>    	return ctrl;
>>>    }
>>>    
>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
>>> +					   const char *hostnqn, u8 ciu,
>>> +					   u16 cntlid, u64 cirn)
>>> +{
>>> +	struct nvmet_ctrl *ctrl;
>>> +	bool found = false;
>>> +
>>> +	mutex_lock(&subsys->lock);
>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>> +		if (ctrl->cntlid != cntlid)
>>> +			continue;
>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
>>> +			continue;
>>> +
>> Why do we compare the hostnqn here, too? To my understanding the host
>> NQN is tied to the controller, so the controller ID should be sufficient
>> here.
> 
We got the cntlid from the CCR nvme command, and we do not trust the
value sent by the host. We check hostnqn to confirm that the host is
actually connected to the impacted controller. A host should not be
allowed to reset a controller connected to another host.
> 
Errm. So we're starting to not trust values in NVMe commands?
That is a very slippery road.
Ultimately it would require us to validate the cntlid on each
admin command. Which we don't.
And really there is no difference between CCR and any other
admin command; you get even worse effects if you would assume
a misdirected 'FORMAT' command.

Please don't. Security is _not_ a concern here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-02-03 18:48     ` Mohamed Khalfella
@ 2026-02-04  0:43       ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  0:43 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/3/26 19:48, Mohamed Khalfella wrote:
> On Tue 2026-02-03 04:27:39 +0100, Hannes Reinecke wrote:
>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> Send an AEN to the initiator when an impacted controller exists. The
>>> notification points to the CCR log page that the initiator can read to
>>> check which CCR operation completed.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>>    drivers/nvme/target/core.c  | 25 ++++++++++++++++++++++---
>>>    drivers/nvme/target/nvmet.h |  3 ++-
>>>    include/linux/nvme.h        |  3 +++
>>>    3 files changed, 27 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>>> index 54dd0dcfa12b..ae2fe9f90bcd 100644
>>> --- a/drivers/nvme/target/core.c
>>> +++ b/drivers/nvme/target/core.c
>>> @@ -202,7 +202,7 @@ static void nvmet_async_event_work(struct work_struct *work)
>>>    	nvmet_async_events_process(ctrl);
>>>    }
>>>    
>>> -void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> +static void nvmet_add_async_event_locked(struct nvmet_ctrl *ctrl, u8 event_type,
>>>    		u8 event_info, u8 log_page)
>>>    {
>>>    	struct nvmet_async_event *aen;
>>> @@ -215,13 +215,19 @@ void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>>    	aen->event_info = event_info;
>>>    	aen->log_page = log_page;
>>>    
>>> -	mutex_lock(&ctrl->lock);
>>>    	list_add_tail(&aen->entry, &ctrl->async_events);
>>> -	mutex_unlock(&ctrl->lock);
>>>    
>>>    	queue_work(nvmet_wq, &ctrl->async_event_work);
>>>    }
>>>    
>>> +void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
>>> +		u8 event_info, u8 log_page)
>>> +{
>>> +	mutex_lock(&ctrl->lock);
>>> +	nvmet_add_async_event_locked(ctrl, event_type, event_info, log_page);
>>> +	mutex_unlock(&ctrl->lock);
>>> +}
>>> +
>>>    static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
>>>    {
>>>    	u32 i;
>>> @@ -1788,6 +1794,18 @@ struct nvmet_ctrl *nvmet_alloc_ctrl(struct nvmet_alloc_ctrl_args *args)
>>>    }
>>>    EXPORT_SYMBOL_GPL(nvmet_alloc_ctrl);
>>>    
>>> +static void nvmet_ctrl_notify_ccr(struct nvmet_ctrl *ctrl)
>>> +{
>>> +	lockdep_assert_held(&ctrl->lock);
>>> +
>>> +	if (nvmet_aen_bit_disabled(ctrl, NVME_AEN_BIT_CCR_COMPLETE))
>>> +		return;
>>> +
>>> +	nvmet_add_async_event_locked(ctrl, NVME_AER_NOTICE,
>>> +				     NVME_AER_NOTICE_CCR_COMPLETED,
>>> +				     NVME_LOG_CCR);
>>> +}
>>> +
>>>    static void nvmet_ctrl_complete_pending_ccr(struct nvmet_ctrl *ctrl)
>>>    {
>>>    	struct nvmet_subsys *subsys = ctrl->subsys;
>>
>> But what does the CCR command actually _do_?
>> At the very least I would have expected it to trigger a controller reset
>> (eg calling into nvmet_ctrl_fatal_error()), yet I don't see it doing
>> that anywhere ...
> 
> [PATCH v2 03/14] nvmet: Implement CCR nvme command
> is where impacted controller is told to fail. It does exactly what you
> mentioned above.
> 
> +out_unlock:
> +       mutex_unlock(&sctrl->lock);
> +       if (status == NVME_SC_SUCCESS)
> +               nvmet_ctrl_fatal_error(ictrl);
> +       nvmet_ctrl_put(ictrl);
> +out:
> +       nvmet_req_complete(req, status);
> 
> I refactored the error handling codepath into the success codepath. That is
> why I think it is kind of hidden. If this is not obvious I can separate
> the two codepaths. What do you think?

Ah. Indeed, cleverly hidden.

So ignore my comment.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-04  0:38       ` Hannes Reinecke
@ 2026-02-04  0:44         ` Mohamed Khalfella
  2026-02-04  0:55           ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-04  0:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
> On 2/3/26 19:40, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> >>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> >>>    	return ctrl;
> >>>    }
> >>>    
> >>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> >>> +					   const char *hostnqn, u8 ciu,
> >>> +					   u16 cntlid, u64 cirn)
> >>> +{
> >>> +	struct nvmet_ctrl *ctrl;
> >>> +	bool found = false;
> >>> +
> >>> +	mutex_lock(&subsys->lock);
> >>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> >>> +		if (ctrl->cntlid != cntlid)
> >>> +			continue;
> >>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> >>> +			continue;
> >>> +
> >> Why do we compare the hostnqn here, too? To my understanding the host
> >> NQN is tied to the controller, so the controller ID should be sufficient
> >> here.
> > 
> > We got cntlid from CCR nvme command and we do not trust the value sent by
> > the host. We check hostnqn to confirm that host is actually connected to
> > the impacted controller. A host should not be allowed to reset a
> > controller connected to another host.
> > 
> Errm. So we're starting to not trust values in NVMe commands?
> That is a very slippery road.
> Ultimately it would require us to validate the cntlid on each
> admin command. Which we don't.
> And really there is no difference between CCR and any other
> admin command; you get even worse effects if you would assume
> a misdirected 'FORMAT' command.
> 
> Please don't. Security is _not_ a concern here.

I do not think the check hurts. If you say it is wrong I will delete it.

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-02-03 21:24     ` Mohamed Khalfella
@ 2026-02-04  0:48       ` Randy Jennings
  2026-02-04  2:57       ` Hannes Reinecke
  1 sibling, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-04  0:48 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Aaron Dailey, Dhaval Giani, linux-nvme,
	linux-kernel

On Tue, Feb 3, 2026 at 1:24 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
> > On 1/30/26 23:34, Mohamed Khalfella wrote:
> > > An alive nvme controller that hits an error now will move to FENCING
> > > state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> > > terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> > > and continue error recovery as usual. If CCR fails, the behavior depends
> > > on whether the subsystem supports CQT or not. If CQT is not supported
> > > then reset the controller immediately as if CCR succeeded in order to
> > > maintain the current behavior. If CQT is supported switch to time-based
>>> recovery. The scheduled ctrl->fenced_work resets the controller when
>>> time-based recovery finishes.
> > >
> > > Either ctrl->err_work or ctrl->reset_work can run after a controller is
>>> fenced. Flush fencing work when either work runs.
> > >
> > > Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
...
> > Here you are calling CCR whenever error recovery is triggered.
> > This will cause CCR to be sent from a command timeout, which is
> > technically wrong (CCR should be sent when the KATO timeout expires,
> > not when a command timeout expires). Both could be vastly different.
> > So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
> > leave the current command timeout mechanism in place.
Hannes, it is incorrect that CCR should be sent when the KATO timeout
expires rather than when a command timeout expires.

KATO timeout expiring is what happens on the controller, not the host.  The
controller behavior is specified because the host has to know what the
controller will do.  The host is free to decide that a connection/controller
association should be abandoned whenever the host wants to.  It can be
either a timeout on a keep alive (which is not KATO expiring) or any other
command.

But once the host has decided to abandon and tear down the connection/
controller association, it has to make sure no pending requests are still
outstanding on the controller side.  And that is either through CCR or
through time-based recovery.

So, if, after a command timeout, the host decides to cancel/abort the
command, but not tear down the association, we should not trigger a
CCR.  But, if we are tearing down the connection (and there are pending
commands), we should trigger CCR (and start time-based recovery).

Sincerely,
Randy Jennings


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-04  0:44         ` Mohamed Khalfella
@ 2026-02-04  0:55           ` Hannes Reinecke
  2026-02-04 17:52             ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  0:55 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/4/26 01:44, Mohamed Khalfella wrote:
> On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
>> On 2/3/26 19:40, Mohamed Khalfella wrote:
>>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
>>>>>     	return ctrl;
>>>>>     }
>>>>>     
>>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
>>>>> +					   const char *hostnqn, u8 ciu,
>>>>> +					   u16 cntlid, u64 cirn)
>>>>> +{
>>>>> +	struct nvmet_ctrl *ctrl;
>>>>> +	bool found = false;
>>>>> +
>>>>> +	mutex_lock(&subsys->lock);
>>>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>>>> +		if (ctrl->cntlid != cntlid)
>>>>> +			continue;
>>>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
>>>>> +			continue;
>>>>> +
>>>> Why do we compare the hostnqn here, too? To my understanding the host
>>>> NQN is tied to the controller, so the controller ID should be sufficient
>>>> here.
>>>
>>> We got cntlid from CCR nvme command and we do not trust the value sent by
>>> the host. We check hostnqn to confirm that host is actually connected to
>>> the impacted controller. A host should not be allowed to reset a
>>> controller connected to another host.
>>>
>> Errm. So we're starting to not trust values in NVMe commands?
>> That is a very slippery road.
>> Ultimately it would require us to validate the cntlid on each
>> admin command. Which we don't.
>> And really there is no difference between CCR and any other
>> admin command; you get even worse effects if you would assume
>> a misdirected 'FORMAT' command.
>>
>> Please don't. Security is _not_ a concern here.
> 
> I do not think the check hurts. If you say it is wrong I will delete it.
> 
It's not 'wrong', it's inconsistent. The argument that the contents of
an admin command may be wrong applies to _every_ admin command.
Yet we never check on any of those commands.
So I fail to see why this command requires special treatment.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-03 20:00     ` Mohamed Khalfella
@ 2026-02-04  1:10       ` Hannes Reinecke
  2026-02-04 23:24         ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  1:10 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/3/26 21:00, Mohamed Khalfella wrote:
> On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
>> On 1/30/26 23:34, Mohamed Khalfella wrote:
[ .. ]
>>> +	timeout = nvme_fence_timeout_ms(ictrl);
>>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
>>> +
>>> +	now = jiffies;
>>> +	deadline = now + msecs_to_jiffies(timeout);
>>> +	while (time_before(now, deadline)) {
>>> +		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
>>> +		if (!sctrl) {
>>> +			/* CCR failed, switch to time-based recovery */
>>> +			return deadline - now;
>>> +		}
>>> +
>>> +		ret = nvme_issue_wait_ccr(sctrl, ictrl);
>>> +		if (!ret) {
>>> +			dev_info(ictrl->device, "CCR succeeded using %s\n",
>>> +				 dev_name(sctrl->device));
>>> +			nvme_put_ctrl_ccr(sctrl);
>>> +			return 0;
>>> +		}
>>> +
>>> +		/* CCR failed, try another path */
>>> +		min_cntlid = sctrl->cntlid + 1;
>>> +		nvme_put_ctrl_ccr(sctrl);
>>> +		now = jiffies;
>>> +	}
>>
>> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
>> returns an error. _And_ if the CCR itself runs into a timeout we would
>> never have tried another path (which could have succeeded).
> 
> True. We can do one thing at a time in CCR time budget. Either wait for
> CCR to succeed or give up early and try another path. It is a trade off.
> 
Yes. But I guess my point here is that we should differentiate between
'CCR failed to be sent' and 'CCR completed with error'.
The logic above treats both the same.

>>
>> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
>> and only switch to the next controller if the submission of CCR failed.
>> Once that is done we can 'just' wait for completion, as a failure there
>> will be after KATO timeout anyway and any subsequent CCR would be pointless.
> 
> If I understood this correctly then we will stick with the first sctrl
> that accepts the CCR command. We wait for CCR to complete and give up on
> fencing ictrl if CCR operation fails or times out. Did I get this correctly?
> 
Yes.
If a CCR could be sent but the controller failed to process it, something
very odd is going on, and it's extremely questionable whether a CCR to
another controller would succeed. That's why I would switch to the
next available controller if we could not _send_ the CCR, but would
rather wait for KATO if CCR processing returned an error.

But the main point is that CCR is a way to _shorten_ the interval
(until KATO timeout) until we can start retrying commands.
If the controller ran into an error during CCR processing chances
are that quite some time has elapsed already, and we might as well
wait for KATO instead of retrying with yet another CCR.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-02-03 21:24     ` Mohamed Khalfella
  2026-02-04  0:48       ` Randy Jennings
@ 2026-02-04  2:57       ` Hannes Reinecke
  2026-02-10  1:39         ` Mohamed Khalfella
  1 sibling, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-04  2:57 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On 2/3/26 22:24, Mohamed Khalfella wrote:
> On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> An alive nvme controller that hits an error now will move to FENCING
>>> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
>>> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
>>> and continue error recovery as usual. If CCR fails, the behavior depends
>>> on whether the subsystem supports CQT or not. If CQT is not supported
>>> then reset the controller immediately as if CCR succeeded in order to
>>> maintain the current behavior. If CQT is supported switch to time-based
>>> recovery. The scheduled ctrl->fenced_work resets the controller when
>>> time-based recovery finishes.
>>>
>>> Either ctrl->err_work or ctrl->reset_work can run after a controller is
>>> fenced. Flush fencing work when either work runs.
>>>
>>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
>>> ---
>>>    drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 61 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 69cb04406b47..af8d3b36a4bb 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
>>>    	struct sockaddr_storage src_addr;
>>>    	struct nvme_ctrl	ctrl;
>>>    
>>> +	struct work_struct	fencing_work;
>>> +	struct delayed_work	fenced_work;
>>>    	struct work_struct	err_work;
>>>    	struct delayed_work	connect_work;
>>>    	struct nvme_tcp_request async_req;
>>> @@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>>>    
>>>    static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>>>    {
>>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
>>> +		dev_warn(ctrl->device, "starting controller fencing\n");
>>> +		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
>>> +		return;
>>> +	}
>>> +
>>
>> Don't you need to flush any outstanding 'fenced_work' queue items here
>> before calling 'queue_work()'?
> 
> I do not think we need to flush ctrl->fencing_work. It cannot be running
> at this time.
> 
Hmm. If you say so ... I'd rather make sure here.
These things have a habit of popping up unexpectedly.

>>
>>>    	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>>>    		return;
>>>    
>>> @@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
>>>    	nvme_tcp_reconnect_or_remove(ctrl, ret);
>>>    }
>>>    
>>> +static void nvme_tcp_fenced_work(struct work_struct *work)
>>> +{
>>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
>>> +					struct nvme_tcp_ctrl, fenced_work);
>>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>>> +
>>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
>>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
>>> +}
>>> +
>>> +static void nvme_tcp_fencing_work(struct work_struct *work)
>>> +{
>>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
>>> +			struct nvme_tcp_ctrl, fencing_work);
>>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>>> +	unsigned long rem;
>>> +
>>> +	rem = nvme_fence_ctrl(ctrl);
>>> +	if (!rem)
>>> +		goto done;
>>> +
>>> +	if (!ctrl->cqt) {
>>> +		dev_info(ctrl->device,
>>> +			 "CCR failed, CQT not supported, skip time-based recovery\n");
>>> +		goto done;
>>> +	}
>>> +
>>
>> As mentioned, cqt handling should be part of another patchset.
> 
> Let us suppose we drop cqt from this patchset
> 
> - How will we be able to calculate CCR time budget?
>    Currently it is calculated by nvme_fence_timeout_ms()
> 
The CCR time budget is calculated by the current KATO value.
CQT is the time a controller requires _after_ KATO expires
to clear out commands.

> - What should we do if CCR fails? Retry requests immediately?
> 
No. If CCR fails we should fall back to error handling.
In our case it would mean we let the command timeout handler
initate a controller reset.
_Eventually_ (ie after we implemented CCR _and_ CQT) we would
wait for KATO (+ CQT) to expire and _then_ reset the controller.

>>> +	dev_info(ctrl->device,
>>> +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
>>> +		 jiffies_to_msecs(rem));
>>> +	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
>>> +	return;
>>> +
>>
>> Why do you need the 'fenced' workqueue at all? All it does is queuing yet
>> another workqueue item, which certainly can be done from the 'fencing'
>> workqueue directly, no?
> 
> It is possible to drop ctrl->fenced_work and requeue ctrl->fencing_work
> as delayed work to implement request hold time. If we do that then we
> need to modify nvme_tcp_fencing_work() to tell if it is being called for
> 'fencing' or 'fenced'. The first version of this patch used a controller
> flag RECOVERED for that and it has been suggested to use a separate work
> to simplify the logic and drop the controller flag.
> 
But that's just because you integrated CCR within the current error 
recovery.
If you were to implement CCR as to be invoked from 
nvme_keep_alive_end_io() we would not need to touch that,
and we would need just one workqueue.

Let me see if I can draft up a patch.

>>
>>> +done:
>>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
>>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
>>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
>>> +}
>>> +
>>> +static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
>>> +{
>>> +	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
>>> +	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
>>> +}
>>> +
>>>    static void nvme_tcp_error_recovery_work(struct work_struct *work)
>>>    {
>>>    	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
>>>    				struct nvme_tcp_ctrl, err_work);
>>>    	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>>>    
>>> +	nvme_tcp_flush_fencing_work(ctrl);
>>
>> Why not 'fenced_work' ?
> 
> You mean rename nvme_tcp_flush_fencing_work() to
> nvme_tcp_flush_fenced_work()?
> 
> If yes, then I can do that if you think it makes more sense.
> 
Thing is, you have two workqueues dealing with CCR.
You really need to make sure that _both_ are flushed when this
function is called.

>>
>>>    	if (nvme_tcp_key_revoke_needed(ctrl))
>>>    		nvme_auth_revoke_tls_key(ctrl);
>>>    	nvme_stop_keep_alive(ctrl);
>>> @@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>>>    		container_of(work, struct nvme_ctrl, reset_work);
>>>    	int ret;
>>>    
>>> +	nvme_tcp_flush_fencing_work(ctrl);
>>
>> Same.
>>
>>>    	if (nvme_tcp_key_revoke_needed(ctrl))
>>>    		nvme_auth_revoke_tls_key(ctrl);
>>>    	nvme_stop_ctrl(ctrl);
>>> @@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>>>    	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
>>>    	struct nvme_command *cmd = &pdu->cmd;
>>>    	int qid = nvme_tcp_queue_id(req->queue);
>>> +	enum nvme_ctrl_state state;
>>>    
>>>    	dev_warn(ctrl->device,
>>>    		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
>>>    		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
>>>    		 nvme_fabrics_opcode_str(qid, cmd), qid);
>>>    
>>> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
>>> +	state = nvme_ctrl_state(ctrl);
>>> +	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
>>
>> 'FENCED' too, presumably?
> 
> I do not think it makes a difference here. FENCED and RESETTING are
> almost the same states.
> 
Yeah, but in FENCED all commands will be aborted, and the same action 
will be invoked when this if() clause is entered. So you need to avoid
entering the if() clause to avoid races.

>>
>>>    		/*
>>>    		 * If we are resetting, connecting or deleting we should
>>>    		 * complete immediately because we may block controller
>>> @@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
>>>    
>>>    	INIT_DELAYED_WORK(&ctrl->connect_work,
>>>    			nvme_tcp_reconnect_ctrl_work);
>>> +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
>>> +	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
>>>    	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
>>>    	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
>>>    
>>
>> Here you are calling CCR whenever error recovery is triggered.
>> This will cause CCR to be sent from a command timeout, which is
>> technically wrong (CCR should be sent when the KATO timeout expires,
>> not when a command timeout expires). Both could be vastly different.
> 
> KATO is driven by the host. What does 'KATO expires' mean?
> I think KATO expiry is more applicable to target, no?
> 
KATO expiry (for this implementation) means the nvme_keep_alive_end_io()
is called with rtt > deadline.

> KATO timeout is a signal of an error that target is not reachable or
> something is wrong with the target?
> 
Yes. It explicitly means that the Keep-Alive command was not completed
within a time period specified by the Keep-Alive timeout value.

>>
>> So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
>> leave the current command timeout mechanism in place.
> 
> Assuming we used CCR only when KATO request times out.
> What should we do when we hit other errors?
> 
Leave it. This patchset is _NOT_ about fixing the error handler.
This patchset is about implementing CCR.
And CCR is just a step toward fixing the error handler.

> nvme_tcp_error_recovery() is called from many places to handle errors,
> and it effectively resets the controller. What should this
> function do if not trigger CCR?
> 
The same things as before. CCR needs to be invoked when the Keep-Alive
timeout period expires. And that is what we should be implementing here.

It _might_ (and arguably should) be invoked when error handling needs
to be done. But in a later patchset.

Yes, this will result in a patchset where the error handler still has
gaps. But it will result in a patchset which focusses on a single thing,
with the added benefit that the workings are clearly outlined in the 
spec. So the implementation is pretty easy to review.

Hooking CCR into the error handler itself is _not_ mandated by the spec,
and inevitably leads to larger discussions (as this thread nicely 
demonstrates). So I would prefer to keep it simple first, and focus
on the technical implementation of CCR.
And concentrate on fixing up the error handler once CCR is in.

Cheers,

Hannes
>>
>> Cheers,
>>
>> Hannes
>> -- 
>> Dr. Hannes Reinecke                  Kernel Storage Architect
>> hare@suse.de                                +49 911 74053 688
>> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
>> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-04  0:55           ` Hannes Reinecke
@ 2026-02-04 17:52             ` Mohamed Khalfella
  2026-02-07 13:58               ` Sagi Grimberg
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-04 17:52 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Wed 2026-02-04 01:55:18 +0100, Hannes Reinecke wrote:
> On 2/4/26 01:44, Mohamed Khalfella wrote:
> > On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
> >> On 2/3/26 19:40, Mohamed Khalfella wrote:
> >>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> >>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
> >>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> >>>>>     	return ctrl;
> >>>>>     }
> >>>>>     
> >>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> >>>>> +					   const char *hostnqn, u8 ciu,
> >>>>> +					   u16 cntlid, u64 cirn)
> >>>>> +{
> >>>>> +	struct nvmet_ctrl *ctrl;
> >>>>> +	bool found = false;
> >>>>> +
> >>>>> +	mutex_lock(&subsys->lock);
> >>>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> >>>>> +		if (ctrl->cntlid != cntlid)
> >>>>> +			continue;
> >>>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> >>>>> +			continue;
> >>>>> +
> >>>> Why do we compare the hostnqn here, too? To my understanding the host
> >>>> NQN is tied to the controller, so the controller ID should be sufficient
> >>>> here.
> >>>
> >>> We got cntlid from CCR nvme command and we do not trust the value sent by
> >>> the host. We check hostnqn to confirm that host is actually connected to
> >>> the impacted controller. A host should not be allowed to reset a
> >>> controller connected to another host.
> >>>
> >> Errm. So we're starting to not trust values in NVMe commands?
> >> That is a very slippery road.
> >> Ultimately it would require us to validate the cntlid on each
> >> admin command. Which we don't.
> >> And really there is no difference between CCR and any other
> >> admin command; you get even worse effects if you would assume
> >> a misdirected 'FORMAT' command.
> >>
> >> Please don't. Security is _not_ a concern here.
> > 
> > I do not think the check hurts. If you say it is wrong I will delete it.
> > 
> It's not 'wrong', it's inconsistent. The argument that the contents of
> an admin command may be wrong applies to _every_ admin command.
> Yet we never check on any of those commands.
> So I fail to see why this command requires special treatment.

Okay, I will delete this check.

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-04  1:10       ` Hannes Reinecke
@ 2026-02-04 23:24         ` Mohamed Khalfella
  2026-02-11  3:44           ` Randy Jennings
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-04 23:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Wed 2026-02-04 02:10:48 +0100, Hannes Reinecke wrote:
> On 2/3/26 21:00, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
> >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> [ .. ]
> >>> +	timeout = nvme_fence_timeout_ms(ictrl);
> >>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> >>> +
> >>> +	now = jiffies;
> >>> +	deadline = now + msecs_to_jiffies(timeout);
> >>> +	while (time_before(now, deadline)) {
> >>> +		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> >>> +		if (!sctrl) {
> >>> +			/* CCR failed, switch to time-based recovery */
> >>> +			return deadline - now;
> >>> +		}
> >>> +
> >>> +		ret = nvme_issue_wait_ccr(sctrl, ictrl);
> >>> +		if (!ret) {
> >>> +			dev_info(ictrl->device, "CCR succeeded using %s\n",
> >>> +				 dev_name(sctrl->device));
> >>> +			nvme_put_ctrl_ccr(sctrl);
> >>> +			return 0;
> >>> +		}
> >>> +
> >>> +		/* CCR failed, try another path */
> >>> +		min_cntlid = sctrl->cntlid + 1;
> >>> +		nvme_put_ctrl_ccr(sctrl);
> >>> +		now = jiffies;
> >>> +	}
> >>
> >> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
> >> returns an error. _And_ if the CCR itself runs into a timeout we would
> >> never have tried another path (which could have succeeded).
> > 
> > True. We can do one thing at a time in CCR time budget. Either wait for
> > CCR to succeed or give up early and try another path. It is a trade off.
> > 
> Yes. But I guess my point here is that we should differentiate between
> 'CCR failed to be sent' and 'CCR completed with error'.
> The logic above treats both the same.
> 
> >>
> >> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
> >> and only switch to the next controller if the submission of CCR failed.
> >> Once that is done we can 'just' wait for completion, as a failure there
> >> will be after KATO timeout anyway and any subsequent CCR would be pointless.
> > 
> > If I understood this correctly then we will stick with the first sctrl
> > that accepts the CCR command. We wait for CCR to complete and give up on
> > fencing ictrl if CCR operation fails or times out. Did I get this correctly?
> > 
> Yes.
> If a CCR could be sent but the controller failed to process it, something
> very odd is going on, and it's extremely questionable whether a CCR to
> another controller would succeed. That's why I would switch to the
> next available controller if we could not _send_ the CCR, but would
> rather wait for KATO if CCR processing returned an error.
> 
> But the main point is that CCR is a way to _shorten_ the interval
> (until KATO timeout) until we can start retrying commands.
> If the controller ran into an error during CCR processing chances
> are that quite some time has elapsed already, and we might as well
> wait for KATO instead of retrying with yet another CCR.

Got it. I updated the code to do that.

> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-04  0:11     ` Mohamed Khalfella
@ 2026-02-05  0:08       ` James Smart
  2026-02-05  0:59         ` Mohamed Khalfella
  2026-02-09 22:53         ` Mohamed Khalfella
  0 siblings, 2 replies; 82+ messages in thread
From: James Smart @ 2026-02-05  0:08 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, jsmart833426

On 2/3/2026 4:11 PM, Mohamed Khalfella wrote:
> On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
>> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
...
>>>    
>>> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
>>> +					 char *errmsg)
>>> +{
>>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
>>> +		return;
>>> +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
>>> +		 ctrl->cnum, errmsg);
>>> +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
>>> +}
>>> +
>>
>> Disagree with this.
>>
>> The clause in error_recovery around the CONNECTING state is pretty
>> important to terminate io occurring during connect/reconnect where the
>> ctrl state should not change. we don't want start_ioerr making it RESETTING.
>>
>> This should be reworked.
> 
> Like you pointed out this changes the current behavior for CONNECTING
> state.
> 
> Before this change, as you pointed out the controller state stays in
> CONNECTING while all IOs are aborted. Aborting the IOs causes
> nvme_fc_create_association() to fail and reconnect might be attempted
> again.
> The new behavior switches to RESETTING and queues ctrl->ioerr_work.
> ioerr_work will abort outstanding IOs, switch back to CONNECTING and
> attempt reconnect.

Well, it won't actually switch to RESETTING, as CONNECTING->RESETTING is 
not a valid transition.  So things will silently stop in 
start_ioerr_recovery when the state transition fails (also a reason I 
dislike silent state transition failures).

When I look a little further into patch 13, I see the change to FENCING 
added. But that state transition will also fail for CONNECTING->FENCING. 
It will then fall into the resetting state change, which will silently 
fail, and we're stopped.  It says to me there was no consideration or 
testing of failures while CONNECTING with this patch set.  Even if 
RESETTING were allowed, it's injecting a new flow into the code paths.

The CONNECTING issue also applies to tcp and rdma transports. I don't 
know if they call the error_recovery routines in the same way.

To be honest I'm not sure I remember the original reasons this loop was 
put in, but I do remember pain I went through when generating it and the 
number of test cases that were needed to cover testing. It may well be 
because I couldn't invoke the reset due to the CONNECTING->RESETTING 
block.  I'm being pedantic as I still feel residual pain for it.


> 
> nvme_fc_error_recovery() ->
>    nvme_stop_keep_alive() /* should not make a difference */
>    nvme_stop_ctrl()       /* should be okay to run */
>    nvme_fc_delete_association() ->
>      __nvme_fc_abort_outstanding_ios(ctrl, false)
>      nvme_unquiesce_admin_queue()
>      nvme_unquiesce_io_queues()
>      nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
>      if (port_state == ONLINE)
>        queue_work(ctrl->connect)
>      else
>        nvme_fc_reconnect_or_delete();
> 
> Yes, this is a different behavior. IMO it is simpler to follow and
> closer to what other transports do, keeping in mind async abort nature
> of fc.
> 
> Aside from it is different, what is wrong with it?

See above.

...
>>>    static int
>>> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
>>>    		nvme_unquiesce_admin_queue(&ctrl->ctrl);
>>>    }
>>>    
>>> -static void
>>> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
>>> -{
>>> -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
>>> -
>>> -	/*
>>> -	 * if an error (io timeout, etc) while (re)connecting, the remote
>>> -	 * port requested terminating of the association (disconnect_ls)
>>> -	 * or an error (timeout or abort) occurred on an io while creating
>>> -	 * the controller.  Abort any ios on the association and let the
>>> -	 * create_association error path resolve things.
>>> -	 */
>>> -	if (state == NVME_CTRL_CONNECTING) {
>>> -		__nvme_fc_abort_outstanding_ios(ctrl, true);
>>> -		dev_warn(ctrl->ctrl.device,
>>> -			"NVME-FC{%d}: transport error during (re)connect\n",
>>> -			ctrl->cnum);
>>> -		return;
>>> -	}
>>
>> This logic needs to be preserved. Its no longer part of
>> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
>> "fenced". They should fail immediately.
> 
> I think this is similar to the point above.

Forgetting whether or not the above "works", what I'm pointing out is 
that when in CONNECTING I don't believe you should be enacting the 
FENCED state and delaying. For CONNECTING, the cleanup should be 
immediate with no delay and no CCR attempt.  Only LIVE should transition 
to FENCED.

Looking at patch 14, fencing_work calls nvme_fence_ctrl() which 
unconditionally delays and tries to do CCR. We only want this if LIVE. 
I'll comment on that patch.


>> There is a small difference here in that The existing code avoids doing
>> the ctrl reset if the controller is NEW. start_ioerr will change the
>> ctrl to RESETTING. I'm not sure how much of an impact that is.
>>
> 
> I think there is little done while controller in NEW state.
> Let me know if I am missing something.

No - I had to update my understanding; I was really out of date. NEW used 
to be what initial controller create was done under. Everybody does 
it now under CONNECTING.

...
>>>    static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>>>    {
>>>    	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
>>> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
>>>    	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
>>>    	struct nvme_command *sqe = &cmdiu->sqe;
>>>    
>>> -	/*
>>> -	 * Attempt to abort the offending command. Command completion
>>> -	 * will detect the aborted io and will fail the connection.
>>> -	 */
>>>    	dev_info(ctrl->ctrl.device,
>>>    		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
>>>    		"x%08x/x%08x\n",
>>>    		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
>>>    		nvme_fabrics_opcode_str(qnum, sqe),
>>>    		sqe->common.cdw10, sqe->common.cdw11);
>>> -	if (__nvme_fc_abort_op(ctrl, op))
>>> -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
>>>    
>>> -	/*
>>> -	 * the io abort has been initiated. Have the reset timer
>>> -	 * restarted and the abort completion will complete the io
>>> -	 * shortly. Avoids a synchronous wait while the abort finishes.
>>> -	 */
>>> +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
>>
>> Why get rid of the abort logic ?
>> Note: the error recovery/controller reset is only called when the abort
>> failed.
>>
>> I believe you should continue to abort the op.  The fence logic will
>> kick in when the op completes later (along with other io completions).
>> If nothing else, it allows a hw resource to be freed up.
> 
> The abort logic from nvme_fc_timeout() is problematic and it does not
> play well with aborts initiated from ioerr_work or reset_work. The
> problem is that an op aborted from nvme_fc_timeout() is not accounted for
> when the controller is reset.

note: I'll wait to be shown otherwise, but if this were true it would be 
horribly broken for a long time.

> 
> Here is an example scenario.
> 
> The first time a request times out, it gets aborted and we see this codepath:
> 
> nvme_fc_timeout() ->
>    __nvme_fc_abort_op() ->
>      atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
>        ops->abort()
>          return 0;

there's more than this in the code:
it changes op state to ABORTED, saving the old opstate.
if the opstate wasn't active - it means something else changed and it 
restores the old state (e.g. the aborts for the reset may have hit it).
if it was active (e.g. the aborts the reset haven't hit it yet) it 
checks the ctrl flag to see if the controller is being reset and 
tracking io termination (the TERMIO flag) and if so, increments the 
iocnt. So it is "included" in the reset.

if old state was active, it then sends the ABTS.
if old state wasn't active (we've been here before or io terminated by 
reset) it returns -ECANCELED, which will cause a controller reset to be 
attempted if there's not already one in process.


> 
> nvme_fc_timeout() always return BLK_EH_RESET_TIMER so the same request
> can timeout again. If the same request hits timeout again then
> __nvme_fc_abort_op() returns -ECANCELED and nvme_fc_error_recovery()
> gets called. Assuming the controller is LIVE it will be reset.

The normal case is timeout generates ABTS. ABTS usually completes 
quickly with the io completing and the io callback to iodone, which sees 
abort error status and resets controller. Its very typical for the ABTS 
to complete long before the 2nd EH timer timing out.

Abnormal case is ABTS takes longer to complete than the 2nd EH timer 
timing. Yes, that forces the controller reset.   I am aware that some 
arrays will delay ABTS ACC while they terminate the back end, but there 
are also frame drop conditions to consider.

if the controller is already resetting, all the above is largely n/a.

I see no reason to avoid the ABTS and wait for a 2nd EH timer to fire.

> 
> nvme_fc_reset_ctrl_work() ->
>    nvme_fc_delete_association() ->
>      __nvme_fc_abort_outstanding_ios() ->
>        nvme_fc_terminate_exchange() ->
>          __nvme_fc_abort_op()
> 
> __nvme_fc_abort_op() finds that op already aborted. As a result of that
> ctrl->iocnt will not be incremented for this op. This means that
> nvme_fc_delete_association() will not wait for this op to be aborted.

see missing code stmt above.

> 
> I do not think we want this behavior.
> 
> To continue the scenario above. The controller switches to CONNECTING
> and the request times out again. This time we hit the deadlock described
> in [1].
> 
> I think the first abort is the cause of the issue here. With this change
> we should not hit the scenario described above.
> 
> 1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/

Something else happened here. You can't get to CONNECTING state unless 
all outstanding io was reaped in delete association. What is also harder 
to understand is how there was an io to timeout if they've all been 
reaped and queues haven't been restarted.  Timeout on one of the ios to 
instantiate/init the controller maybe, but it shouldn't have been one of 
those in the blk layer.

> 
>>
>>
>>>    	return BLK_EH_RESET_TIMER;
>>>    }
>>>    
>>> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
>>>    	}
>>>    }
>>>    
>>> +static void
>>> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
>>> +{
>>> +	nvme_stop_keep_alive(&ctrl->ctrl);
>>
>> Curious, why did the stop_keep_alive() call get added to this ?
>> Doesn't hurt.
>>
>> I assume it was due to other transports having it as they originally
>> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't
>> this be followed by flush_work((&ctrl->ctrl.async_event_work) ?
> 
> Yes. I added it because it matches what other transports do.
> 
> nvme_fc_error_recovery() ->
>    nvme_fc_delete_association() ->
>      nvme_fc_abort_aen_ops() ->
>        nvme_fc_term_aen_ops() ->
>          cancel_work_sync(&ctrl->ctrl.async_event_work);
> 
> The above codepath takes care of async_event_work.

True, but the flush_works were added for a reason to the other 
transports so I'm guessing timing matters. So waiting till the later 
term_aen call isn't great.  But I also guess we haven't had an issue 
prior, and since we did take care of it in the aen routines, it's likely 
unneeded now.  Ok to add it, but if so, we should keep the flush_work as 
well. Also good to look the same as the other transports.

> 
>>
>>> +	nvme_stop_ctrl(&ctrl->ctrl);
>>> +
>>> +	/* will block while waiting for io to terminate */
>>> +	nvme_fc_delete_association(ctrl);
>>> +
>>> +	/* Do not reconnect if controller is being deleted */
>>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
>>> +		return;
>>> +
>>> +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
>>> +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
>>> +		return;
>>> +	}
>>> +
>>> +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
>>> +}
>>
>> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed
>> into a common helper function invoked by the 2 routines.  Also addresses
>> the missing flush_delayed work in this routine.
>>
> 
> Agree, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
> common code that can be refactored. However, I do not plan to do this
> part of this change. I will take a look after I get CCR work done.

Don't put it off. You are adding as much code as the refactoring is. 
Just make the change.

-- james




^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-05  0:08       ` James Smart
@ 2026-02-05  0:59         ` Mohamed Khalfella
  2026-02-09 22:53         ` Mohamed Khalfella
  1 sibling, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-05  0:59 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Wed 2026-02-04 16:08:12 -0800, James Smart wrote:
> On 2/3/2026 4:11 PM, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
> >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> ...
> >>>    
> >>> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> >>> +					 char *errmsg)
> >>> +{
> >>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> >>> +		return;
> >>> +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error recovery %s\n",
> >>> +		 ctrl->cnum, errmsg);
> >>> +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> >>> +}
> >>> +
> >>
> >> Disagree with this.
> >>
> >> The clause in error_recovery around the CONNECTING state is pretty
> >> important to terminate io occurring during connect/reconnect where the
> >> ctrl state should not change. we don't want start_ioerr making it RESETTING.
> >>
> >> This should be reworked.
> > 
> > Like you pointed out this changes the current behavior for CONNECTING
> > state.
> > 
> > Before this change, as you pointed out the controller state stays in
> > CONNECTING while all IOs are aborted. Aborting the IOs causes
> > nvme_fc_create_association() to fail and reconnect might be attempted
> > again.
> > The new behavior switches to RESETTING and queues ctrl->ioerr_work.
> > ioerr_work will abort outstanding IOs, switch back to CONNECTING and
> > attempt reconnect.
> 
> Well, it won't actually switch to RESETTING, as CONNECTING->RESETTING is 
> not a valid transition.  So things will silently stop in 
> start_ioerr_recovery when the state transition fails (also a reason I 
> dislike silent state transition failures).

You are right. I missed the fact that there is no transition from
CONNECTING to RESETTING. I need to go back and revisit this part.

> 
> When I look a little further into patch 13, I see the change to FENCING 
> added. But that state transition will also fail for CONNECTING->FENCING. 
> It will then fall into the resetting state change, which will silently 
> fail, and we're stopped.  It says to me there was no consideration or 
> testing of failures while CONNECTING with this patch set.  Even if 
> RESETTING were allowed, it's injecting a new flow into the code paths.

I tested dropping ADMIN commands on the target side to see CONNECTING
failures. I have not seen issues, but I will revisit this part.

> 
> The CONNECTING issue also applies to tcp and rdma transports. I don't 
> know if they call the error_recovery routines in the same way.
> 
> To be honest I'm not sure I remember the original reasons this loop was 
> put in, but I do remember pain I went through when generating it and the 
> number of test cases that were needed to cover testing. It may well be 
> because I couldn't invoke the reset due to the CONNECTING->RESETTING 
> block.  I'm being pedantic as I still feel residual pain for it.
> 
> 
> > 
> > nvme_fc_error_recovery() ->
> >    nvme_stop_keep_alive() /* should not make a difference */
> >    nvme_stop_ctrl()       /* should be okay to run */
> >    nvme_fc_delete_association() ->
> >      __nvme_fc_abort_outstanding_ios(ctrl, false)
> >      nvme_unquiesce_admin_queue()
> >      nvme_unquiesce_io_queues()
> >      nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
> >      if (port_state == ONLINE)
> >        queue_work(ctrl->connect)
> >      else
> >        nvme_fc_reconnect_or_delete();
> > 
> > Yes, this is a different behavior. IMO it is simpler to follow and
> > closer to what other transports do, keeping in mind async abort nature
> > of fc.
> > 
> > Aside from it is different, what is wrong with it?
> 
> See above.
> 
> ...
> >>>    static int
> >>> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >>>    		nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >>>    }
> >>>    
> >>> -static void
> >>> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> >>> -{
> >>> -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> >>> -
> >>> -	/*
> >>> -	 * if an error (io timeout, etc) while (re)connecting, the remote
> >>> -	 * port requested terminating of the association (disconnect_ls)
> >>> -	 * or an error (timeout or abort) occurred on an io while creating
> >>> -	 * the controller.  Abort any ios on the association and let the
> >>> -	 * create_association error path resolve things.
> >>> -	 */
> >>> -	if (state == NVME_CTRL_CONNECTING) {
> >>> -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> >>> -		dev_warn(ctrl->ctrl.device,
> >>> -			"NVME-FC{%d}: transport error during (re)connect\n",
> >>> -			ctrl->cnum);
> >>> -		return;
> >>> -	}
> >>
> >> This logic needs to be preserved. Its no longer part of
> >> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
> >> "fenced". They should fail immediately.
> > 
> > I think this is similar to the point above.
> 
> Forgetting whether or not the above "works", what I'm pointing out is 
> that when in CONNECTING I don't believe you should be enacting the 
> FENCED state and delaying. For CONNECTING, the cleanup should be 
> immediate with no delay and no CCR attempt.  Only LIVE should transition 
> to FENCED.
> 
> Looking at patch 14, fencing_work calls nvme_fence_ctrl() which 
> unconditionally delays and tries to do CCR. We only want this if LIVE. 
> I'll comment on that patch.
> 
> 
> >> There is a small difference here in that The existing code avoids doing
> >> the ctrl reset if the controller is NEW. start_ioerr will change the
> >> ctrl to RESETTING. I'm not sure how much of an impact that is.
> >>
> > 
> > I think there is little done while controller in NEW state.
> > Let me know if I am missing something.
> 
> No - I had to update my understanding; I was really out of date. NEW used 
> to be what initial controller create was done under. Everybody does 
> it now under CONNECTING.
> 
> ...
> >>>    static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >>>    {
> >>>    	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> >>> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >>>    	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >>>    	struct nvme_command *sqe = &cmdiu->sqe;
> >>>    
> >>> -	/*
> >>> -	 * Attempt to abort the offending command. Command completion
> >>> -	 * will detect the aborted io and will fail the connection.
> >>> -	 */
> >>>    	dev_info(ctrl->ctrl.device,
> >>>    		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >>>    		"x%08x/x%08x\n",
> >>>    		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >>>    		nvme_fabrics_opcode_str(qnum, sqe),
> >>>    		sqe->common.cdw10, sqe->common.cdw11);
> >>> -	if (__nvme_fc_abort_op(ctrl, op))
> >>> -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >>>    
> >>> -	/*
> >>> -	 * the io abort has been initiated. Have the reset timer
> >>> -	 * restarted and the abort completion will complete the io
> >>> -	 * shortly. Avoids a synchronous wait while the abort finishes.
> >>> -	 */
> >>> +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> >>
> >> Why get rid of the abort logic ?
> >> Note: the error recovery/controller reset is only called when the abort
> >> failed.
> >>
> >> I believe you should continue to abort the op.  The fence logic will
> >> kick in when the op completes later (along with other io completions).
> >> If nothing else, it allows a hw resource to be freed up.
> > 
> > The abort logic from nvme_fc_timeout() is problematic and it does not
> > play well with aborts initiated from ioerr_work or reset_work. The
> > problem is that an op aborted from nvme_fc_timeout() is not accounted for
> > when the controller is reset.
> 
> note: I'll wait to be shown otherwise, but if this were true it would be 
> horribly broken for a long time.
> 
> > 
> > Here is an example scenario.
> > 
> > The first time a request times out, it gets aborted and we see this codepath:
> > 
> > nvme_fc_timeout() ->
> >    __nvme_fc_abort_op() ->
> >      atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
> >        ops->abort()
> >          return 0;
> 
> there's more than this in the code:
> it changes op state to ABORTED, saving the old opstate.
> if the opstate wasn't active - it means something else changed and it 
> restores the old state (e.g. the aborts for the reset may have hit it).
> if it was active (e.g. the aborts the reset haven't hit it yet) it 
> checks the ctrl flag to see if the controller is being reset and 
> tracking io termination (the TERMIO flag) and if so, increments the 
> iocnt. So it is "included" in the reset.
> 
> if old state was active, it then sends the ABTS.
> if old state wasn't active (we've been here before or io terminated by 
> reset) it returns -ECANCELED, which will cause a controller reset to be 
> attempted if there's not already one in process.
> 
> 
> > 
> > nvme_fc_timeout() always return BLK_EH_RESET_TIMER so the same request
> > can timeout again. If the same request hits timeout again then
> > __nvme_fc_abort_op() returns -ECANCELED and nvme_fc_error_recovery()
> > gets called. Assuming the controller is LIVE it will be reset.
> 
> The normal case is timeout generates ABTS. ABTS usually completes 
> quickly with the io completing and the io callback to iodone, which sees 
> abort error status and resets controller. Its very typical for the ABTS 
> to complete long before the 2nd EH timer timing out.
> 
> Abnormal case is ABTS takes longer to complete than the 2nd EH timer 
> timing. Yes, that forces the controller reset.   I am aware that some 
> arrays will delay ABTS ACC while they terminate the back end, but there 
> are also frame drop conditions to consider.
> 
> if the controller is already resetting, all the above is largely n/a.
> 
> I see no reason to avoid the ABTS and wait for a 2nd EH timer to fire.
> 
> > 
> > nvme_fc_reset_ctrl_work() ->
> >    nvme_fc_delete_association() ->
> >      __nvme_fc_abort_outstanding_ios() ->
> >        nvme_fc_terminate_exchange() ->
> >          __nvme_fc_abort_op()
> > 
> > __nvme_fc_abort_op() finds that op already aborted. As a result of that
> > ctrl->iocnt will not be incremented for this op. This means that
> > nvme_fc_delete_association() will not wait for this op to be aborted.
> 
> see missing code stmt above.
> 
> > 
> > I do not think we want this behavior.
> > 
> > To continue the scenario above. The controller switches to CONNECTING
> > and the request times out again. This time we hit the deadlock described
> > in [1].
> > 
> > I think the first abort is the cause of the issue here. With this change
> > we should not hit the scenario described above.
> > 
> > 1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> 
> Something else happened here. You can't get to CONNECTING state unless 
> all outstanding io was reaped in delete association. What is also harder 
> to understand is how there was an io to timeout if they've all been 
> reaped and queues haven't been restarted.  Timeout on one of the ios to 
> instantiate/init the controller maybe, but it shouldn't have been one of 
> those in the blk layer.

I will revisit this issue and hopefully provide more information.

> 
> > 
> >>
> >>
> >>>    	return BLK_EH_RESET_TIMER;
> >>>    }
> >>>    
> >>> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >>>    	}
> >>>    }
> >>>    
> >>> +static void
> >>> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> >>> +{
> >>> +	nvme_stop_keep_alive(&ctrl->ctrl);
> >>
> >> Curious, why did the stop_keep_alive() call get added to this ?
> >> Doesn't hurt.
> >>
> >> I assume it was due to other transports having it as they originally
> >> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't
> >> this be followed by flush_work((&ctrl->ctrl.async_event_work) ?
> > 
> > Yes. I added it because it matches what other transports do.
> > 
> > nvme_fc_error_recovery() ->
> >    nvme_fc_delete_association() ->
> >      nvme_fc_abort_aen_ops() ->
> >        nvme_fc_term_aen_ops() ->
> >          cancel_work_sync(&ctrl->ctrl.async_event_work);
> > 
> > The above codepath takes care of async_event_work.
> 
> True, but the flush_works were added for a reason to the other 
> transports so I'm guessing timing matters. So waiting till the later 
> term_aen call isn't great.  But I also guess we haven't had an issue 
> prior, and since we did take care of it in the aen routines, it's likely 
> unneeded now.  Ok to add it, but if so, we should keep the flush_work as 
> well. Also good to look the same as the other transports.

It does not hurt. Maybe I am missing something. I can put it back just
to be safe.

> 
> > 
> >>
> >>> +	nvme_stop_ctrl(&ctrl->ctrl);
> >>> +
> >>> +	/* will block while waiting for io to terminate */
> >>> +	nvme_fc_delete_association(ctrl);
> >>> +
> >>> +	/* Do not reconnect if controller is being deleted */
> >>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> >>> +		return;
> >>> +
> >>> +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> >>> +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> >>> +}
> >>
> >> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed
> >> into a common helper function invoked by the 2 routines.  Also addresses
> >> the missing flush_delayed work in this routine.
> >>
> > 
> > Agree, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
> > common code that can be refactored. However, I do not plan to do this
> > part of this change. I will take a look after I get CCR work done.
> 
> Don't put it off. You are adding as much code as the refactoring is. 
> Just make the change.

Okay. I will revisit this change in light of the CONNECTING issue and see if
I can merge the two codepaths.

> 
> -- james
> 
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-02-04  0:34       ` Hannes Reinecke
@ 2026-02-07 13:41         ` Sagi Grimberg
  2026-02-14  0:42           ` Randy Jennings
  0 siblings, 1 reply; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 13:41 UTC (permalink / raw)
  To: Hannes Reinecke, Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Aaron Dailey,
	Randy Jennings, Dhaval Giani, linux-nvme, linux-kernel


> Precisely my point. If CQT defaults to zero no delay will be inserted,
> but we _still_ have CCR handling. Just proving that both really are
> independent on each other. Hence I would prefer to have two patchsets.

Agreed. I think it would be cleaner to review each separately. the CCR 
can be based
on top of the CQT patchset, for simplicity.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
  2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
  2026-02-03  3:04   ` Hannes Reinecke
@ 2026-02-07 13:47   ` Sagi Grimberg
  2026-02-11  0:50   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 13:47 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-04 17:52             ` Mohamed Khalfella
@ 2026-02-07 13:58               ` Sagi Grimberg
  2026-02-08 23:10                 ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 13:58 UTC (permalink / raw)
  To: Mohamed Khalfella, Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Aaron Dailey,
	Randy Jennings, Dhaval Giani, linux-nvme, linux-kernel



On 04/02/2026 19:52, Mohamed Khalfella wrote:
> On Wed 2026-02-04 01:55:18 +0100, Hannes Reinecke wrote:
>> On 2/4/26 01:44, Mohamed Khalfella wrote:
>>> On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
>>>> On 2/3/26 19:40, Mohamed Khalfella wrote:
>>>>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
>>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>>>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
>>>>>>>      	return ctrl;
>>>>>>>      }
>>>>>>>      
>>>>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
>>>>>>> +					   const char *hostnqn, u8 ciu,
>>>>>>> +					   u16 cntlid, u64 cirn)
>>>>>>> +{
>>>>>>> +	struct nvmet_ctrl *ctrl;
>>>>>>> +	bool found = false;
>>>>>>> +
>>>>>>> +	mutex_lock(&subsys->lock);
>>>>>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>>>>>> +		if (ctrl->cntlid != cntlid)
>>>>>>> +			continue;
>>>>>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
>>>>>>> +			continue;
>>>>>>> +
>>>>>> Why do we compare the hostnqn here, too? To my understanding the host
>>>>>> NQN is tied to the controller, so the controller ID should be sufficient
>>>>>> here.
>>>>> We got cntlid from CCR nvme command and we do not trust the value sent by
>>>>> the host. We check hostnqn to confirm that host is actually connected to
>>>>> the impacted controller. A host should not be allowed to reset a
>>>>> controller connected to another host.
>>>>>
>>>> Errm. So we're starting to not trust values in NVMe commands?
>>>> That is a very slippery road.
>>>> Ultimately it would require us to validate the cntlid on each
>>>> admin command. Which we don't.
>>>> And really there is no difference between CCR and any other
>>>> admin command; you get even worse effects if you would assume
>>>> a misdirected 'FORMAT' command.
>>>>
>>>> Please don't. Security is _not_ a concern here.
>>> I do not think the check hurts. If you say it is wrong I will delete it.
>>>
>> It's not 'wrong', It's inconsistent. The argument that the contents of
>> an admin command may be wrong applies to _every_ admin command.
>> Yet we never check on any of those commands.
>> So I fail to see why this command requires special treatment.
> Okay, I will delete this check.

It is a very different command than the other commands that nvmet
serves. Format is different because it operates on an attached
namespace, hence the host should be able to format it. If it had been
possible to access a namespace that is not mapped to a controller, then
this check would have been warranted, I think.

There have been some issues opened lately against nvme-tcp that expose
attacks that can crash the kernel with hand-crafted commands. I'd say
that this is a potential attack vector.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-01-30 22:34 ` [PATCH v2 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
  2026-02-03  3:19   ` Hannes Reinecke
@ 2026-02-07 14:11   ` Sagi Grimberg
  1 sibling, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 14:11 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

Other than Hannes' nit comments, this patch looks good.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 04/14] nvmet: Implement CCR logpage
  2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
  2026-02-03  3:21   ` Hannes Reinecke
@ 2026-02-07 14:11   ` Sagi Grimberg
  2026-02-11  1:49   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 14:11 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
  2026-02-03  3:27   ` Hannes Reinecke
@ 2026-02-07 14:12   ` Sagi Grimberg
  2026-02-11  1:52   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 14:12 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
  2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
  2026-02-03  3:28   ` Hannes Reinecke
@ 2026-02-07 14:13   ` Sagi Grimberg
  2026-02-11  1:56   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2026-02-07 14:13 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

Aside from splitting CQT to a separate patch set:
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-07 13:58               ` Sagi Grimberg
@ 2026-02-08 23:10                 ` Mohamed Khalfella
  2026-02-09 19:27                   ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-08 23:10 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Sat 2026-02-07 15:58:49 +0200, Sagi Grimberg wrote:
> 
> 
> On 04/02/2026 19:52, Mohamed Khalfella wrote:
> > On Wed 2026-02-04 01:55:18 +0100, Hannes Reinecke wrote:
> >> On 2/4/26 01:44, Mohamed Khalfella wrote:
> >>> On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
> >>>> On 2/3/26 19:40, Mohamed Khalfella wrote:
> >>>>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> >>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
> >>>>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> >>>>>>>      	return ctrl;
> >>>>>>>      }
> >>>>>>>      
> >>>>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> >>>>>>> +					   const char *hostnqn, u8 ciu,
> >>>>>>> +					   u16 cntlid, u64 cirn)
> >>>>>>> +{
> >>>>>>> +	struct nvmet_ctrl *ctrl;
> >>>>>>> +	bool found = false;
> >>>>>>> +
> >>>>>>> +	mutex_lock(&subsys->lock);
> >>>>>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> >>>>>>> +		if (ctrl->cntlid != cntlid)
> >>>>>>> +			continue;
> >>>>>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> >>>>>>> +			continue;
> >>>>>>> +
> >>>>>> Why do we compare the hostnqn here, too? To my understanding the host
> >>>>>> NQN is tied to the controller, so the controller ID should be sufficient
> >>>>>> here.
> >>>>> We got cntlid from CCR nvme command and we do not trust the value sent by
> >>>>> the host. We check hostnqn to confirm that host is actually connected to
> >>>>> the impacted controller. A host should not be allowed to reset a
> >>>>> controller connected to another host.
> >>>>>
> >>>> Errm. So we're starting to not trust values in NVMe commands?
> >>>> That is a very slippery road.
> >>>> Ultimately it would require us to validate the cntlid on each
> >>>> admin command. Which we don't.
> >>>> And really there is no difference between CCR and any other
> >>>> admin command; you get even worse effects if you would assume
> >>>> a misdirected 'FORMAT' command.
> >>>>
> >>>> Please don't. Security is _not_ a concern here.
> >>> I do not think the check hurts. If you say it is wrong I will delete it.
> >>>
> >> It's not 'wrong', It's inconsistent. The argument that the contents of
> >> an admin command may be wrong applies to _every_ admin command.
> >> Yet we never check on any of those commands.
> >> So I fail to see why this command requires special treatment.
> > Okay, I will delete this check.
> 
> It is a very different command than other commands that nvmet serves. Format
> is different because it is an attached namespace, hence the host should 
> be able
> to format it. If it would have been possible to access a namespace that 
> is not mapped
> to a controller, then this check would have been warranted I think.
> 
> There have been some issues lately opened on nvme-tcp that expose 
> attacks that can
> crash the kernel with some hand-crafted commands, I'd say that this is a 
> potential attack vector.

For an attacker to exploit the CCR command they would have to guess both
CIU (8-bit) and CIRN (64-bit) random values correctly. I do not see how
an attacker can find these values without being connected to the
impacted controller.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-08 23:10                 ` Mohamed Khalfella
@ 2026-02-09 19:27                   ` Mohamed Khalfella
  2026-02-11  1:34                     ` Randy Jennings
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-09 19:27 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Sun 2026-02-08 15:10:36 -0800, Mohamed Khalfella wrote:
> On Sat 2026-02-07 15:58:49 +0200, Sagi Grimberg wrote:
> > 
> > 
> > On 04/02/2026 19:52, Mohamed Khalfella wrote:
> > > On Wed 2026-02-04 01:55:18 +0100, Hannes Reinecke wrote:
> > >> On 2/4/26 01:44, Mohamed Khalfella wrote:
> > >>> On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
> > >>>> On 2/3/26 19:40, Mohamed Khalfella wrote:
> > >>>>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> > >>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > >>>>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> > >>>>>>>      	return ctrl;
> > >>>>>>>      }
> > >>>>>>>      
> > >>>>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> > >>>>>>> +					   const char *hostnqn, u8 ciu,
> > >>>>>>> +					   u16 cntlid, u64 cirn)
> > >>>>>>> +{
> > >>>>>>> +	struct nvmet_ctrl *ctrl;
> > >>>>>>> +	bool found = false;
> > >>>>>>> +
> > >>>>>>> +	mutex_lock(&subsys->lock);
> > >>>>>>> +	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > >>>>>>> +		if (ctrl->cntlid != cntlid)
> > >>>>>>> +			continue;
> > >>>>>>> +		if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> > >>>>>>> +			continue;
> > >>>>>>> +
> > >>>>>> Why do we compare the hostnqn here, too? To my understanding the host
> > >>>>>> NQN is tied to the controller, so the controller ID should be sufficient
> > >>>>>> here.
> > >>>>> We got cntlid from CCR nvme command and we do not trust the value sent by
> > >>>>> the host. We check hostnqn to confirm that host is actually connected to
> > >>>>> the impacted controller. A host should not be allowed to reset a
> > >>>>> controller connected to another host.
> > >>>>>
> > >>>> Errm. So we're starting to not trust values in NVMe commands?
> > >>>> That is a very slippery road.
> > >>>> Ultimately it would require us to validate the cntlid on each
> > >>>> admin command. Which we don't.
> > >>>> And really there is no difference between CCR and any other
> > >>>> admin command; you get even worse effects if you would assume
> > >>>> a misdirected 'FORMAT' command.
> > >>>>
> > >>>> Please don't. Security is _not_ a concern here.
> > >>> I do not think the check hurts. If you say it is wrong I will delete it.
> > >>>
> > >> It's not 'wrong', It's inconsistent. The argument that the contents of
> > >> an admin command may be wrong applies to _every_ admin command.
> > >> Yet we never check on any of those commands.
> > >> So I fail to see why this command requires special treatment.
> > > Okay, I will delete this check.
> > 
> > It is a very different command than other commands that nvmet serves. Format
> > is different because it is an attached namespace, hence the host should 
> > be able
> > to format it. If it would have been possible to access a namespace that 
> > is not mapped
> > to a controller, then this check would have been warranted I think.
> > 
> > There have been some issues lately opened on nvme-tcp that expose 
> > attacks that can
> > crash the kernel with some hand-crafted commands, I'd say that this is a 
> > potential attack vector.
> 
> For an attacker to exploit the CCR command they would have to guess both
> CIU (8-bit) and CIRN (64-bit) random values correctly. I do not see how
> an attacker can find these values without being connected to the
> impacted controller.

I changed the ciu and cirn sysfs controller attributes on the host from
S_IRUGO to S_IRUSR. This should mitigate an attack by an unprivileged
process running on the host. On the target side the debugfs files
already use S_IRUSR, so no change is needed.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset
  2026-02-05  0:08       ` James Smart
  2026-02-05  0:59         ` Mohamed Khalfella
@ 2026-02-09 22:53         ` Mohamed Khalfella
  1 sibling, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-09 22:53 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Wed 2026-02-04 16:08:12 -0800, James Smart wrote:
> On 2/3/2026 4:11 PM, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 11:19:28 -0800, James Smart wrote:
> >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> ...
> >>>    
> >>> +static void nvme_fc_start_ioerr_recovery(struct nvme_fc_ctrl *ctrl,
> >>> +					 char *errmsg)
> >>> +{
> >>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
> >>> +		return;
> >>   > +> +	dev_warn(ctrl->ctrl.device, "NVME-FC{%d}: starting error
> >> recovery %s\n",
> >>> +		 ctrl->cnum, errmsg);
> >>> +	queue_work(nvme_reset_wq, &ctrl->ioerr_work);
> >>> +}
> >>> +
> >>
> >> Disagree with this.
> >>
> >> The clause in error_recovery around the CONNECTING state is pretty
> >> important to terminate io occurring during connect/reconnect where the
> >> ctrl state should not change. we don't want start_ioerr making it RESETTING.
> >>
> >> This should be reworked.
> > 
> > Like you pointed out this changes the current behavior for CONNECTING
> > state.
> > 
> > Before this change, as you pointed out the controller state stays in
> > CONNECTING while all IOs are aborted. Aborting the IOs causes
> > nvme_fc_create_association() to fail and reconnect might be attempted
> > again.
> > The new behavior switches to RESETTING and queues ctr->ioerr_work.
> > ioerr_work will abort oustanding IOs, swich back to CONNECING and
> > attempt reconnect.
> 
> Well, it won't actually switch to RESETTING, as CONNECTING->RESETTING is 
> not a valid transition.  So things will silently stop in 
> start_ioerr_recovery when the state transition fails (also a reason I 
> dislike silent state transition failures).

I changed nvme_fc_start_ioerr_recovery() such that if the controller is
in CONNECTING state it will queue ctrl->ioerr_work without changing the
controller state. This should address the issue. I will do more testing
before putting this change in the next version of the patchset.

> 
> When I look a little further into patch 13, I see the change to FENCING 
> added. But that state transition will also fail for CONNECTING->FENCING. 
> It will then fall into the resetting state change, which will silently 
> fail, and we're stopped.  It says to me there was no consideration or 
> testing of failures while CONNECTING with this patch set.  Even if 
> RESETTING were allowed, its injecting a new flow into the code paths.

This should be addressed by the change above.

> 
> The CONNECTING issue also applies to tcp and rdma transports. I don't 
> know if they call the error_recovery routines in the same way.

The other fabric transports, tcp and rdma, rely on the timeout callback
to complete the timed-out request with an error, without changing the
controller state. This is what tcp does, for example.

nvme_tcp_timeout() ->
  nvme_tcp_complete_timed_out() ->
    blk_mq_complete_request()

This causes the connect work to notice the error and fail. Then it
should decide whether to try connecting again or delete the controller.

> 
> To be honest I'm not sure I remember the original reasons this loop was 
> put in, but I do remember pain I went through when generating it and the 
> number of test cases that were needed to cover testing. It may well be 
> because I couldn't invoke the reset due to the CONNECTING->RESETTING 
> block.  I'm being pedantic as I still feel residual pain for it.
> 
> 
> > 
> > nvme_fc_error_recovery() ->
> >    nvme_stop_keep_alive() /* should not make a difference */
> >    nvme_stop_ctrl()       /* should be okay to run */
> >    nvme_fc_delete_association() ->
> >      __nvme_fc_abort_outstanding_ios(ctrl, false)
> >      nvme_unquiesce_admin_queue()
> >      nvme_unquiesce_io_queues()
> >      nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)
> >      if (port_state == ONLINE)
> >        queue_work(ctrl->connect)
> >      else
> >        nvme_fc_reconnect_or_delete();
> > 
> > Yes, this is a different behavior. IMO it is simpler to follow and
> > closer to what other transports do, keeping in mind async abort nature
> > of fc.
> > 
> > Aside from it is different, what is wrong with it?
> 
> See above.
> 
> ...
> >>>    static int
> >>> @@ -2495,39 +2506,6 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
> >>>    		nvme_unquiesce_admin_queue(&ctrl->ctrl);
> >>>    }
> >>>    
> >>> -static void
> >>> -nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
> >>> -{
> >>> -	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
> >>> -
> >>> -	/*
> >>> -	 * if an error (io timeout, etc) while (re)connecting, the remote
> >>> -	 * port requested terminating of the association (disconnect_ls)
> >>> -	 * or an error (timeout or abort) occurred on an io while creating
> >>> -	 * the controller.  Abort any ios on the association and let the
> >>> -	 * create_association error path resolve things.
> >>> -	 */
> >>> -	if (state == NVME_CTRL_CONNECTING) {
> >>> -		__nvme_fc_abort_outstanding_ios(ctrl, true);
> >>> -		dev_warn(ctrl->ctrl.device,
> >>> -			"NVME-FC{%d}: transport error during (re)connect\n",
> >>> -			ctrl->cnum);
> >>> -		return;
> >>> -	}
> >>
> >> This logic needs to be preserved. Its no longer part of
> >> nvme_fc_start_ioerr_recovery(). Failures during CONNECTING should not be
> >> "fenced". They should fail immediately.
> > 
> > I think this is similar to the point above.
> 
> Forgetting whether or not the above "works", what I'm pointing out is 
> that when in CONNECTING I don't believe you should be enacting the 
> FENCED state and delaying. For CONNECTING, the cleanup should be 
> immediate with no delay and no CCR attempt.  Only LIVE should transition 
> to FENCED.

Agreed. FENCING state is entered only from LIVE state.

> 
> Looking at patch 14, fencing_work calls nvme_fence_ctrl() which 
> unconditionally delays and tries to do CCR. We only want this if LIVE. 
> I'll comment on that patch.

Yes. Fencing only applies to live controllers, where we expect to have
inflight IO requests. Failure of a controller in other states should not
result in delaying requests.

> 
> 
> >> There is a small difference here in that The existing code avoids doing
> >> the ctrl reset if the controller is NEW. start_ioerr will change the
> >> ctrl to RESETTING. I'm not sure how much of an impact that is.
> >>
> > 
> > I think there is little done while controller in NEW state.
> > Let me know if I am missing something.
> 
> No - I had to update my understanding I was really out of date. Used to 
> be NEW is what initial controller create was done under. Everybody does 
> it now under CONNECTING.
> 
> ...
> >>>    static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >>>    {
> >>>    	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(rq);
> >>> @@ -2536,24 +2514,14 @@ static enum blk_eh_timer_return nvme_fc_timeout(struct request *rq)
> >>>    	struct nvme_fc_cmd_iu *cmdiu = &op->cmd_iu;
> >>>    	struct nvme_command *sqe = &cmdiu->sqe;
> >>>    
> >>> -	/*
> >>> -	 * Attempt to abort the offending command. Command completion
> >>> -	 * will detect the aborted io and will fail the connection.
> >>> -	 */
> >>>    	dev_info(ctrl->ctrl.device,
> >>>    		"NVME-FC{%d.%d}: io timeout: opcode %d fctype %d (%s) w10/11: "
> >>>    		"x%08x/x%08x\n",
> >>>    		ctrl->cnum, qnum, sqe->common.opcode, sqe->fabrics.fctype,
> >>>    		nvme_fabrics_opcode_str(qnum, sqe),
> >>>    		sqe->common.cdw10, sqe->common.cdw11);
> >>> -	if (__nvme_fc_abort_op(ctrl, op))
> >>> -		nvme_fc_error_recovery(ctrl, "io timeout abort failed");
> >>>    
> >>> -	/*
> >>> -	 * the io abort has been initiated. Have the reset timer
> >>> -	 * restarted and the abort completion will complete the io
> >>> -	 * shortly. Avoids a synchronous wait while the abort finishes.
> >>> -	 */
> >>> +	nvme_fc_start_ioerr_recovery(ctrl, "io timeout");
> >>
> >> Why get rid of the abort logic ?
> >> Note: the error recovery/controller reset is only called when the abort
> >> failed.
> >>
> >> I believe you should continue to abort the op.  The fence logic will
> >> kick in when the op completes later (along with other io completions).
> >> If nothing else, it allows a hw resource to be freed up.
> > 
> > The abort logic from nvme_fc_timeout() is problematic and it does not
> > play well with abort initiatored from ioerr_work or reset_work. The
> > problem is that op aborted from nvme_fc_timeout() is not accounted for
> > when the controller is reset.
> 
> note: I'll wait to be shown otherwise, but if this were true it would be 
> horribly broken for a long time.

I think calling __nvme_fc_abort_op() from nvme_fc_timeout() causes the
reset to not account for the aborted requests while waiting for IO to
complete. This results in the controller transitioning
RESETTING->CONNECTING->LIVE while some requests are still inflight.

I tested kernel version 8dfce8991b95d8625d0a1d2896e42f93b9d7f68d

[  839.513016] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x100070b7e41ed630  rport wwpn 0x524a937d171c5e08: NQN "nqn.2010-06.com.purestorage:flasharray.5452b4ac21023d99"
[  839.537061] nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
[  839.584019] nvme nvme0: NVME-FC{0}: controller connect complete
[  839.592280] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2010-06.com.purestorage:flasharray.5452b4ac21023d99", hostnqn: nqn.2014-08.org.nvmexpress:uuid:e622dc39-ebed-4641-9d5c-6936ac234566
[  891.507377] nvme nvme0: NVME-FC{0.4}: io timeout: opcode 2 fctype 10 (Read) w10/11: x00000000/x00000000, rq = ff110020edc3aa80, op = ff110020edc3ab80, cmdid = 17
[  891.524718] nvme nvme0: __nvme_fc_abort_op: op = ff110020edc3ab80
[  891.533355] nvme nvme0: NVME-FC{0}: transport association event: transport detected io error
[  891.543575] nvme nvme0: NVME-FC{0}: resetting controller
[  891.560361] nvme nvme0: __nvme_fc_abort_op: op = ff110021c3c04980
[  891.572329] nvme_ns_head_submit_bio: 10 callbacks suppressed
[  891.572335] block nvme0n1: no usable path - requeuing I/O
[  891.587334] nvme nvme0: __nvme_fc_abort_op: op = ff1100217375c5d0
[  891.603886] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x100070b7e41ed630  rport wwpn 0x524a937d171c5e08: NQN "nqn.2010-06.com.purestorage:flasharray.5452b4ac21023d99"
[  891.652728] nvme nvme0: NVME-FC{0}: controller connect complete


              dd-3571    [075] .....   862.875279: nvme_setup_cmd: nvme0: disk=nvme0c0n1, qid=4, cmdid=17, nsid=10, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=0, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
  kworker/u385:6-611     [009] .....   871.458388: nvme_setup_cmd: nvme0: qid=0, cmdid=12289, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_keep_alive cdw10=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    ksoftirqd/10-79      [010] ..s..   871.458587: nvme_complete_rq: nvme0: qid=0, cmdid=12289, res=0x0, retries=0, flags=0x0, status=0x0
  kworker/u386:0-3234    [027] .....   886.818707: nvme_setup_cmd: nvme0: qid=0, cmdid=16384, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_keep_alive cdw10=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    ksoftirqd/28-188     [028] ..s..   886.818907: nvme_complete_rq: nvme0: qid=0, cmdid=16384, res=0x0, retries=0, flags=0x0, status=0x0
 irq/734-lpfc:27-1475    [028] .....   892.998612: nvme_complete_rq: nvme0: disk=nvme0c0n1, qid=4, cmdid=17, res=0x46, retries=0, flags=0x4, status=0x371

The target has been configured to drop IO requests while allowing admin
requests to complete. io_timeout is 30s and the keepalive interval is 5s.

op ff110020edc3ab80 (cmdid = 17) was issued at 862.875279. The request
timed out at 891.507377. Another transport error was detected at
891.533355, causing a controller reset for this live controller. A
couple of ops were aborted because of the controller reset and the
controller connected again. Only at 892.998612 did cmdid 17 complete.

Removing __nvme_fc_abort_op() from nvme_fc_timeout() should avoid this
problem. When we hit an error we jump straight to error recovery. This
way we account for all aborted requests and wait for them to complete.

> 
> > 
> > Here is an example scenario.
> > 
> > The first time a request times out it gets aborted we see this codepath
> > 
> > nvme_fc_timeout() ->
> >    __nvme_fc_abort_op() ->
> >      atomic_xchg(&op->state, FCPOP_STATE_ABORTED)
> >        ops->abort()
> >          return 0;
> 
> there's more than this in in the code:
> it changes op state to ABORTED, saving the old opstate.
> if the opstate wasn't active - it means something else changed and it 
> restores the old state (e.g. the aborts for the reset may have hit it).
> if it was active (e.g. the aborts the reset haven't hit it yet) it 
> checks the ctlr flag to see if the controller is being reset and 
> tracking io termination (the TERMIO flag) and if so, increments the 
> iocnt. So it is "included" in the reset.
> 
> if old state was active, it then sends the ABTS.
> if old state wasn't active (we've been here before or io terminated by 
> reset) it returns -ECANCELED, which will cause a controller reset to be 
> attempted if there's not already one in process.
> 
> 
> > 
> > nvme_fc_timeout() always return BLK_EH_RESET_TIMER so the same request
> > can timeout again. If the same request hits timeout again then
> > __nvme_fc_abort_op() returns -ECANCELED and nvme_fc_error_recovery()
> > gets called. Assuming the controller is LIVE it will be reset.
> 
> The normal case is timeout generates ABTS. ABTS usually completes 
> quickly with the io completing and the io callback to iodone, which sees 
> abort error status and resets controller. Its very typical for the ABTS 
> to complete long before the 2nd EH timer timing out.
> 
> Abnormal case is ABTS takes longer to complete than the 2nd EH timer 
> timing. Yes, that forces the controller reset.   I am aware that some 
> arrays will delay ABTS ACC while they terminate the back end, but there 
> are also frame drop conditions to consider.
> 
> if the controller is already resetting, all the above is largely n/a.
> 
> I see no reason to avoid the ABTS and wait for a 2nd EH timer to fire.
> 
> > 
> > nvme_fc_reset_ctrl_work() ->
> >    nvme_fc_delete_association() ->
> >      __nvme_fc_abort_outstanding_ios() ->
> >        nvme_fc_terminate_exchange() ->
> >          __nvme_fc_abort_op()
> > 
> > __nvme_fc_abort_op() finds that op already aborted. As a result of that
> > ctrl->iocnt will not be incrmented for this op. This means that
> > nvme_fc_delete_association() will not wait for this op to be aborted.
> 
> see missing code stmt above.
> 
> > 
> > I do not think we wait this behavior.
> > 
> > To continue the scenario above. The controller switches to CONNECTING
> > and the request times out again. This time we hit the deadlock described
> > in [1].
> > 
> > I think the first abort is the cause of the issue here. with this change
> > we should not hit the scenario described above.
> > 
> > 1 - https://lore.kernel.org/all/20250529214928.2112990-1-mkhalfella@purestorage.com/
> 
> Something else happened here. You can't get to CONNECTING state unless 
> all outstanding io was reaped in delete association. What is also harder 
> to understand is how there was an io to timeout if they've all been 
> reaped and queues haven't been restarted.  Timeout on one of the ios to 
> instatiate/init the controller maybe, but it shouldn't have been one of 
> those in the blk layer.

I think it is possible to get to CONNECTING state while IOs are still
inflight because aborted requests are not counted, as described above.

[ 3166.239944] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x100070b7e41ed630  rport wwpn 0x524a937d171c5e08: NQN "nqn.2010-06.com.purestorage:flasharray.5452b4ac21023d99"
[ 3176.327597] nvme nvme0: NVME-FC{0.0}: io timeout: opcode 6 fctype 0 (Identify) w10/11: x00000001/x00000000, rq = ff110001f4292300, op = ff110001f4292400, cmdid = 14
[ 3176.345180] nvme nvme0: __nvme_fc_abort_op: op = ff110001f4292400
[ 3186.373826] nvme nvme0: NVME-FC{0.0}: io timeout: opcode 6 fctype 0 (Identify) w10/11: x00000001/x00000000, rq = ff110001f4292300, op = ff110001f4292400, cmdid = 14

  kworker/u386:9-4677    [081] .....  3167.681805: nvme_setup_cmd: nvme0: qid=0, cmdid=14, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_identify cns=1, ctrlid=0)
    ksoftirqd/34-225     [034] ..s..  3219.097132: nvme_complete_rq: nvme0: qid=0, cmdid=14, res=0x22, retries=0, flags=0x0, status=0x371

This is another way to repro the deadlock above. admin_timeout is set to
10s. The Identify request was issued at 3167.681805 and timed out twice.
The second time it was not aborted and we head straight into the
deadlock. Eventually the HBA did abort the request, but it was too late
by then.

[ 3325.117253] INFO: task kworker/3:4H:3337 blocked for more than 122 seconds.
[ 3325.125600]       Tainted: G S          E       6.19.0-rc7-vanilla+ #1
[ 3325.133408] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3325.142643] task:kworker/3:4H    state:D stack:0     pid:3337  tgid:3337  ppid:2      task_flags:0x4208060 flags:0x00080000
[ 3325.155587] Workqueue: kblockd blk_mq_timeout_work
[ 3325.161412] Call Trace:
[ 3325.164589]  <TASK>
[ 3325.167389]  __schedule+0x81f/0x1230
[ 3325.195863]  schedule+0xcd/0x250
[ 3325.199864]  schedule_timeout+0x163/0x240
[ 3325.242655]  wait_for_completion+0x16f/0x3c0
[ 3325.263211]  __flush_work+0xfe/0x190
[ 3325.287450]  cancel_work_sync+0x82/0xb0
[ 3325.292094]  __nvme_fc_abort_outstanding_ios+0x1da/0x320 [nvme_fc]
[ 3325.299326]  nvme_fc_timeout+0x482/0x4f0 [nvme_fc]
[ 3325.317080]  blk_mq_handle_expired+0x194/0x2d0
[ 3325.328586]  bt_iter+0x24e/0x380
[ 3325.336998]  blk_mq_queue_tag_busy_iter+0x6ec/0x1660
[ 3325.360841]  blk_mq_timeout_work+0x3f0/0x5c0
[ 3325.385455]  process_one_work+0x82c/0x1450
[ 3325.404744]  worker_thread+0x5ee/0xfd0
[ 3325.414173]  kthread+0x3a0/0x750
[ 3325.441260]  ret_from_fork+0x439/0x670
[ 3325.459645]  ret_from_fork_asm+0x1a/0x30
[ 3325.464267]  </TASK>

> 
> > 
> >>
> >>
> >>>    	return BLK_EH_RESET_TIMER;
> >>>    }
> >>>    
> >>> @@ -3352,6 +3320,26 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
> >>>    	}
> >>>    }
> >>>    
> >>> +static void
> >>> +nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl)
> >>> +{
> >>> +	nvme_stop_keep_alive(&ctrl->ctrl);
> >>
> >> Curious, why did the stop_keep_alive() call get added to this ?
> >> Doesn't hurt.
> >>
> >> I assume it was due to other transports having it as they originally
> >> were calling stop_ctrl, but then moved to stop_keep_alive. Shouldn't
> >> this be followed by flush_work((&ctrl->ctrl.async_event_work) ?
> > 
> > Yes. I added it because it matches what other transports do.
> > 
> > nvme_fc_error_recovery() ->
> >    nvme_fc_delete_association() ->
> >      nvme_fc_abort_aen_ops() ->
> >        nvme_fc_term_aen_ops() ->
> >          cancel_work_sync(&ctrl->ctrl.async_event_work);
> > 
> > The above codepath takes care of async_event_work.
> 
> True, but the flush_works were added for a reason to the other 
> transports so I'm guessing timing matters. So waiting till the later 
> term_aen call isn't great.  But I also guess, we haven't had an issue 
> prior and since we did take care of it in the aen routines, it's likely 
> unneeded now.  Ok to add it but if so, we should keep the flush_work as 
> well. Also good to look same as the other transports.

I added flush_work(&ctrl->ctrl.async_event_work) to nvme_fc_error_recovery()
just after nvme_stop_ctrl(&ctrl->ctrl).

> 
> > 
> >>
> >>> +	nvme_stop_ctrl(&ctrl->ctrl);
> >>> +
> >>> +	/* will block while waiting for io to terminate */
> >>> +	nvme_fc_delete_association(ctrl);
> >>> +
> >>> +	/* Do not reconnect if controller is being deleted */
> >>> +	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING))
> >>> +		return;
> >>> +
> >>> +	if (ctrl->rport->remoteport.port_state == FC_OBJSTATE_ONLINE) {
> >>> +		queue_delayed_work(nvme_wq, &ctrl->connect_work, 0);
> >>> +		return;
> >>> +	}
> >>> +
> >>> +	nvme_fc_reconnect_or_delete(ctrl, -ENOTCONN);
> >>> +}
> >>
> >> This code and that in nvme_fc_reset_ctrl_work() need to be collapsed
> >> into a common helper function invoked by the 2 routines.  Also addresses
> >> the missing flush_delayed work in this routine.
> >>

flush_delayed_work(&ctrl->connect_work) has been added to
nvme_fc_reset_ctrl_work() in commit ac9b820e713b ("nvme-fc: remove
nvme_fc_terminate_io()"). I understood the idea is to make reset wait
for at least the first connect attempt.

This should not be applicable to error recovery, right?

> > 
> > Agree, nvme_fc_error_recovery() and nvme_fc_reset_ctrl_work() have
> > common code that can be refactored. However, I do not plan to do this
> > part of this change. I will take a look after I get CCR work done.
> 
> Don't put it off. You are adding as much code as the refactoring is. 
> Just make the change.
> 
> -- james
> 
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error
  2026-02-04  2:57       ` Hannes Reinecke
@ 2026-02-10  1:39         ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-10  1:39 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, linux-nvme,
	linux-kernel

On Wed 2026-02-04 03:57:20 +0100, Hannes Reinecke wrote:
> On 2/3/26 22:24, Mohamed Khalfella wrote:
> > On Tue 2026-02-03 06:34:51 +0100, Hannes Reinecke wrote:
> >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> >>> An alive nvme controller that hits an error now will move to FENCING
> >>> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> >>> terminate inflight IOs. If CCR succeeds, switch to FENCED -> RESETTING
> >>> and continue error recovery as usual. If CCR fails, the behavior depends
> >>> on whether the subsystem supports CQT or not. If CQT is not supported
> >>> then reset the controller immediately as if CCR succeeded in order to
> >>> maintain the current behavior. If CQT is supported switch to time-based
> >>> recovery. The scheduled ctrl->fenced_work resets the controller when
> >>> time-based recovery finishes.
> >>>
> >>> Either ctrl->err_work or ctrl->reset_work can run after a controller is
> >>> fenced. Flush fencing work when either work runs.
> >>>
> >>> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> >>> ---
> >>>    drivers/nvme/host/tcp.c | 62 ++++++++++++++++++++++++++++++++++++++++-
> >>>    1 file changed, 61 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> >>> index 69cb04406b47..af8d3b36a4bb 100644
> >>> --- a/drivers/nvme/host/tcp.c
> >>> +++ b/drivers/nvme/host/tcp.c
> >>> @@ -193,6 +193,8 @@ struct nvme_tcp_ctrl {
> >>>    	struct sockaddr_storage src_addr;
> >>>    	struct nvme_ctrl	ctrl;
> >>>    
> >>> +	struct work_struct	fencing_work;
> >>> +	struct delayed_work	fenced_work;
> >>>    	struct work_struct	err_work;
> >>>    	struct delayed_work	connect_work;
> >>>    	struct nvme_tcp_request async_req;
> >>> @@ -611,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
> >>>    
> >>>    static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> >>>    {
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> >>> +		dev_warn(ctrl->device, "starting controller fencing\n");
> >>> +		queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> >>> +		return;
> >>> +	}
> >>> +
> >>
> >> Don't you need to flush any outstanding 'fenced_work' queue items here
> >> before calling 'queue_work()'?
> > 
> > I do not think we need to flush ctrl->fencing_work. It cannot be running
> > at this time.
> > 
> Hmm. If you say so ... I'd rather make sure here.
> These things have a habit of popping up unexpectedly.

A LIVE controller that hits an error queues ctrl->fencing_work.
fencing_work will do CCR. If CCR succeeds then it will queue
ctrl->err_work. If CCR fails then it will queue ctrl->fenced_work.
ctrl->err_work flushes both fencing_work and fenced_work before doing
anything. In the event that a controller is reset just after the
FENCING->FENCED transition but before FENCED->RESETTING, the reset work
will run. ctrl->reset_work also flushes both work items.

In all cases these work items should be flushed before going to the
CONNECTING state. That means by the time they are queued again they
should not be running.

> 
> >>
> >>>    	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>>    		return;
> >>>    
> >>> @@ -2470,12 +2478,59 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> >>>    	nvme_tcp_reconnect_or_remove(ctrl, ret);
> >>>    }
> >>>    
> >>> +static void nvme_tcp_fenced_work(struct work_struct *work)
> >>> +{
> >>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
> >>> +					struct nvme_tcp_ctrl, fenced_work);
> >>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>> +
> >>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> >>> +}
> >>> +
> >>> +static void nvme_tcp_fencing_work(struct work_struct *work)
> >>> +{
> >>> +	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> >>> +			struct nvme_tcp_ctrl, fencing_work);
> >>> +	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>> +	unsigned long rem;
> >>> +
> >>> +	rem = nvme_fence_ctrl(ctrl);
> >>> +	if (!rem)
> >>> +		goto done;
> >>> +
> >>> +	if (!ctrl->cqt) {
> >>> +		dev_info(ctrl->device,
> >>> +			 "CCR failed, CQT not supported, skip time-based recovery\n");
> >>> +		goto done;
> >>> +	}
> >>> +
> >>
> >> As mentioned, cqt handling should be part of another patchset.
> > 
> > Let us suppose we drop cqt from this patchset
> > 
> > - How will we be able to calculate CCR time budget?
> >    Currently it is calculated by nvme_fence_timeout_ms()
> > 
> The CCR time budget is calculated by the current KATO value.
> CQT is the time a controller requires _after_ KATO expires
> to clear out commands.

So it is the same formula as in nvme_fence_timeout_ms() with cqt=0?

> 
> > - What should we do if CCR fails? Retry requests immediately?
> > 
> No. If CCR fails we should fall back to error handling.
> In our case it would mean we let the command timeout handler
> initiate a controller reset.

Assuming that we are not implementing the CQT delay, then we should just
retry requests after CCR finishes, regardless of CCR success or failure.
What else can we do if we do not want to hold requests?

What is being suggested is to transition the controller to FENCED state
and leave it there. The timeout callback handler should trigger error
recovery, which transitions to RESETTING as usual. Did I get this part
right?

If this is correct, then why wait for another timeout, which can be 60
seconds for admin requests? We know the controller should be reset, so I
think we should do what the code does now and proceed with the reset.

> _Eventually_ (ie after we implemented CCR _and_ CQT) we would
> wait for KATO (+ CQT) to expire and _then_ reset the controller.

This is what has been implemented in this patchset, right?

> 
> >>> +	dev_info(ctrl->device,
> >>> +		 "CCR failed, switch to time-based recovery, timeout = %ums\n",
> >>> +		 jiffies_to_msecs(rem));
> >>> +	queue_delayed_work(nvme_wq, &tcp_ctrl->fenced_work, rem);
> >>> +	return;
> >>> +
> >>
> >> Why do you need the 'fenced' workqueue at all? All it does is queueing yet
> >> another workqueue item, which certainly can be done from the 'fencing'
> >> workqueue directly, no?
> > 
> > It is possible to drop ctrl->fenced_work and requeue ctrl->fencing_work
> > as delayed work to implement request hold time. If we do that then we
> > need to modify nvme_tcp_fencing_work() to tell if it is being called for
> > 'fencing' or 'fenced'. The first version of this patch used a controller
> > flag RECOVERED for that and it has been suggested to use a separate work
> > to simplify the logic and drop the controller flag.
> > 
> But that's just because you integrated CCR within the current error 
> recovery.
> If you were to implement CCR as to be invoked from 
> nvme_keep_alive_end_io() we would not need to touch that,
> and we would need just one workqueue.
> 
> Let me see if I can draft up a patch.

I look forward to seeing this draft patch.

> 
> >>
> >>> +done:
> >>> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> >>> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> >>> +		queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> >>> +}
> >>> +
> >>> +static void nvme_tcp_flush_fencing_work(struct nvme_ctrl *ctrl)
> >>> +{
> >>> +	flush_work(&to_tcp_ctrl(ctrl)->fencing_work);
> >>> +	flush_delayed_work(&to_tcp_ctrl(ctrl)->fenced_work);
> >>> +}
> >>> +
> >>>    static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >>>    {
> >>>    	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
> >>>    				struct nvme_tcp_ctrl, err_work);
> >>>    	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> >>>    
> >>> +	nvme_tcp_flush_fencing_work(ctrl);
> >>
> >> Why not 'fenced_work' ?
> > 
> > You mean rename nvme_tcp_flush_fencing_work() to
> > nvme_tcp_flush_fenced_work()?
> > 
> > If yes, then I can do that if you think it makes more sense.
> > 
> Thing is, you have two workqueues dealing with CCR.
> You really need to avoid that _both_ are empty when this
> function is called.

Yes, that is why we call nvme_tcp_flush_fencing_work().

> 
> >>
> >>>    	if (nvme_tcp_key_revoke_needed(ctrl))
> >>>    		nvme_auth_revoke_tls_key(ctrl);
> >>>    	nvme_stop_keep_alive(ctrl);
> >>> @@ -2518,6 +2573,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
> >>>    		container_of(work, struct nvme_ctrl, reset_work);
> >>>    	int ret;
> >>>    
> >>> +	nvme_tcp_flush_fencing_work(ctrl);
> >>
> >> Same.
> >>
> >>>    	if (nvme_tcp_key_revoke_needed(ctrl))
> >>>    		nvme_auth_revoke_tls_key(ctrl);
> >>>    	nvme_stop_ctrl(ctrl);
> >>> @@ -2643,13 +2699,15 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
> >>>    	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
> >>>    	struct nvme_command *cmd = &pdu->cmd;
> >>>    	int qid = nvme_tcp_queue_id(req->queue);
> >>> +	enum nvme_ctrl_state state;
> >>>    
> >>>    	dev_warn(ctrl->device,
> >>>    		 "I/O tag %d (%04x) type %d opcode %#x (%s) QID %d timeout\n",
> >>>    		 rq->tag, nvme_cid(rq), pdu->hdr.type, cmd->common.opcode,
> >>>    		 nvme_fabrics_opcode_str(qid, cmd), qid);
> >>>    
> >>> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> >>> +	state = nvme_ctrl_state(ctrl);
> >>> +	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_FENCING) {
> >>
> >> 'FENCED' too, presumably?
> > 
> > I do not think it makes a difference here. FENCED and RESETTING are
> > almost the same states.
> > 
> Yeah, but in FENCED all commands will be aborted, and the same action 
> will be invoked when this if() clause is entered. So you need to avoid
> entering the if() clause to avoid races.

That is okay. A FENCED controller means either CCR succeeded or requests
have been held for time-based recovery. In both cases it should be okay
for these requests to be canceled. The only reason the FENCED state
exists is that we do not want a state transition from FENCING ->
RESETTING, because we do not want anyone to reset the controller while
it is being fenced. Otherwise FENCED = RESETTING.

> 
> >>
> >>>    		/*
> >>>    		 * If we are resetting, connecting or deleting we should
> >>>    		 * complete immediately because we may block controller
> >>> @@ -2904,6 +2962,8 @@ static struct nvme_tcp_ctrl *nvme_tcp_alloc_ctrl(struct device *dev,
> >>>    
> >>>    	INIT_DELAYED_WORK(&ctrl->connect_work,
> >>>    			nvme_tcp_reconnect_ctrl_work);
> >>> +	INIT_DELAYED_WORK(&ctrl->fenced_work, nvme_tcp_fenced_work);
> >>> +	INIT_WORK(&ctrl->fencing_work, nvme_tcp_fencing_work);
> >>>    	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
> >>>    	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
> >>>    
> >>
> >> Here you are calling CCR whenever error recovery is triggered.
> >> This will cause CCR to be send from a command timeout, which is
> >> technically wrong (CCR should be sent when the KATO timeout expires,
> >> not when a command timeout expires). Both could be vastly different.
> > 
> > KATO is driven by the host. What does KATO expiry mean?
> > I think KATO expiry is more applicable to target, no?
> > 
> KATO expiry (for this implementation) means the nvme_keep_alive_end_io()
> is called with rtt > deadline.
> 
> > A KATO timeout is a signal of an error, meaning the target is not
> > reachable or something is wrong with the target?
> > 
> Yes. It explicitly means that the Keep-Alive command was not completed
> within a time period specified by the Keep-Alive timeout value.
> 
> >>
> >> So I'd prefer to have CCR sent whenever the KATO timeout triggers, and
> >> leave the current command timeout mechanism in place.
> > 
> > Assuming we used CCR only when KATO request times out.
> > What should we do when we hit other errors?
> > 
> Leave it. This patchset is _NOT_ about fixing the error handler.
> This patchset is about implementing CCR.
> And CCR is just a step toward fixing the error handler.

If we leave it then an IO request timeout will cause a controller reset
and inflight IOs will be retried immediately. Is that what we want?

What if we have a target that responds to keepalive commands but not IO
commands? Should we reset the controller if we hit an IO timeout?

If we should not reset the controller, then what else should we do?

If we should reset the controller, then what should happen to inflight
IOs?

I do not agree that CCR should be triggered only if a keepalive request
times out. I do not see the link between the two.

> 
> > nvme_tcp_error_recovery() is called from many places to handle
> > errors and it effectively resets the controller. What should this
> > function do if not trigger CCR?
> > 
> The same things as before. CCR needs to be invoked when the Keep-Alive
> timeout period expires. And that is what we should be implementing here.
> 
> It _might_ (and arguably should) be invoked when error handling needs
> to be done. But with a later patchset.
> 
> Yes, this will result in a patchset where the error handler still has
> gaps. But it will result in a patchset which focuses on a single thing,
> with the added benefit that the workings are clearly outlined in the 
> spec. So the implementation is pretty easy to review.
> 
> Hooking CCR into the error handler itself is _not_ mandated by the spec,
> and inevitably leads to larger discussions (as this thread nicely 
> demonstrates). So I would prefer to keep it simple first, and focus
> on the technical implementation of CCR.
> And concentrate on fixing up the error handler once CCR is in.

Let us say CCR is only triggered by keepalive timeout, and we landed CCR
this way and now it is time to fix the error handler.

What should the error handler do to handle, say, an IO timeout?

Let us say we attempted to cancel the timed-out command and the cancel
command failed. What is the next step in this case?

> 
> Cheers,
> 
> Hannes
> >>
> >> Cheers,
> >>
> >> Hannes
> >> -- 
> >> Dr. Hannes Reinecke                  Kernel Storage Architect
> >> hare@suse.de                                +49 911 74053 688
> >> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> >> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
> 
> 
> -- 
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-01-30 22:34 ` [PATCH v2 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
  2026-02-03  5:19   ` Hannes Reinecke
@ 2026-02-10 22:09   ` James Smart
  2026-02-10 22:27     ` Mohamed Khalfella
  1 sibling, 1 reply; 82+ messages in thread
From: James Smart @ 2026-02-10 22:09 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, jsmart833426

On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
...
> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> +{
> +	unsigned long deadline, now, timeout;
> +	struct nvme_ctrl *sctrl;
> +	u32 min_cntlid = 0;
> +	int ret;
> +
> +	timeout = nvme_fence_timeout_ms(ictrl);
> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> +
> +	now = jiffies;
> +	deadline = now + msecs_to_jiffies(timeout);
> +	while (time_before(now, deadline)) {

Q: don't we have something to identify the controller's subsystem 
supports CCR before we starting selecting controllers and sending CCR ?

I would think on older devices that don't support it we should be 
skipping this loop.   The loop could delay the Time-Based delay without 
any CCR.

-- james




* Re: [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error
  2026-01-30 22:34 ` [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
  2026-02-03  5:43   ` Hannes Reinecke
@ 2026-02-10 22:12   ` James Smart
  2026-02-10 22:20     ` Mohamed Khalfella
  1 sibling, 1 reply; 82+ messages in thread
From: James Smart @ 2026-02-10 22:12 UTC (permalink / raw)
  To: Mohamed Khalfella, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg
  Cc: Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel, jsmart833426

On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> +static void nvme_fc_fenced_work(struct work_struct *work)
> +{
> +	struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> +			struct nvme_fc_ctrl, fenced_work);
> +	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> +
> +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> +		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> +}

I'm not a fan of 1, maybe 2, state changes that may silently fail. Some 
trace message would be worthwhile to state fencing cancelled/ended.

-- james




* Re: [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error
  2026-02-10 22:12   ` James Smart
@ 2026-02-10 22:20     ` Mohamed Khalfella
  2026-02-13 19:29       ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-10 22:20 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-10 14:12:24 -0800, James Smart wrote:
> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > +static void nvme_fc_fenced_work(struct work_struct *work)
> > +{
> > +	struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> > +			struct nvme_fc_ctrl, fenced_work);
> > +	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> > +
> > +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > +		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> > +}
> 
> I'm not a fan of 1, maybe 2, state changes that may silently fail. Some 
> trace message would be worthwhile to state fencing cancelled/ended.
> 
> -- james
> 

The change to FENCED should never fail. This is the only transition
allowed from FENCING state and this is the only place we do that.
Do you suggest I put WARN_ON() around it?

The second transition from FENCED to RESETTING can fail if someone
resets the controller. It should be fine to do nothing in this case
because they will have queued reset or error work.



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-10 22:09   ` James Smart
@ 2026-02-10 22:27     ` Mohamed Khalfella
  2026-02-10 22:49       ` James Smart
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-10 22:27 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> ...
> > +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > +{
> > +	unsigned long deadline, now, timeout;
> > +	struct nvme_ctrl *sctrl;
> > +	u32 min_cntlid = 0;
> > +	int ret;
> > +
> > +	timeout = nvme_fence_timeout_ms(ictrl);
> > +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > +
> > +	now = jiffies;
> > +	deadline = now + msecs_to_jiffies(timeout);
> > +	while (time_before(now, deadline)) {
> 
> Q: don't we have something to identify the controller's subsystem 
> supports CCR before we starting selecting controllers and sending CCR ?
> 
> I would think on older devices that don't support it we should be 
> skipping this loop.   The loop could delay the Time-Based delay without 
> any CCR.

I do not think we have something that identifies CCR support at
subsystem level. The spec defines CCRL at the controller level. The loop
should not be that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
not supported and nvme_fence_ctrl() will return immediately.

> 
> -- james
> 



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-10 22:27     ` Mohamed Khalfella
@ 2026-02-10 22:49       ` James Smart
  2026-02-10 23:25         ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: James Smart @ 2026-02-10 22:49 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
>> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
>> ...
>>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
>>> +{
>>> +	unsigned long deadline, now, timeout;
>>> +	struct nvme_ctrl *sctrl;
>>> +	u32 min_cntlid = 0;
>>> +	int ret;
>>> +
>>> +	timeout = nvme_fence_timeout_ms(ictrl);
>>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
>>> +
>>> +	now = jiffies;
>>> +	deadline = now + msecs_to_jiffies(timeout);
>>> +	while (time_before(now, deadline)) {
>>
>> Q: don't we have something to identify the controller's subsystem
>> supports CCR before we starting selecting controllers and sending CCR ?
>>
>> I would think on older devices that don't support it we should be
>> skipping this loop.   The loop could delay the Time-Based delay without
>> any CCR.
> 
> I do not think we have something that identifies CCR support at
> subsystem level. The spec defines CCRL at the controller level. The loop
> should not be that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> not supported and nvme_fence_ctrl() will return immediately.
> 
>>
>> -- james
>>

I would think CCRL on the failed controller would be enough to assume 
the subsystem supports it.

I'm not worried that the coding on the host is so bad. It's more the 
multiple paths that must have cmds sent to them and getting error 
responses for unknown cmds (should be responded to ok, but you never 
know) as well as creating conditions for other errors where there will 
be no return for it - e.g. other paths losing connectivity while the CCR 
is outstanding, etc. Yes, they all have to work, but why bother adding 
these flows to an old controller that would never do CCR ?

-- james



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-10 22:49       ` James Smart
@ 2026-02-10 23:25         ` Mohamed Khalfella
  2026-02-11  0:12           ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-10 23:25 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> >> ...
> >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> >>> +{
> >>> +	unsigned long deadline, now, timeout;
> >>> +	struct nvme_ctrl *sctrl;
> >>> +	u32 min_cntlid = 0;
> >>> +	int ret;
> >>> +
> >>> +	timeout = nvme_fence_timeout_ms(ictrl);
> >>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> >>> +
> >>> +	now = jiffies;
> >>> +	deadline = now + msecs_to_jiffies(timeout);
> >>> +	while (time_before(now, deadline)) {
> >>
> >> Q: don't we have something to identify the controller's subsystem
> >> supports CCR before we starting selecting controllers and sending CCR ?
> >>
> >> I would think on older devices that don't support it we should be
> >> skipping this loop.   The loop could delay the Time-Based delay without
> >> any CCR.
> > 
> > I do not think we have something that identifies CCR support at
> > subsystem level. The spec defines CCRL at the controller level. The loop
> > should not that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > not supported and nvme_fence_ctrl() will return immediately.
> > 
> >>
> >> -- james
> >>
> 
> I would think CCRL on the failed controller would be enough to assume 
> the subsystem supports it.

ictrl->ccr_limit is a good indication that the subsystem supports CCR. I
do not think it is enough though. I say that for two reasons:

- Maybe this controller does not support CCR but others on the same
  subsystem do. Nothing prevents the subsystem from putting a cap on
  CCR at the subsystem level.
- Maybe this controller supports the CCR command but not right now
  because all CCR slots are currently used. This can happen in the case
  of a cascading failure.

> 
> I'm not worried about the coding on the host is so bad. It's more the 
> multiple paths that must have cmds sent to them and getting error 
> responses for unknown cmds (should be responded to ok, but you never 
> know) as well as creating conditions for other errors where there will 
> be no return for it - e.g. other paths losing connectivity while the ccr 
> outstanding, etc. yes, they all have to work, but why bother adding 
> these flows to an old controller that would never do CCR ?

If nvme_find_ctrl_ccr() returns a source controller to use then we know
the controller supports CCR and does have an available slot to process
this CCR request. I do not see how this code will send CCR request to an
old controller that does not know about CCR command.

I am not fully opposed to using ictrl->ccr_limit to return early; I just
do not see the need for it. If you feel strongly about it I can update
nvme_fence_ctrl() to do so.




* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-10 23:25         ` Mohamed Khalfella
@ 2026-02-11  0:12           ` Mohamed Khalfella
  2026-02-11  3:33             ` Randy Jennings
  0 siblings, 1 reply; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-11  0:12 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-10 15:25:55 -0800, Mohamed Khalfella wrote:
> On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> > On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> > >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > >> ...
> > >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > >>> +{
> > >>> +	unsigned long deadline, now, timeout;
> > >>> +	struct nvme_ctrl *sctrl;
> > >>> +	u32 min_cntlid = 0;
> > >>> +	int ret;
> > >>> +
> > >>> +	timeout = nvme_fence_timeout_ms(ictrl);
> > >>> +	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > >>> +
> > >>> +	now = jiffies;
> > >>> +	deadline = now + msecs_to_jiffies(timeout);
> > >>> +	while (time_before(now, deadline)) {
> > >>
> > >> Q: don't we have something to identify the controller's subsystem
> > >> supports CCR before we starting selecting controllers and sending CCR ?
> > >>
> > >> I would think on older devices that don't support it we should be
> > >> skipping this loop.   The loop could delay the Time-Based delay without
> > >> any CCR.
> > > 
> > > I do not think we have something that identifies CCR support at
> > > subsystem level. The spec defines CCRL at the controller level. The loop
> > > should not be that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > > not supported and nvme_fence_ctrl() will return immediately.
> > > 
> > >>
> > >> -- james
> > >>
> > 
> > I would think CCRL on the failed controller would be enough to assume 
> > the subsystem supports it.
> 
> ictrl->ccr_limit is a good indication that subsystem supports CCR. I do
> not think it is enough though. I say that for two reasons:
> 
> - Maybe this controller does not support CCR but others on the same
>   subsystem do. Nothing prevents a subsystem from putting a cap on
>   CCR at the subsystem level.
> - Maybe this controller supports the CCR command but not right now,
>   because all CCR slots are currently in use. This can happen in the
>   case of cascading failure.
> 
> > 
> > I'm not worried about the coding on the host is so bad. It's more the 
> > multiple paths that must have cmds sent to them and getting error 
> > responses for unknown cmds (should be responded to ok, but you never 
> > know) as well as creating conditions for other errors where there will 
> > be no return for it - e.g. other paths losing connectivity while the ccr 
> > outstanding, etc. yes, they all have to work, but why bother adding 
> > these flows to an old controller that would never do CCR ?
> 
> If nvme_find_ctrl_ccr() returns a source controller to use then we know
> the controller supports CCR and does have an available slot to process
> this CCR request. I do not see how this code could send a CCR request to
> an old controller that does not know about the CCR command.
> 
> I am not fully opposed to using ictrl->ccr_limit to return early, but I
> do not see the need for it. If you feel strongly about it I can update
> nvme_fence_ctrl() to do so.
> 

I forgot to mention that ctrl->ccr_limit is initialized from id->ccrl in 
nvme_init_identify(). If this value is greater than zero then we know
the controller does support CCR. nvme_find_ctrl_ccr() checks for that
and the returned source controller must support CCR and have a slot
available for it.
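For illustration, the selection rule described here could be modeled as below. This is a standalone sketch, not the driver code: the struct, its field names, and ccr_slot_available() are hypothetical stand-ins for the ctrl->ccr_limit bookkeeping.

```c
#include <assert.h>

/*
 * Illustrative model (not the driver code): a source controller is a
 * usable CCR target only if it advertised CCRL > 0 in Identify and
 * still has a slot free, counting outstanding CCR commands against
 * the limit. Field and function names here are hypothetical.
 */
struct src_ctrl_model {
	unsigned int ccr_limit;		/* from Identify CCRL; 0 = no CCR */
	unsigned int ccr_outstanding;	/* CCR commands currently in flight */
};

static int ccr_slot_available(const struct src_ctrl_model *sctrl)
{
	return sctrl->ccr_limit > 0 &&
	       sctrl->ccr_outstanding < sctrl->ccr_limit;
}
```

A controller with ccr_limit == 0 never qualifies, which is the same property nvme_find_ctrl_ccr() is said to rely on above.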



* Re: [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
  2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
  2026-02-03  3:04   ` Hannes Reinecke
  2026-02-07 13:47   ` Sagi Grimberg
@ 2026-02-11  0:50   ` Randy Jennings
  2026-02-11  1:02     ` Mohamed Khalfella
  2 siblings, 1 reply; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  0:50 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, Hannes Reinecke, linux-nvme,
	linux-kernel

On Fri, Jan 30, 2026 at 2:36 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Export ctrl->random and ctrl->uniquifier as debugfs files under
Please update to the new names.

> controller debugfs directory.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>  drivers/nvme/target/debugfs.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/drivers/nvme/target/debugfs.c b/drivers/nvme/target/debugfs.c
> index 5dcbd5aa86e1..1300adf6c1fb 100644
> --- a/drivers/nvme/target/debugfs.c
> +++ b/drivers/nvme/target/debugfs.c
> @@ -152,6 +152,23 @@ static int nvmet_ctrl_tls_concat_show(struct seq_file *m, void *p)
>  }
>  NVMET_DEBUGFS_ATTR(nvmet_ctrl_tls_concat);
>  #endif
> +static int nvmet_ctrl_instance_ciu_show(struct seq_file *m, void *p)
> +{
> +       struct nvmet_ctrl *ctrl = m->private;
> +
> +       seq_printf(m, "%02x\n", ctrl->ciu);
> +       return 0;
> +}
> +NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_ciu);
> +
> +static int nvmet_ctrl_instance_cirn_show(struct seq_file *m, void *p)
> +{
> +       struct nvmet_ctrl *ctrl = m->private;
> +
> +       seq_printf(m, "%016llx\n", ctrl->cirn);
> +       return 0;
> +}
> +NVMET_DEBUGFS_ATTR(nvmet_ctrl_instance_cirn);
>
>  int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
>  {
> @@ -184,6 +201,10 @@ int nvmet_debugfs_ctrl_setup(struct nvmet_ctrl *ctrl)
>         debugfs_create_file("tls_key", S_IRUSR, ctrl->debugfs_dir, ctrl,
>                             &nvmet_ctrl_tls_key_fops);
>  #endif
> +       debugfs_create_file("ciu", S_IRUSR, ctrl->debugfs_dir, ctrl,
> +                           &nvmet_ctrl_instance_ciu_fops);
> +       debugfs_create_file("cirn", S_IRUSR, ctrl->debugfs_dir, ctrl,
> +                           &nvmet_ctrl_instance_cirn_fops);
>         return 0;
>  }
>
> --
> 2.52.0
>



* Re: [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values
  2026-02-11  0:50   ` Randy Jennings
@ 2026-02-11  1:02     ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-11  1:02 UTC (permalink / raw)
  To: Randy Jennings
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, Hannes Reinecke, linux-nvme,
	linux-kernel

On Tue 2026-02-10 16:50:51 -0800, Randy Jennings wrote:
> On Fri, Jan 30, 2026 at 2:36 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
> >
> > Export ctrl->random and ctrl->uniquifier as debugfs files under
> Please update to the new names.

Good catch. Done.



* Re: [PATCH v2 03/14] nvmet: Implement CCR nvme command
  2026-02-09 19:27                   ` Mohamed Khalfella
@ 2026-02-11  1:34                     ` Randy Jennings
  0 siblings, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  1:34 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Sagi Grimberg, Hannes Reinecke, Justin Tee, Naresh Gottumukkala,
	Paul Ely, Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe,
	Keith Busch, Aaron Dailey, Dhaval Giani, linux-nvme, linux-kernel

On Mon, Feb 9, 2026 at 11:27 AM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Sun 2026-02-08 15:10:36 -0800, Mohamed Khalfella wrote:
> > On Sat 2026-02-07 15:58:49 +0200, Sagi Grimberg wrote:
> > >
> > >
> > > On 04/02/2026 19:52, Mohamed Khalfella wrote:
> > > > On Wed 2026-02-04 01:55:18 +0100, Hannes Reinecke wrote:
> > > >> On 2/4/26 01:44, Mohamed Khalfella wrote:
> > > >>> On Wed 2026-02-04 01:38:44 +0100, Hannes Reinecke wrote:
> > > >>>> On 2/3/26 19:40, Mohamed Khalfella wrote:
> > > >>>>> On Tue 2026-02-03 04:19:50 +0100, Hannes Reinecke wrote:
> > > >>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > > >>>>>>> @@ -1501,6 +1516,38 @@ struct nvmet_ctrl *nvmet_ctrl_find_get(const char *subsysnqn,
> > > >>>>>>>             return ctrl;
> > > >>>>>>>      }
> > > >>>>>>>
> > > >>>>>>> +struct nvmet_ctrl *nvmet_ctrl_find_get_ccr(struct nvmet_subsys *subsys,
> > > >>>>>>> +                                      const char *hostnqn, u8 ciu,
> > > >>>>>>> +                                      u16 cntlid, u64 cirn)
> > > >>>>>>> +{
> > > >>>>>>> +   struct nvmet_ctrl *ctrl;
> > > >>>>>>> +   bool found = false;
> > > >>>>>>> +
> > > >>>>>>> +   mutex_lock(&subsys->lock);
> > > >>>>>>> +   list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
> > > >>>>>>> +           if (ctrl->cntlid != cntlid)
> > > >>>>>>> +                   continue;
> > > >>>>>>> +           if (strncmp(ctrl->hostnqn, hostnqn, NVMF_NQN_SIZE))
> > > >>>>>>> +                   continue;
> > > >>>>>>> +
> > > >>>>>> Why do we compare the hostnqn here, too? To my understanding the host
> > > >>>>>> NQN is tied to the controller, so the controller ID should be sufficient
> > > >>>>>> here.
> > > >>>>> We got cntlid from CCR nvme command and we do not trust the value sent by
> > > >>>>> the host. We check hostnqn to confirm that host is actually connected to
> > > >>>>> the impacted controller. A host should not be allowed to reset a
> > > >>>>> controller connected to another host.
> > > >>>>>
> > > >>>> Errm. So we're starting to not trust values in NVMe commands?
> > > >>>> That is a very slippery road.
> > > >>>> Ultimately it would require us to validate the cntlid on each
> > > >>>> admin command. Which we don't.
> > > >>>> And really there is no difference between CCR and any other
> > > >>>> admin command; you get even worse effects if you would assume
> > > >>>> a misdirected 'FORMAT' command.
> > > >>>>
> > > >>>> Please don't. Security is _not_ a concern here.
> > > >>> I do not think the check hurts. If you say it is wrong I will delete it.
> > > >>>
> > > >> It's not 'wrong', It's inconsistent. The argument that the contents of
> > > >> an admin command may be wrong applies to _every_ admin command.
> > > >> Yet we never check on any of those commands.
> > > >> So I fail to see why this command requires special treatment.
> > > > Okay, I will delete this check.
Security is a concern.  But the concern is not validating the cntlid per se.

CCR is a different type of command from other NVMe commands
because it affects a different controller, one that could be connected
to a different host.

The Host NQN check was put in specifically to prevent one host from
aborting connections on another host.  One host terminating the
association with another host is definitely an attack vector, and it was
considered when the command was designed.

Authentication in NVMe is tied around the NQN.  Yes, another host
could fake its NQN, and authentication is supposed to catch that.  Of
course you have to implement authentication and have it turned on.

Hannes, why do you think security is not a concern?  Because
authentication is not implemented?  I can buy that; a comment
that says "Not verifying host nqn because authentication is not
implemented either." would work.

> > >
> > > It is a very different command than other commands that nvmet serves. Format
> > > is different because it is an attached namespace, hence the host should
> > > be able
> > > to format it. If it would have been possible to access a namespace that
> > > is not mapped
> > > to a controller, then this check would have been warranted I think.
> > >
> > > There have been some issues lately opened on nvme-tcp that expose
> > > attacks that can
> > > crash the kernel with some hand-crafted commands, I'd say that this is a
> > > potential attack vector.
The two NQNs being compared come from the ctrl structures, which have
already had their NQNs logged on connect.  What attack vector are you
seeing here with the NQNs?

> >
> > For an attacker to exploit the CCR command they will have to guess both
> > CIU (8-bit) and CIRN (64-bit) random values correctly. I do not see how an
> > attacker can find these values without being connected to the impacted
> > controller.
>
> I changed ciu and cirn sysfs controller attributes on the host from S_IRUGO
> to S_IRUSR. This should mitigate an attack by unprivileged process
> running on host. On the target side debugfs files already have S_IRUSR
> so no change is needed.
>

Sincerely,
Randy Jennings



* Re: [PATCH v2 04/14] nvmet: Implement CCR logpage
  2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
  2026-02-03  3:21   ` Hannes Reinecke
  2026-02-07 14:11   ` Sagi Grimberg
@ 2026-02-11  1:49   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  1:49 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, Hannes Reinecke, linux-nvme,
	linux-kernel

On Fri, Jan 30, 2026 at 2:36 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Defined by TP8028 Rapid Path Failure Recovery, CCR (Cross-Controller
> Reset) log page contains an entry for each CCR request submitted to
> source controller. Implement CCR logpage for nvme linux target.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>



* Re: [PATCH v2 05/14] nvmet: Send an AEN on CCR completion
  2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
  2026-02-03  3:27   ` Hannes Reinecke
  2026-02-07 14:12   ` Sagi Grimberg
@ 2026-02-11  1:52   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  1:52 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, Hannes Reinecke, linux-nvme,
	linux-kernel

On Fri, Jan 30, 2026 at 2:36 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> Send an AEN to initiator when impacted controller exists. The
> notification points to CCR log page that initiator can read to check
> which CCR operation completed.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>



* Re: [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields
  2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
  2026-02-03  3:28   ` Hannes Reinecke
  2026-02-07 14:13   ` Sagi Grimberg
@ 2026-02-11  1:56   ` Randy Jennings
  2 siblings, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  1:56 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, Hannes Reinecke, linux-nvme,
	linux-kernel

On Fri, Jan 30, 2026 at 2:36 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> TP8028 Rapid path failure added new fields to controller identify
> response. Read CIU (Controller Instance Uniquifier), CIRN (Controller
> Instance Random Number), and CCRL (Cross-Controller Reset Limit) from
> controller identify response. Expose CIU and CIRN as sysfs attributes
> so the values can be used directly by the user if needed.
>
> TP4129 KATO Corrections and Clarifications defined CQT (Command Quiesce
> Time) which is used along with KATO (Keep Alive Timeout) to set an upper
> limit for attempting Cross-Controller Recovery.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-11  0:12           ` Mohamed Khalfella
@ 2026-02-11  3:33             ` Randy Jennings
  0 siblings, 0 replies; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  3:33 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: James Smart, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Aaron Dailey, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue, Feb 10, 2026 at 4:12 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Tue 2026-02-10 15:25:55 -0800, Mohamed Khalfella wrote:
> > On Tue 2026-02-10 14:49:15 -0800, James Smart wrote:
> > > On 2/10/2026 2:27 PM, Mohamed Khalfella wrote:
> > > > On Tue 2026-02-10 14:09:27 -0800, James Smart wrote:
> > > >> On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > > >> ...
> > > >>> +unsigned long nvme_fence_ctrl(struct nvme_ctrl *ictrl)
> > > >>> +{
> > > >>> +       unsigned long deadline, now, timeout;
> > > >>> +       struct nvme_ctrl *sctrl;
> > > >>> +       u32 min_cntlid = 0;
> > > >>> +       int ret;
> > > >>> +
> > > >>> +       timeout = nvme_fence_timeout_ms(ictrl);
> > > >>> +       dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > > >>> +
> > > >>> +       now = jiffies;
> > > >>> +       deadline = now + msecs_to_jiffies(timeout);
> > > >>> +       while (time_before(now, deadline)) {
> > > >>
> > > >> Q: don't we have something to identify the controller's subsystem
> > > >> supports CCR before we starting selecting controllers and sending CCR ?
> > > >>
> > > >> I would think on older devices that don't support it we should be
> > > >> skipping this loop.   The loop could delay the Time-Based delay without
> > > >> any CCR.
> > > >
> > > > I do not think we have something that identifies CCR support at
> > > > subsystem level. The spec defines CCRL at the controller level. The loop
> > > > should not be that bad. nvme_find_ctrl_ccr() should return NULL if CCR is
> > > > not supported and nvme_fence_ctrl() will return immediately.
> > > >
> > > >>
> > > >> -- james
> > > >>
> > >
> > > I would think CCRL on the failed controller would be enough to assume
> > > the subsystem supports it.
> >
> > ictrl->ccr_limit is a good indication that subsystem supports CCR. I do
> > not think it is enough though. I say that for two reasons:
> >
> > - Maybe this controller does not support CCR but others on the same
> >   subsystem do. Nothing prevents a subsystem from putting a cap on
> >   CCR at the subsystem level.
This is not a concern.  Controllers should support CCR uniformly in
a subsystem, and it would be unusual for them not to.

> > - Maybe this controller supports the CCR command but not right now,
> >   because all CCR slots are currently in use. This can happen in the
> >   case of cascading failure.
This is the concern.  Because the code modifies ccrl while it has CCR
cmds outstanding, checking ictrl may not give an accurate picture of
whether CCR is supported.


> >
> > >
> > > I'm not worried about the coding on the host is so bad. It's more the
> > > multiple paths that must have cmds sent to them and getting error
> > > responses for unknown cmds (should be responded to ok, but you never
> > > know) as well as creating conditions for other errors where there will
> > > be no return for it - e.g. other paths losing connectivity while the ccr
> > > outstanding, etc. yes, they all have to work, but why bother adding
> > > these flows to an old controller that would never do CCR ?
> >
> > If nvme_find_ctrl_ccr() returns a source controller to use then we know
> > the controller supports CCR and does have an available slot to process
> > this CCR request. I do not see how this code could send a CCR request to
> > an old controller that does not know about the CCR command.
> >
> > I am not fully opposed to using ictrl->ccr_limit to return early, but I
> > do not see the need for it. If you feel strongly about it I can update
> > nvme_fence_ctrl() to do so.
> >
>
> I forgot to mention that ctrl->ccr_limit is initialized from id->ccrl in
> nvme_init_identify(). If this value is greater than zero then we know
> the controller does support CCR. nvme_find_ctrl_ccr() checks for that
> and the returned source controller must support CCR and have a slot
> available for it.



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-04 23:24         ` Mohamed Khalfella
@ 2026-02-11  3:44           ` Randy Jennings
  2026-02-11 15:19             ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Randy Jennings @ 2026-02-11  3:44 UTC (permalink / raw)
  To: Mohamed Khalfella
  Cc: Hannes Reinecke, Justin Tee, Naresh Gottumukkala, Paul Ely,
	Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Aaron Dailey, Dhaval Giani, linux-nvme,
	linux-kernel

On Wed, Feb 4, 2026 at 3:24 PM Mohamed Khalfella
<mkhalfella@purestorage.com> wrote:
>
> On Wed 2026-02-04 02:10:48 +0100, Hannes Reinecke wrote:
> > On 2/3/26 21:00, Mohamed Khalfella wrote:
> > > On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
> > >> On 1/30/26 23:34, Mohamed Khalfella wrote:
> > [ .. ]
> > >>> + timeout = nvme_fence_timeout_ms(ictrl);
> > >>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
> > >>> +
> > >>> + now = jiffies;
> > >>> + deadline = now + msecs_to_jiffies(timeout);
> > >>> + while (time_before(now, deadline)) {
> > >>> +         sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
> > >>> +         if (!sctrl) {
> > >>> +                 /* CCR failed, switch to time-based recovery */
> > >>> +                 return deadline - now;
> > >>> +         }
> > >>> +
> > >>> +         ret = nvme_issue_wait_ccr(sctrl, ictrl);
> > >>> +         if (!ret) {
> > >>> +                 dev_info(ictrl->device, "CCR succeeded using %s\n",
> > >>> +                          dev_name(sctrl->device));
> > >>> +                 nvme_put_ctrl_ccr(sctrl);
> > >>> +                 return 0;
> > >>> +         }
> > >>> +
> > >>> +         /* CCR failed, try another path */
> > >>> +         min_cntlid = sctrl->cntlid + 1;
> > >>> +         nvme_put_ctrl_ccr(sctrl);
> > >>> +         now = jiffies;
> > >>> + }
> > >>
> > >> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
> > >> returns an error. _And_ if the CCR itself runs into a timeout we would
> > >> never have tried another path (which could have succeeded).
> > >
> > > True. We can do one thing at a time in CCR time budget. Either wait for
> > > CCR to succeed or give up early and try another path. It is a trade off.
> > >
> > Yes. But I guess my point here is that we should differentiate between
> > 'CCR failed to be sent' and 'CCR completed with error'.
> > The logic above treats both the same.
> >
> > >>
> > >> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
> > >> and only switch to the next controller if the submission of CCR failed.
> > >> Once that is done we can 'just' wait for completion, as a failure there
> > >> will be after KATO timeout anyway and any subsequent CCR would be pointless.
> > >
> > > If I understood this correctly then we will stick with the first sctrl
> > > that accepts the CCR command. We wait for CCR to complete and give up on
> > > fencing ictrl if CCR operation fails or times out. Did I get this correctly?
> > >
> > Yes.
> > If a CCR could be send but the controller failed to process it something
> > very odd is ongoing, and it's extremely questionable whether a CCR to
> > another controller would be succeeding. That's why I would switch to the
> > next available controller if we could not _send_ the CCR, but would
> > rather wait for KATO if CCR processing returned an error.
> >
> > But the main point is that CCR is a way to _shorten_ the interval
> > (until KATO timeout) until we can start retrying commands.
> > If the controller ran into an error during CCR processing chances
> > are that quite some time has elapsed already, and we might as well
> > wait for KATO instead of retrying with yet another CCR.
>
> Got it. I updated the code to do that.
It is not true that CCR failing means something odd is going on.  In a
tightly-coupled storage HA pair, hopefully, all the NVMe controllers
will be able to figure out the status of the other NVMe controllers.
However, I know of multiple systems (one of which I care about) where
the NVMe controllers may have no way of figuring out the state of some
other NVMe controllers.  In that case, the log page entry indicates
that the CCR might succeed on some other NVMe controller (and in these
systems, I expect they would not be able to be particularly specific
about which one).  Very little time will elapse for that to happen.

It is important for those systems to have a retry on another NVMe
controller.

Sincerely,
Randy Jennings



* Re: [PATCH v2 08/14] nvme: Implement cross-controller reset recovery
  2026-02-11  3:44           ` Randy Jennings
@ 2026-02-11 15:19             ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2026-02-11 15:19 UTC (permalink / raw)
  To: Randy Jennings, Mohamed Khalfella
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Dhaval Giani, linux-nvme, linux-kernel

On 2/11/26 04:44, Randy Jennings wrote:
> On Wed, Feb 4, 2026 at 3:24 PM Mohamed Khalfella
> <mkhalfella@purestorage.com> wrote:
>>
>> On Wed 2026-02-04 02:10:48 +0100, Hannes Reinecke wrote:
>>> On 2/3/26 21:00, Mohamed Khalfella wrote:
>>>> On Tue 2026-02-03 06:19:51 +0100, Hannes Reinecke wrote:
>>>>> On 1/30/26 23:34, Mohamed Khalfella wrote:
>>> [ .. ]
>>>>>> + timeout = nvme_fence_timeout_ms(ictrl);
>>>>>> + dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
>>>>>> +
>>>>>> + now = jiffies;
>>>>>> + deadline = now + msecs_to_jiffies(timeout);
>>>>>> + while (time_before(now, deadline)) {
>>>>>> +         sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
>>>>>> +         if (!sctrl) {
>>>>>> +                 /* CCR failed, switch to time-based recovery */
>>>>>> +                 return deadline - now;
>>>>>> +         }
>>>>>> +
>>>>>> +         ret = nvme_issue_wait_ccr(sctrl, ictrl);
>>>>>> +         if (!ret) {
>>>>>> +                 dev_info(ictrl->device, "CCR succeeded using %s\n",
>>>>>> +                          dev_name(sctrl->device));
>>>>>> +                 nvme_put_ctrl_ccr(sctrl);
>>>>>> +                 return 0;
>>>>>> +         }
>>>>>> +
>>>>>> +         /* CCR failed, try another path */
>>>>>> +         min_cntlid = sctrl->cntlid + 1;
>>>>>> +         nvme_put_ctrl_ccr(sctrl);
>>>>>> +         now = jiffies;
>>>>>> + }
>>>>>
>>>>> That will spin until 'deadline' is reached if 'nvme_issue_wait_ccr()'
>>>>> returns an error. _And_ if the CCR itself runs into a timeout we would
>>>>> never have tried another path (which could have succeeded).
>>>>
>>>> True. We can do one thing at a time in CCR time budget. Either wait for
>>>> CCR to succeed or give up early and try another path. It is a trade off.
>>>>
>>> Yes. But I guess my point here is that we should differentiate between
>>> 'CCR failed to be sent' and 'CCR completed with error'.
>>> The logic above treats both the same.
>>>
>>>>>
>>>>> I'd rather rework this loop to open-code 'issue_and_wait()' in the loop,
>>>>> and only switch to the next controller if the submission of CCR failed.
>>>>> Once that is done we can 'just' wait for completion, as a failure there
>>>>> will be after KATO timeout anyway and any subsequent CCR would be pointless.
>>>>
>>>> If I understood this correctly then we will stick with the first sctrl
>>>> that accepts the CCR command. We wait for CCR to complete and give up on
>>>> fencing ictrl if CCR operation fails or times out. Did I get this correctly?
>>>>
>>> Yes.
>>> If a CCR could be send but the controller failed to process it something
>>> very odd is ongoing, and it's extremely questionable whether a CCR to
>>> another controller would be succeeding. That's why I would switch to the
>>> next available controller if we could not _send_ the CCR, but would
>>> rather wait for KATO if CCR processing returned an error.
>>>
>>> But the main point is that CCR is a way to _shorten_ the interval
>>> (until KATO timeout) until we can start retrying commands.
>>> If the controller ran into an error during CCR processing chances
>>> are that quite some time has elapsed already, and we might as well
>>> wait for KATO instead of retrying with yet another CCR.
>>
>> Got it. I updated the code to do that.
> It is not true that CCR failing means something odd is going on.  In a
> tightly-coupled storage HA pair, hopefully, all the NVMe controllers
> will be able to figure out the status of the other NVMe controllers.
> However, I know of multiple systems (one of which I care about) where
> the NVMe controllers may have no way of figuring out the state of some
> other NVMe controllers.  In that case, the log page entry indicates
> that the CCR might succeed on some other NVMe controller (and in these
> systems, I expect they would not be able to be particularly specific
> about which one).  Very little time will elapse for that to happen.
> 
> It is important for those systems to have a retry on another NVMe
> controller.
> 
Ah, well; that's me being mainly focused on command timeouts.
If we get an NVMe status back indicating we should retry on
another controller then clearly we should be doing that.
The comment above was primarily geared for a CCR command for
which we do _not_ get a result back.

Or, put it another way: as long as we're within the KATO timeout
range we should retry the CCR command on another path.
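The policy this exchange converges on — keep retrying CCR on other paths while the fence window lasts, then fall back to time-based recovery — could be sketched roughly as follows. This is an assumed, standalone model, not the kernel code: try_ccr(), the path numbering, and the integer attempt budget (standing in for the jiffies deadline) are all illustrative.

```c
#include <assert.h>

/* Hypothetical per-attempt outcomes, standing in for the results of
 * nvme_issue_wait_ccr() on one candidate source controller. */
enum ccr_status { CCR_OK, CCR_SEND_FAIL, CCR_CMD_ERROR };

/*
 * Rough model of the retry policy: while the fence budget (standing in
 * for the KATO/CQT deadline) lasts, try each candidate path in turn;
 * any failure moves on to the next path. Returns 0 if some path fenced
 * the controller via CCR, -1 if the caller must fall back to
 * time-based recovery.
 */
static int fence_policy(enum ccr_status (*try_ccr)(int path),
			int nr_paths, int budget)
{
	for (int path = 0; path < nr_paths && budget-- > 0; path++)
		if (try_ccr(path) == CCR_OK)
			return 0;
	return -1;
}

/* Example path behaviors used in the usage notes below. */
static enum ccr_status first_path_down(int path)
{
	return path == 0 ? CCR_SEND_FAIL : CCR_OK;
}

static enum ccr_status all_paths_fail(int path)
{
	(void)path;
	return CCR_CMD_ERROR;
}
```

With a budget of one attempt, a send failure on the first path already forces the time-based fallback; with budget to spare, the second path can still fence the controller.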

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error
  2026-02-10 22:20     ` Mohamed Khalfella
@ 2026-02-13 19:29       ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-13 19:29 UTC (permalink / raw)
  To: James Smart
  Cc: Justin Tee, Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Aaron Dailey, Randy Jennings, Dhaval Giani, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue 2026-02-10 14:20:37 -0800, Mohamed Khalfella wrote:
> On Tue 2026-02-10 14:12:24 -0800, James Smart wrote:
> > On 1/30/2026 2:34 PM, Mohamed Khalfella wrote:
> > > +static void nvme_fc_fenced_work(struct work_struct *work)
> > > +{
> > > +	struct nvme_fc_ctrl *fc_ctrl = container_of(to_delayed_work(work),
> > > +			struct nvme_fc_ctrl, fenced_work);
> > > +	struct nvme_ctrl *ctrl = &fc_ctrl->ctrl;
> > > +
> > > +	nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> > > +	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> > > +		queue_work(nvme_reset_wq, &fc_ctrl->ioerr_work);
> > > +}
> > 
> > I'm not a fan of 1, maybe 2, state changes that may silently fail. Some 
> > trace message would be worthwhile to state fencing cancelled/ended.
> > 
> > -- james
> > 
> 
> The change to FENCED should never fail. This is the only transition
> allowed from FENCING state and this is the only place we do that.
> Do you suggest I put WARN_ON() around it?
> 
> The second transition from FENCED to RESETTING can fail if someone
> resets the controller. It should be fine to do nothing in this case
> because they will have queued reset or error work.

Sorry, I forgot to comment on adding a trace message when fencing
finishes. Added it now.
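For illustration, the two-step transition discussed here can be modeled as a tiny state machine. This is a hypothetical sketch, not the driver's state table; only the transitions mentioned in the thread are encoded.

```c
#include <assert.h>

/* Illustrative subset of controller states discussed in the thread. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_FENCING, CTRL_FENCED };

struct ctrl_model { enum ctrl_state state; };

/*
 * Model of nvme_change_ctrl_state() restricted to the transitions
 * mentioned above: FENCING -> FENCED always succeeds (it is the only
 * exit from FENCING), while FENCED -> RESETTING can fail if a racing
 * reset already moved the controller out of FENCED.
 * Returns 1 if the transition was made, 0 otherwise.
 */
static int change_state(struct ctrl_model *c, enum ctrl_state next)
{
	int ok = (next == CTRL_FENCED && c->state == CTRL_FENCING) ||
		 (next == CTRL_RESETTING && c->state == CTRL_FENCED);
	if (ok)
		c->state = next;
	return ok;
}
```

In this model the second transition failing is benign, matching the reasoning above: a racing reset means reset or error work is already queued.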



* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-02-07 13:41         ` Sagi Grimberg
@ 2026-02-14  0:42           ` Randy Jennings
  2026-02-14  3:56             ` Mohamed Khalfella
  0 siblings, 1 reply; 82+ messages in thread
From: Randy Jennings @ 2026-02-14  0:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, Mohamed Khalfella, Justin Tee,
	Naresh Gottumukkala, Paul Ely, Chaitanya Kulkarni,
	Christoph Hellwig, Jens Axboe, Keith Busch, Aaron Dailey,
	Dhaval Giani, linux-nvme, linux-kernel

On Sat, Feb 7, 2026 at 5:41 AM Sagi Grimberg <sagi@grimberg.me> wrote:
>
>
> > Precisely my point. If CQT defaults to zero no delay will be inserted,
> > but we _still_ have CCR handling. Just proving that both really are
> > independent on each other. Hence I would prefer to have two patchsets.
>
> Agreed. I think it would be cleaner to review each separately. the CCR
> can be based
> on top of the CQT patchset, for simplicity.
>
Hannes, I have some concerns about structuring the CQT patches
separate from the CCR patches.

We are trying to fix the Ghost Write data corruption, and this has
been a long process (involving 3 modifications to the spec and a
number of proposed patch sets). We've already been told (explicitly
and implicitly) that CCR is the more interesting solution (a
euphemism) for solving this data corruption. I have some concern
that, if separated, the CCR patches will get accepted without the CQT
patches.

CCR does not work in all cases, so I do not see them as independent as
you suggest. Also, CQT is a natural time cap on CCR retries, covering
the case where CCR cannot succeed down some paths. When used as a
time cap, CQT fits well with the CCR changes.

I am also concerned that splitting will make the CQT changes more
complicated to review, with at least more patches overall.

Structuring the patches with CQT first and CCR second would take care
of many of these concerns.

Sincerely,
Randy Jennings


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields
  2026-02-14  0:42           ` Randy Jennings
@ 2026-02-14  3:56             ` Mohamed Khalfella
  0 siblings, 0 replies; 82+ messages in thread
From: Mohamed Khalfella @ 2026-02-14  3:56 UTC (permalink / raw)
  To: Randy Jennings
  Cc: Sagi Grimberg, Hannes Reinecke, Justin Tee, Naresh Gottumukkala,
	Paul Ely, Chaitanya Kulkarni, Christoph Hellwig, Jens Axboe,
	Keith Busch, Aaron Dailey, Dhaval Giani, linux-nvme, linux-kernel

On Fri 2026-02-13 16:42:55 -0800, Randy Jennings wrote:
> On Sat, Feb 7, 2026 at 5:41 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> >
> >
> > > Precisely my point. If CQT defaults to zero no delay will be inserted,
> > > but we _still_ have CCR handling. Just proving that both really are
> > > independent on each other. Hence I would prefer to have two patchsets.
> >
> > Agreed. I think it would be cleaner to review each separately. the CCR
> > can be based
> > on top of the CQT patchset, for simplicity.
> >
> Hannes, I have some concerns about structuring the CQT patches
> separate from the CCR patches.
> 
> We are trying to fix the Ghost Write data corruption, and this has
> been a long process (involving 3 modifications to the spec and a
> number of proposed patch sets). We've already been told (explicitly
> and implicitly) that CCR is the more interesting solution (a
> euphemism) for solving this data corruption. I have some concern
> that, if separated, the CCR patches will get accepted without the CQT
> patches.
> 
> CCR does not work in all cases, so I do not see them as independent as
> you suggest. Also, CQT is a natural time cap on CCR retries, covering
> the case where CCR cannot succeed down some paths. When used as a
> time cap, CQT fits well with the CCR changes.
> 
> I am also concerned that splitting will make the CQT changes more
> complicated to review, with at least more patches overall.

True. It results in more patches to review. However, I think it is
easier to see the delay added by CQT after it has been pulled into a
separate set of patches.

> 
> Structuring the patches with CQT first and CCR second would take care
> of many of these concerns.
> 
> Sincerely,
> Randy Jennings


^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2026-02-14  3:57 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-30 22:34 [PATCH v2 00/14] TP8028 Rapid Path Failure Recovery Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 01/14] nvmet: Rapid Path Failure Recovery set controller identify fields Mohamed Khalfella
2026-02-03  3:03   ` Hannes Reinecke
2026-02-03 18:14     ` Mohamed Khalfella
2026-02-04  0:34       ` Hannes Reinecke
2026-02-07 13:41         ` Sagi Grimberg
2026-02-14  0:42           ` Randy Jennings
2026-02-14  3:56             ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 02/14] nvmet/debugfs: Add ctrl uniquifier and random values Mohamed Khalfella
2026-02-03  3:04   ` Hannes Reinecke
2026-02-07 13:47   ` Sagi Grimberg
2026-02-11  0:50   ` Randy Jennings
2026-02-11  1:02     ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 03/14] nvmet: Implement CCR nvme command Mohamed Khalfella
2026-02-03  3:19   ` Hannes Reinecke
2026-02-03 18:40     ` Mohamed Khalfella
2026-02-04  0:38       ` Hannes Reinecke
2026-02-04  0:44         ` Mohamed Khalfella
2026-02-04  0:55           ` Hannes Reinecke
2026-02-04 17:52             ` Mohamed Khalfella
2026-02-07 13:58               ` Sagi Grimberg
2026-02-08 23:10                 ` Mohamed Khalfella
2026-02-09 19:27                   ` Mohamed Khalfella
2026-02-11  1:34                     ` Randy Jennings
2026-02-07 14:11   ` Sagi Grimberg
2026-01-30 22:34 ` [PATCH v2 04/14] nvmet: Implement CCR logpage Mohamed Khalfella
2026-02-03  3:21   ` Hannes Reinecke
2026-02-07 14:11   ` Sagi Grimberg
2026-02-11  1:49   ` Randy Jennings
2026-01-30 22:34 ` [PATCH v2 05/14] nvmet: Send an AEN on CCR completion Mohamed Khalfella
2026-02-03  3:27   ` Hannes Reinecke
2026-02-03 18:48     ` Mohamed Khalfella
2026-02-04  0:43       ` Hannes Reinecke
2026-02-07 14:12   ` Sagi Grimberg
2026-02-11  1:52   ` Randy Jennings
2026-01-30 22:34 ` [PATCH v2 06/14] nvme: Rapid Path Failure Recovery read controller identify fields Mohamed Khalfella
2026-02-03  3:28   ` Hannes Reinecke
2026-02-07 14:13   ` Sagi Grimberg
2026-02-11  1:56   ` Randy Jennings
2026-01-30 22:34 ` [PATCH v2 07/14] nvme: Introduce FENCING and FENCED controller states Mohamed Khalfella
2026-02-03  5:07   ` Hannes Reinecke
2026-02-03 19:13     ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 08/14] nvme: Implement cross-controller reset recovery Mohamed Khalfella
2026-02-03  5:19   ` Hannes Reinecke
2026-02-03 20:00     ` Mohamed Khalfella
2026-02-04  1:10       ` Hannes Reinecke
2026-02-04 23:24         ` Mohamed Khalfella
2026-02-11  3:44           ` Randy Jennings
2026-02-11 15:19             ` Hannes Reinecke
2026-02-10 22:09   ` James Smart
2026-02-10 22:27     ` Mohamed Khalfella
2026-02-10 22:49       ` James Smart
2026-02-10 23:25         ` Mohamed Khalfella
2026-02-11  0:12           ` Mohamed Khalfella
2026-02-11  3:33             ` Randy Jennings
2026-01-30 22:34 ` [PATCH v2 09/14] nvme: Implement cross-controller reset completion Mohamed Khalfella
2026-02-03  5:22   ` Hannes Reinecke
2026-02-03 20:07     ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 10/14] nvme-tcp: Use CCR to recover controller that hits an error Mohamed Khalfella
2026-02-03  5:34   ` Hannes Reinecke
2026-02-03 21:24     ` Mohamed Khalfella
2026-02-04  0:48       ` Randy Jennings
2026-02-04  2:57       ` Hannes Reinecke
2026-02-10  1:39         ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 11/14] nvme-rdma: " Mohamed Khalfella
2026-02-03  5:35   ` Hannes Reinecke
2026-01-30 22:34 ` [PATCH v2 12/14] nvme-fc: Decouple error recovery from controller reset Mohamed Khalfella
2026-02-03  5:40   ` Hannes Reinecke
2026-02-03 21:29     ` Mohamed Khalfella
2026-02-03 19:19   ` James Smart
2026-02-03 22:49     ` James Smart
2026-02-04  0:15       ` Mohamed Khalfella
2026-02-04  0:11     ` Mohamed Khalfella
2026-02-05  0:08       ` James Smart
2026-02-05  0:59         ` Mohamed Khalfella
2026-02-09 22:53         ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 13/14] nvme-fc: Use CCR to recover controller that hits an error Mohamed Khalfella
2026-02-03  5:43   ` Hannes Reinecke
2026-02-10 22:12   ` James Smart
2026-02-10 22:20     ` Mohamed Khalfella
2026-02-13 19:29       ` Mohamed Khalfella
2026-01-30 22:34 ` [PATCH v2 14/14] nvme-fc: Hold inflight requests while in FENCING state Mohamed Khalfella

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox