[PATCH v5 0/6] nvme-fabrics: short-circuit connect retries

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 0/6]  nvme-fabrics: short-circuit connect retries
@ 2024-04-09  9:35 Daniel Wagner
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
                   ` (6 more replies)
  0 siblings, 7 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Daniel Wagner

The first patch returns only kernel error codes now and avoids overwriting error
codes later. Thje newly introduced helper for deciding if a reconnect should be
attempted is the only place where we have the logic (and documentation).

On the target side I've separate the nvme status from the dhchap status handling
which made it a bit clearer. I was tempted to refactor the code in
nvmet_execute_auth_send to avoid hitting the 80 chars limit but didn't came up
with something nice yet. So let's keep this change at a minimum before any
refactoring attempts.

I've tested with blktests and also an real hardware for nvme-fc.

changes:
v5:
  - nvme: do not mix kernel error code with nvme status
  - nvmet: separate nvme status from dhchap status
  - https://lore.kernel.org/all/20240404154500.2101-1-dwagner@suse.de/

v4:
  - rebased
  - added 'nvme: fixes for authentication errors' series
    https://lore.kernel.org/linux-nvme/20240301112823.132570-1-hare@kernel.org/

v3:
  - added my SOB tag
  - fixed indention
  - https://lore.kernel.org/linux-nvme/20240305080005.3638-1-dwagner@suse.de/

v2:
  - refresh/rebase on current head
  - extended blktests (nvme/045) to cover this case
    (see separate post)
  - https://lore.kernel.org/linux-nvme/20240304161006.19328-1-dwagner@suse.de/
  
v1:
  - initial version
  - https://lore.kernel.org/linux-nvme/20210623143250.82445-1-hare@suse.de/


Daniel Wagner (1):
  nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts

Hannes Reinecke (5):
  nvme: authentication error are always non-retryable
  nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
  nvme-tcp: short-circuit reconnect retries
  nvme-rdma: short-circuit reconnect retries
  nvmet: return DHCHAP status codes from nvmet_setup_auth()

 drivers/nvme/host/core.c               |  6 ++--
 drivers/nvme/host/fabrics.c            | 25 ++++++-------
 drivers/nvme/host/fc.c                 |  4 +--
 drivers/nvme/host/nvme.h               | 26 +++++++++++++-
 drivers/nvme/host/rdma.c               | 22 ++++++++----
 drivers/nvme/host/tcp.c                | 23 +++++++-----
 drivers/nvme/target/auth.c             | 22 ++++++------
 drivers/nvme/target/configfs.c         | 22 +++++++++---
 drivers/nvme/target/fabrics-cmd-auth.c | 49 +++++++++++++-------------
 drivers/nvme/target/fabrics-cmd.c      | 11 +++---
 drivers/nvme/target/nvmet.h            |  8 ++---
 11 files changed, 134 insertions(+), 84 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
                     ` (2 more replies)
  2024-04-09  9:35 ` [PATCH v5 2/6] nvmet: lock config semaphore when accessing DH-HMAC-CHAP key Daniel Wagner
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Daniel Wagner

From: Hannes Reinecke <hare@suse.de>

Any authentication errors which are generated internally are always
non-retryable, so use negative error codes to ensure they are not
retried.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/host/core.c    |  6 +++---
 drivers/nvme/host/fabrics.c | 25 +++++++++++++------------
 drivers/nvme/host/nvme.h    |  2 +-
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 504dc352c458..66387bcca8ae 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -383,14 +383,14 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
 	if (likely(nvme_req(req)->status == 0))
 		return COMPLETE;
 
-	if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
-		return AUTHENTICATE;
-
 	if (blk_noretry_request(req) ||
 	    (nvme_req(req)->status & NVME_SC_DNR) ||
 	    nvme_req(req)->retries >= nvme_max_retries)
 		return COMPLETE;
 
+	if ((nvme_req(req)->status & 0x7ff) == NVME_SC_AUTH_REQUIRED)
+		return AUTHENTICATE;
+
 	if (req->cmd_flags & REQ_NVME_MPATH) {
 		if (nvme_is_path_error(nvme_req(req)->status) ||
 		    blk_queue_dying(req->q))
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 1f0ea1f32d22..0166f8a3a3eb 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -467,7 +467,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
 		if (result & NVME_CONNECT_AUTHREQ_ASCR) {
 			dev_warn(ctrl->device,
 				 "qid 0: secure concatenation is not supported\n");
-			ret = NVME_SC_AUTH_REQUIRED;
+			ret = -EOPNOTSUPP;
 			goto out_free_data;
 		}
 		/* Authentication required */
@@ -475,14 +475,14 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
 		if (ret) {
 			dev_warn(ctrl->device,
 				 "qid 0: authentication setup failed\n");
-			ret = NVME_SC_AUTH_REQUIRED;
 			goto out_free_data;
 		}
 		ret = nvme_auth_wait(ctrl, 0);
-		if (ret)
+		if (ret) {
 			dev_warn(ctrl->device,
-				 "qid 0: authentication failed\n");
-		else
+				 "qid 0: authentication failed, error %d\n",
+				 ret);
+		} else
 			dev_info(ctrl->device,
 				 "qid 0: authenticated\n");
 	}
@@ -542,7 +542,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
 		if (result & NVME_CONNECT_AUTHREQ_ASCR) {
 			dev_warn(ctrl->device,
 				 "qid 0: secure concatenation is not supported\n");
-			ret = NVME_SC_AUTH_REQUIRED;
+			ret = -EOPNOTSUPP;
 			goto out_free_data;
 		}
 		/* Authentication required */
@@ -550,12 +550,13 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
 		if (ret) {
 			dev_warn(ctrl->device,
 				 "qid %d: authentication setup failed\n", qid);
-			ret = NVME_SC_AUTH_REQUIRED;
-		} else {
-			ret = nvme_auth_wait(ctrl, qid);
-			if (ret)
-				dev_warn(ctrl->device,
-					 "qid %u: authentication failed\n", qid);
+			goto out_free_data;
+		}
+		ret = nvme_auth_wait(ctrl, qid);
+		if (ret) {
+			dev_warn(ctrl->device,
+				 "qid %u: authentication failed, error %d\n",
+				 qid, ret);
 		}
 	}
 out_free_data:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d0ed64dc7380..9b8904a476b8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1122,7 +1122,7 @@ static inline int nvme_auth_negotiate(struct nvme_ctrl *ctrl, int qid)
 }
 static inline int nvme_auth_wait(struct nvme_ctrl *ctrl, int qid)
 {
-	return NVME_SC_AUTH_REQUIRED;
+	return -EPROTONOSUPPORT;
 }
 static inline void nvme_auth_free(struct nvme_ctrl *ctrl) {};
 #endif
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 2/6] nvmet: lock config semaphore when accessing DH-HMAC-CHAP key
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Hannes Reinecke, Daniel Wagner

From: Hannes Reinecke <hare@kernel.org>

When the DH-HMAC-CHAP key is accessed via configfs we need to take the
config semaphore as a reconnect might be running at the same time.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/target/auth.c     |  2 ++
 drivers/nvme/target/configfs.c | 22 +++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/target/auth.c b/drivers/nvme/target/auth.c
index 3ddbc3880cac..9afc28f1ffac 100644
--- a/drivers/nvme/target/auth.c
+++ b/drivers/nvme/target/auth.c
@@ -44,6 +44,7 @@ int nvmet_auth_set_key(struct nvmet_host *host, const char *secret,
 	dhchap_secret = kstrdup(secret, GFP_KERNEL);
 	if (!dhchap_secret)
 		return -ENOMEM;
+	down_write(&nvmet_config_sem);
 	if (set_ctrl) {
 		kfree(host->dhchap_ctrl_secret);
 		host->dhchap_ctrl_secret = strim(dhchap_secret);
@@ -53,6 +54,7 @@ int nvmet_auth_set_key(struct nvmet_host *host, const char *secret,
 		host->dhchap_secret = strim(dhchap_secret);
 		host->dhchap_key_hash = key_hash;
 	}
+	up_write(&nvmet_config_sem);
 	return 0;
 }
 
diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index 77a6e817b315..7c28b9c0ee57 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -1990,11 +1990,17 @@ static struct config_group nvmet_ports_group;
 static ssize_t nvmet_host_dhchap_key_show(struct config_item *item,
 		char *page)
 {
-	u8 *dhchap_secret = to_host(item)->dhchap_secret;
+	u8 *dhchap_secret;
+	ssize_t ret;
 
+	down_read(&nvmet_config_sem);
+	dhchap_secret = to_host(item)->dhchap_secret;
 	if (!dhchap_secret)
-		return sprintf(page, "\n");
-	return sprintf(page, "%s\n", dhchap_secret);
+		ret = sprintf(page, "\n");
+	else
+		ret = sprintf(page, "%s\n", dhchap_secret);
+	up_read(&nvmet_config_sem);
+	return ret;
 }
 
 static ssize_t nvmet_host_dhchap_key_store(struct config_item *item,
@@ -2018,10 +2024,16 @@ static ssize_t nvmet_host_dhchap_ctrl_key_show(struct config_item *item,
 		char *page)
 {
 	u8 *dhchap_secret = to_host(item)->dhchap_ctrl_secret;
+	ssize_t ret;
 
+	down_read(&nvmet_config_sem);
+	dhchap_secret = to_host(item)->dhchap_ctrl_secret;
 	if (!dhchap_secret)
-		return sprintf(page, "\n");
-	return sprintf(page, "%s\n", dhchap_secret);
+		ret = sprintf(page, "\n");
+	else
+		ret = sprintf(page, "%s\n", dhchap_secret);
+	up_read(&nvmet_config_sem);
+	return ret;
 }
 
 static ssize_t nvmet_host_dhchap_ctrl_key_store(struct config_item *item,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
  2024-04-09  9:35 ` [PATCH v5 2/6] nvmet: lock config semaphore when accessing DH-HMAC-CHAP key Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
                     ` (2 more replies)
  2024-04-09  9:35 ` [PATCH v5 4/6] nvme-rdma: " Daniel Wagner
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Daniel Wagner

From: Hannes Reinecke <hare@suse.de>

Returning an nvme status from nvme_tcp_setup_ctrl() indicates that the
association was established and we have received a status from the
controller; consequently we should honour the DNR bit. If not any future
reconnect attempts will just return the same error, so we can
short-circuit the reconnect attempts and fail the connection directly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: add helper to decide to reconnect]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/host/nvme.h | 24 ++++++++++++++++++++++++
 drivers/nvme/host/tcp.c  | 23 +++++++++++++++--------
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9b8904a476b8..dfe103283a3d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -701,6 +701,30 @@ static inline bool nvme_is_path_error(u16 status)
 	return (status & 0x700) == 0x300;
 }
 
+/*
+ * Evaluate the status information returned by the LLDD in order to
+ * decided if a reconnect attempt should be scheduled.
+ *
+ * There are two cases where no reconnect attempt should be attempted:
+ *
+ * 1) The LLDD reports an negative status. There was an error (e.g. no
+ *    memory) on the host side and thus abort the operation.
+ *    Note, there are exception such as ENOTCONN which is
+ *    not an internal driver error, thus we filter these errors
+ *    out and retry later.
+ * 2) The DNR bit is set and the specification states no further
+ *    connect attempts with the same set of paramenters should be
+ *    attempted.
+ */
+static inline bool nvme_ctrl_reconnect(int status)
+{
+	if (status < 0 && status != -ENOTCONN)
+		return false;
+	else if (status > 0 && (status & NVME_SC_DNR))
+		return false;
+	return true;
+}
+
 /*
  * Fill in the status and result information from the CQE, and then figure out
  * if blk-mq will need to use IPI magic to complete the request, and if yes do
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index fdbcdcedcee9..7e25a96e9870 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2155,9 +2155,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
 	nvme_tcp_destroy_io_queues(ctrl, remove);
 }
 
-static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
+static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl,
+		int status)
 {
 	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
+	bool recon = nvme_ctrl_reconnect(status);
 
 	/* If we are resetting/deleting then do nothing */
 	if (state != NVME_CTRL_CONNECTING) {
@@ -2165,13 +2167,14 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
 		return;
 	}
 
-	if (nvmf_should_reconnect(ctrl)) {
+	if (recon && nvmf_should_reconnect(ctrl)) {
 		dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
 			ctrl->opts->reconnect_delay);
 		queue_delayed_work(nvme_wq, &to_tcp_ctrl(ctrl)->connect_work,
 				ctrl->opts->reconnect_delay * HZ);
 	} else {
-		dev_info(ctrl->device, "Removing controller...\n");
+		dev_info(ctrl->device, "Removing controller (%d)...\n",
+			 status);
 		nvme_delete_ctrl(ctrl);
 	}
 }
@@ -2252,10 +2255,12 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
 	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
 			struct nvme_tcp_ctrl, connect_work);
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+	int ret;
 
 	++ctrl->nr_reconnects;
 
-	if (nvme_tcp_setup_ctrl(ctrl, false))
+	ret = nvme_tcp_setup_ctrl(ctrl, false);
+	if (ret)
 		goto requeue;
 
 	dev_info(ctrl->device, "Successfully reconnected (%d attempt)\n",
@@ -2268,7 +2273,7 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
 requeue:
 	dev_info(ctrl->device, "Failed reconnect attempt %d\n",
 			ctrl->nr_reconnects);
-	nvme_tcp_reconnect_or_remove(ctrl);
+	nvme_tcp_reconnect_or_remove(ctrl, ret);
 }
 
 static void nvme_tcp_error_recovery_work(struct work_struct *work)
@@ -2295,7 +2300,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
-	nvme_tcp_reconnect_or_remove(ctrl);
+	nvme_tcp_reconnect_or_remove(ctrl, -ENOTCONN);
 }
 
 static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
@@ -2315,6 +2320,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 {
 	struct nvme_ctrl *ctrl =
 		container_of(work, struct nvme_ctrl, reset_work);
+	int ret;
 
 	nvme_stop_ctrl(ctrl);
 	nvme_tcp_teardown_ctrl(ctrl, false);
@@ -2328,14 +2334,15 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 		return;
 	}
 
-	if (nvme_tcp_setup_ctrl(ctrl, false))
+	ret = nvme_tcp_setup_ctrl(ctrl, false);
+	if (ret)
 		goto out_fail;
 
 	return;
 
 out_fail:
 	++ctrl->nr_reconnects;
-	nvme_tcp_reconnect_or_remove(ctrl);
+	nvme_tcp_reconnect_or_remove(ctrl, ret);
 }
 
 static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
                   ` (2 preceding siblings ...)
  2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09 14:00   ` Christoph Hellwig
  2024-04-09 20:28   ` Sagi Grimberg
  2024-04-09  9:35 ` [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts Daniel Wagner
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Chaitanya Kulkarni, Daniel Wagner

From: Hannes Reinecke <hare@suse.de>

Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the
association was established and we have received a status from the
controller; consequently we should honour the DNR bit. If not any future
reconnect attempts will just return the same error, so we can
short-circuit the reconnect attempts and fail the connection directly.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/host/rdma.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 366f0bb4ebfc..53d51df26ae1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -982,9 +982,11 @@ static void nvme_rdma_free_ctrl(struct nvme_ctrl *nctrl)
 	kfree(ctrl);
 }
 
-static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
+static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl,
+		int status)
 {
 	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
+	bool recon = nvme_ctrl_reconnect(status);
 
 	/* If we are resetting/deleting then do nothing */
 	if (state != NVME_CTRL_CONNECTING) {
@@ -992,12 +994,14 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl)
 		return;
 	}
 
-	if (nvmf_should_reconnect(&ctrl->ctrl)) {
+	if (recon && nvmf_should_reconnect(&ctrl->ctrl)) {
 		dev_info(ctrl->ctrl.device, "Reconnecting in %d seconds...\n",
 			ctrl->ctrl.opts->reconnect_delay);
 		queue_delayed_work(nvme_wq, &ctrl->reconnect_work,
 				ctrl->ctrl.opts->reconnect_delay * HZ);
 	} else {
+		dev_info(ctrl->ctrl.device, "Removing controller (%d)...\n",
+			 status);
 		nvme_delete_ctrl(&ctrl->ctrl);
 	}
 }
@@ -1104,10 +1108,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
 			struct nvme_rdma_ctrl, reconnect_work);
+	int ret;
 
 	++ctrl->ctrl.nr_reconnects;
 
-	if (nvme_rdma_setup_ctrl(ctrl, false))
+	ret = nvme_rdma_setup_ctrl(ctrl, false);
+	if (ret)
 		goto requeue;
 
 	dev_info(ctrl->ctrl.device, "Successfully reconnected (%d attempts)\n",
@@ -1120,7 +1126,7 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 requeue:
 	dev_info(ctrl->ctrl.device, "Failed reconnect attempt %d\n",
 			ctrl->ctrl.nr_reconnects);
-	nvme_rdma_reconnect_or_remove(ctrl);
+	nvme_rdma_reconnect_or_remove(ctrl, ret);
 }
 
 static void nvme_rdma_error_recovery_work(struct work_struct *work)
@@ -1145,7 +1151,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
-	nvme_rdma_reconnect_or_remove(ctrl);
+	nvme_rdma_reconnect_or_remove(ctrl, -ENOTCONN);
 }
 
 static void nvme_rdma_error_recovery(struct nvme_rdma_ctrl *ctrl)
@@ -2169,6 +2175,7 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 {
 	struct nvme_rdma_ctrl *ctrl =
 		container_of(work, struct nvme_rdma_ctrl, ctrl.reset_work);
+	int ret;
 
 	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_shutdown_ctrl(ctrl, false);
@@ -2179,14 +2186,15 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 		return;
 	}
 
-	if (nvme_rdma_setup_ctrl(ctrl, false))
+	ret = nvme_rdma_setup_ctrl(ctrl, false);
+	if (ret)
 		goto out_fail;
 
 	return;
 
 out_fail:
 	++ctrl->ctrl.nr_reconnects;
-	nvme_rdma_reconnect_or_remove(ctrl);
+	nvme_rdma_reconnect_or_remove(ctrl, ret);
 }
 
 static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
                   ` (3 preceding siblings ...)
  2024-04-09  9:35 ` [PATCH v5 4/6] nvme-rdma: " Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09 14:01   ` Christoph Hellwig
  2024-04-09  9:35 ` [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth() Daniel Wagner
  2024-04-12  0:35 ` [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Keith Busch
  6 siblings, 1 reply; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Daniel Wagner

The fc transport handles the DNR for reconnect correctly, though it
ignores all the negative error code returned by the LLDD. Thus, use the
nvme_ctrl_reconnect helper to ensure to have consistent behavior for all
transports.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/host/fc.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index a5b29e9ad342..a58f08304459 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3301,7 +3301,7 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
 	struct nvme_fc_rport *rport = ctrl->rport;
 	struct nvme_fc_remote_port *portptr = &rport->remoteport;
 	unsigned long recon_delay = ctrl->ctrl.opts->reconnect_delay * HZ;
-	bool recon = true;
+	bool recon = nvme_ctrl_reconnect(status);
 
 	if (nvme_ctrl_state(&ctrl->ctrl) != NVME_CTRL_CONNECTING)
 		return;
@@ -3310,8 +3310,6 @@ nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
 		dev_info(ctrl->ctrl.device,
 			"NVME-FC{%d}: reset: Reconnect attempt failed (%d)\n",
 			ctrl->cnum, status);
-		if (status > 0 && (status & NVME_SC_DNR))
-			recon = false;
 	} else if (time_after_eq(jiffies, rport->dev_loss_end))
 		recon = false;
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth()
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
                   ` (4 preceding siblings ...)
  2024-04-09  9:35 ` [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts Daniel Wagner
@ 2024-04-09  9:35 ` Daniel Wagner
  2024-04-09 14:03   ` Christoph Hellwig
  2024-04-09 20:23   ` Sagi Grimberg
  2024-04-12  0:35 ` [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Keith Busch
  6 siblings, 2 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-09  9:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Hannes Reinecke, Daniel Wagner

From: Hannes Reinecke <hare@kernel.org>

A failure in nvmet_setup_auth() does not mean that the NVMe
authentication command failed, so we should rather return a protocol
error with a 'failure1' response than an NVMe status.

Also update the type used for dhchap_step and dhchap_status to u8 to
avoid confusions with nvme status. Furthermore, split dhchap_status and
nvme status so we don't accidentally mix these return values.

Signed-off-by: Hannes Reinecke <hare@suse.de>
[dwagner: - use u8 as type for dhchap_{step|status}
          - separate nvme status from dhcap_status]
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 drivers/nvme/target/auth.c             | 20 +++++------
 drivers/nvme/target/fabrics-cmd-auth.c | 49 +++++++++++++-------------
 drivers/nvme/target/fabrics-cmd.c      | 11 +++---
 drivers/nvme/target/nvmet.h            |  8 ++---
 4 files changed, 43 insertions(+), 45 deletions(-)

diff --git a/drivers/nvme/target/auth.c b/drivers/nvme/target/auth.c
index 9afc28f1ffac..53bf1a084469 100644
--- a/drivers/nvme/target/auth.c
+++ b/drivers/nvme/target/auth.c
@@ -126,12 +126,11 @@ int nvmet_setup_dhgroup(struct nvmet_ctrl *ctrl, u8 dhgroup_id)
 	return ret;
 }
 
-int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
+u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 {
 	int ret = 0;
 	struct nvmet_host_link *p;
 	struct nvmet_host *host = NULL;
-	const char *hash_name;
 
 	down_read(&nvmet_config_sem);
 	if (nvmet_is_disc_subsys(ctrl->subsys))
@@ -149,13 +148,16 @@ int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 	}
 	if (!host) {
 		pr_debug("host %s not found\n", ctrl->hostnqn);
-		ret = -EPERM;
+		ret = NVME_AUTH_DHCHAP_FAILURE_FAILED;
 		goto out_unlock;
 	}
 
 	ret = nvmet_setup_dhgroup(ctrl, host->dhchap_dhgroup_id);
-	if (ret < 0)
+	if (ret < 0) {
 		pr_warn("Failed to setup DH group");
+		ret = NVME_AUTH_DHCHAP_FAILURE_DHGROUP_UNUSABLE;
+		goto out_unlock;
+	}
 
 	if (!host->dhchap_secret) {
 		pr_debug("No authentication provided\n");
@@ -166,12 +168,6 @@ int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 		pr_debug("Re-use existing hash ID %d\n",
 			 ctrl->shash_id);
 	} else {
-		hash_name = nvme_auth_hmac_name(host->dhchap_hash_id);
-		if (!hash_name) {
-			pr_warn("Hash ID %d invalid\n", host->dhchap_hash_id);
-			ret = -EINVAL;
-			goto out_unlock;
-		}
 		ctrl->shash_id = host->dhchap_hash_id;
 	}
 
@@ -180,7 +176,7 @@ int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 	ctrl->host_key = nvme_auth_extract_key(host->dhchap_secret + 10,
 					       host->dhchap_key_hash);
 	if (IS_ERR(ctrl->host_key)) {
-		ret = PTR_ERR(ctrl->host_key);
+		ret = NVME_AUTH_DHCHAP_FAILURE_NOT_USABLE;
 		ctrl->host_key = NULL;
 		goto out_free_hash;
 	}
@@ -198,7 +194,7 @@ int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 	ctrl->ctrl_key = nvme_auth_extract_key(host->dhchap_ctrl_secret + 10,
 					       host->dhchap_ctrl_key_hash);
 	if (IS_ERR(ctrl->ctrl_key)) {
-		ret = PTR_ERR(ctrl->ctrl_key);
+		ret = NVME_AUTH_DHCHAP_FAILURE_NOT_USABLE;
 		ctrl->ctrl_key = NULL;
 		goto out_free_hash;
 	}
diff --git a/drivers/nvme/target/fabrics-cmd-auth.c b/drivers/nvme/target/fabrics-cmd-auth.c
index eb7785be0ca7..d61b8c6ff3b2 100644
--- a/drivers/nvme/target/fabrics-cmd-auth.c
+++ b/drivers/nvme/target/fabrics-cmd-auth.c
@@ -31,7 +31,7 @@ void nvmet_auth_sq_init(struct nvmet_sq *sq)
 	sq->dhchap_step = NVME_AUTH_DHCHAP_MESSAGE_NEGOTIATE;
 }
 
-static u16 nvmet_auth_negotiate(struct nvmet_req *req, void *d)
+static u8 nvmet_auth_negotiate(struct nvmet_req *req, void *d)
 {
 	struct nvmet_ctrl *ctrl = req->sq->ctrl;
 	struct nvmf_auth_dhchap_negotiate_data *data = d;
@@ -109,7 +109,7 @@ static u16 nvmet_auth_negotiate(struct nvmet_req *req, void *d)
 	return 0;
 }
 
-static u16 nvmet_auth_reply(struct nvmet_req *req, void *d)
+static u8 nvmet_auth_reply(struct nvmet_req *req, void *d)
 {
 	struct nvmet_ctrl *ctrl = req->sq->ctrl;
 	struct nvmf_auth_dhchap_reply_data *data = d;
@@ -172,7 +172,7 @@ static u16 nvmet_auth_reply(struct nvmet_req *req, void *d)
 	return 0;
 }
 
-static u16 nvmet_auth_failure2(void *d)
+static u8 nvmet_auth_failure2(void *d)
 {
 	struct nvmf_auth_dhchap_failure_data *data = d;
 
@@ -186,6 +186,7 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
 	void *d;
 	u32 tl;
 	u16 status = 0;
+	u8 dhchap_status;
 
 	if (req->cmd->auth_send.secp != NVME_AUTH_DHCHAP_PROTOCOL_IDENTIFIER) {
 		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
@@ -237,30 +238,32 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
 	if (data->auth_type == NVME_AUTH_COMMON_MESSAGES) {
 		if (data->auth_id == NVME_AUTH_DHCHAP_MESSAGE_NEGOTIATE) {
 			/* Restart negotiation */
-			pr_debug("%s: ctrl %d qid %d reset negotiation\n", __func__,
-				 ctrl->cntlid, req->sq->qid);
+			pr_debug("%s: ctrl %d qid %d reset negotiation\n",
+				 __func__, ctrl->cntlid, req->sq->qid);
 			if (!req->sq->qid) {
-				if (nvmet_setup_auth(ctrl) < 0) {
-					status = NVME_SC_INTERNAL;
-					pr_err("ctrl %d qid 0 failed to setup"
-					       "re-authentication",
+				dhchap_status = nvmet_setup_auth(ctrl);
+				if (dhchap_status) {
+					pr_err("ctrl %d qid 0 failed to setup re-authentication\n",
 					       ctrl->cntlid);
-					goto done_failure1;
+					req->sq->dhchap_status = dhchap_status;
+					req->sq->dhchap_step =
+						NVME_AUTH_DHCHAP_MESSAGE_FAILURE1;
+					goto done_kfree;
 				}
 			}
-			req->sq->dhchap_step = NVME_AUTH_DHCHAP_MESSAGE_NEGOTIATE;
+			req->sq->dhchap_step =
+				NVME_AUTH_DHCHAP_MESSAGE_NEGOTIATE;
 		} else if (data->auth_id != req->sq->dhchap_step)
 			goto done_failure1;
 		/* Validate negotiation parameters */
-		status = nvmet_auth_negotiate(req, d);
-		if (status == 0)
+		dhchap_status = nvmet_auth_negotiate(req, d);
+		if (dhchap_status == 0)
 			req->sq->dhchap_step =
 				NVME_AUTH_DHCHAP_MESSAGE_CHALLENGE;
 		else {
 			req->sq->dhchap_step =
 				NVME_AUTH_DHCHAP_MESSAGE_FAILURE1;
-			req->sq->dhchap_status = status;
-			status = 0;
+			req->sq->dhchap_status = dhchap_status;
 		}
 		goto done_kfree;
 	}
@@ -284,15 +287,14 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
 
 	switch (data->auth_id) {
 	case NVME_AUTH_DHCHAP_MESSAGE_REPLY:
-		status = nvmet_auth_reply(req, d);
-		if (status == 0)
+		dhchap_status = nvmet_auth_reply(req, d);
+		if (dhchap_status == 0)
 			req->sq->dhchap_step =
 				NVME_AUTH_DHCHAP_MESSAGE_SUCCESS1;
 		else {
 			req->sq->dhchap_step =
 				NVME_AUTH_DHCHAP_MESSAGE_FAILURE1;
-			req->sq->dhchap_status = status;
-			status = 0;
+			req->sq->dhchap_status = dhchap_status;
 		}
 		goto done_kfree;
 	case NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2:
@@ -301,13 +303,12 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
 			 __func__, ctrl->cntlid, req->sq->qid);
 		goto done_kfree;
 	case NVME_AUTH_DHCHAP_MESSAGE_FAILURE2:
-		status = nvmet_auth_failure2(d);
-		if (status) {
+		dhchap_status = nvmet_auth_failure2(d);
+		if (dhchap_status) {
 			pr_warn("ctrl %d qid %d: authentication failed (%d)\n",
-				ctrl->cntlid, req->sq->qid, status);
-			req->sq->dhchap_status = status;
+				ctrl->cntlid, req->sq->qid, dhchap_status);
+			req->sq->dhchap_status = dhchap_status;
 			req->sq->authenticated = false;
-			status = 0;
 		}
 		goto done_kfree;
 	default:
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index b23f4cf840bd..042b379cbb36 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -211,7 +211,7 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 	struct nvmf_connect_data *d;
 	struct nvmet_ctrl *ctrl = NULL;
 	u16 status;
-	int ret;
+	u8 dhchap_status;
 
 	if (!nvmet_check_transfer_len(req, sizeof(struct nvmf_connect_data)))
 		return;
@@ -254,11 +254,12 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
 
 	uuid_copy(&ctrl->hostid, &d->hostid);
 
-	ret = nvmet_setup_auth(ctrl);
-	if (ret < 0) {
-		pr_err("Failed to setup authentication, error %d\n", ret);
+	dhchap_status = nvmet_setup_auth(ctrl);
+	if (dhchap_status) {
+		pr_err("Failed to setup authentication, dhchap status %u\n",
+		       dhchap_status);
 		nvmet_ctrl_put(ctrl);
-		if (ret == -EPERM)
+		if (dhchap_status == NVME_AUTH_DHCHAP_FAILURE_FAILED)
 			status = (NVME_SC_CONNECT_INVALID_HOST | NVME_SC_DNR);
 		else
 			status = NVME_SC_INTERNAL;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 257cace53924..5d6adb5c2fc0 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -113,8 +113,8 @@ struct nvmet_sq {
 	bool			authenticated;
 	struct delayed_work	auth_expired_work;
 	u16			dhchap_tid;
-	u16			dhchap_status;
-	int			dhchap_step;
+	u8			dhchap_status;
+	u8			dhchap_step;
 	u8			*dhchap_c1;
 	u8			*dhchap_c2;
 	u32			dhchap_s1;
@@ -721,7 +721,7 @@ void nvmet_execute_auth_receive(struct nvmet_req *req);
 int nvmet_auth_set_key(struct nvmet_host *host, const char *secret,
 		       bool set_ctrl);
 int nvmet_auth_set_host_hash(struct nvmet_host *host, const char *hash);
-int nvmet_setup_auth(struct nvmet_ctrl *ctrl);
+u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl);
 void nvmet_auth_sq_init(struct nvmet_sq *sq);
 void nvmet_destroy_auth(struct nvmet_ctrl *ctrl);
 void nvmet_auth_sq_free(struct nvmet_sq *sq);
@@ -740,7 +740,7 @@ int nvmet_auth_ctrl_exponential(struct nvmet_req *req,
 int nvmet_auth_ctrl_sesskey(struct nvmet_req *req,
 			    u8 *buf, int buf_size);
 #else
-static inline int nvmet_setup_auth(struct nvmet_ctrl *ctrl)
+static inline u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl)
 {
 	return 0;
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
@ 2024-04-09 13:59   ` Christoph Hellwig
  2024-04-09 20:16   ` Sagi Grimberg
  2024-04-09 20:26   ` Sagi Grimberg
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2024-04-09 13:59 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, James Smart,
	Hannes Reinecke, linux-nvme, linux-kernel

On Tue, Apr 09, 2024 at 11:35:05AM +0200, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> Any authentication errors which are generated internally are always
> non-retryable, so use negative error codes to ensure they are not
> retried.

Thanks, this looks much better now:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
@ 2024-04-09 13:59   ` Christoph Hellwig
  2024-04-09 20:20   ` Sagi Grimberg
  2024-04-09 20:40   ` Sagi Grimberg
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2024-04-09 13:59 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, James Smart,
	Hannes Reinecke, linux-nvme, linux-kernel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09  9:35 ` [PATCH v5 4/6] nvme-rdma: " Daniel Wagner
@ 2024-04-09 14:00   ` Christoph Hellwig
  2024-04-09 14:19     ` Keith Busch
  2024-04-09 20:28   ` Sagi Grimberg
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2024-04-09 14:00 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, James Smart,
	Hannes Reinecke, linux-nvme, linux-kernel, Chaitanya Kulkarni

On Tue, Apr 09, 2024 at 11:35:08AM +0200, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the

Shouldn't this an be an a based on my highschool english.  Or does
Eeenvme count as a vowel?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts
  2024-04-09  9:35 ` [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts Daniel Wagner
@ 2024-04-09 14:01   ` Christoph Hellwig
  0 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2024-04-09 14:01 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, James Smart,
	Hannes Reinecke, linux-nvme, linux-kernel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth()
  2024-04-09  9:35 ` [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth() Daniel Wagner
@ 2024-04-09 14:03   ` Christoph Hellwig
  2024-04-09 20:23   ` Sagi Grimberg
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2024-04-09 14:03 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, Sagi Grimberg, James Smart,
	Hannes Reinecke, linux-nvme, linux-kernel, Hannes Reinecke

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09 14:00   ` Christoph Hellwig
@ 2024-04-09 14:19     ` Keith Busch
  2024-04-09 20:21       ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2024-04-09 14:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Daniel Wagner, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Chaitanya Kulkarni

On Tue, Apr 09, 2024 at 04:00:54PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 09, 2024 at 11:35:08AM +0200, Daniel Wagner wrote:
> > From: Hannes Reinecke <hare@suse.de>
> > 
> > Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the
> 
> Shouldn't this an be an a based on my highschool english.  Or does
> Eeenvme count as a vowel?

It depends on how you hear it when you read it. If you automatically
expand the acronym, "Non-Volatile ...", then it should get the "a"
article.

If you instead try to pronounce "nvme" directly, it sounds like you're
saying "envy me", like commanding everyone to acknowledge your
awesomeness. Not sure if they had that in mind when deciding on the
name, but it's kind of amusing. Anyway, pronounce it that way, it gets
an "an". :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
@ 2024-04-09 20:16   ` Sagi Grimberg
  2024-04-09 20:26   ` Sagi Grimberg
  2 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:16 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
@ 2024-04-09 20:20   ` Sagi Grimberg
  2024-04-10  6:01     ` Hannes Reinecke
  2024-04-09 20:40   ` Sagi Grimberg
  2 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:20 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Returning an nvme status from nvme_tcp_setup_ctrl() indicates that the
> association was established and we have received a status from the
> controller; consequently we should honour the DNR bit. If not any future
> reconnect attempts will just return the same error, so we can
> short-circuit the reconnect attempts and fail the connection directly.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> [dwagner: add helper to decide to reconnect]
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>   drivers/nvme/host/nvme.h | 24 ++++++++++++++++++++++++
>   drivers/nvme/host/tcp.c  | 23 +++++++++++++++--------
>   2 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9b8904a476b8..dfe103283a3d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -701,6 +701,30 @@ static inline bool nvme_is_path_error(u16 status)
>   	return (status & 0x700) == 0x300;
>   }
>   
> +/*
> + * Evaluate the status information returned by the LLDD in order to
> + * decided if a reconnect attempt should be scheduled.
> + *
> + * There are two cases where no reconnect attempt should be attempted:
> + *
> + * 1) The LLDD reports an negative status. There was an error (e.g. no
> + *    memory) on the host side and thus abort the operation.
> + *    Note, there are exception such as ENOTCONN which is
> + *    not an internal driver error, thus we filter these errors
> + *    out and retry later.
> + * 2) The DNR bit is set and the specification states no further
> + *    connect attempts with the same set of paramenters should be
> + *    attempted.
> + */
> +static inline bool nvme_ctrl_reconnect(int status)
> +{
> +	if (status < 0 && status != -ENOTCONN)
> +		return false;

So if the host failed to allocate a buffer it will never attempt
another reconnect? doesn't sound right to me..,

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09 14:19     ` Keith Busch
@ 2024-04-09 20:21       ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:21 UTC (permalink / raw)
  To: Keith Busch, Christoph Hellwig
  Cc: Daniel Wagner, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel, Chaitanya Kulkarni



On 09/04/2024 17:19, Keith Busch wrote:
> On Tue, Apr 09, 2024 at 04:00:54PM +0200, Christoph Hellwig wrote:
>> On Tue, Apr 09, 2024 at 11:35:08AM +0200, Daniel Wagner wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the
>> Shouldn't this an be an a based on my highschool english.  Or does
>> Eeenvme count as a vowel?
> It depends on how you hear it when you read it. If you automatically
> expand the acronym, "Non-Volatile ...", then it should get the "a"
> article.
>
> If you instead try to pronounce "nvme" directly, it sounds like you're
> saying "envy me", like commanding everyone to acknowledge your
> awesomeness. Not sure if they had that in mind when deciding on the
> name, but it's kind of amusing. Anyway, pronounce it that way, it gets
> an "an". :)

:)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth()
  2024-04-09  9:35 ` [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth() Daniel Wagner
  2024-04-09 14:03   ` Christoph Hellwig
@ 2024-04-09 20:23   ` Sagi Grimberg
  2024-04-10  6:45     ` Hannes Reinecke
  1 sibling, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:23 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel, Hannes Reinecke



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@kernel.org>
>
> A failure in nvmet_setup_auth() does not mean that the NVMe
> authentication command failed, so we should rather return a protocol
> error with a 'failure1' response than an NVMe status.
>
> Also update the type used for dhchap_step and dhchap_status to u8 to
> avoid confusions with nvme status. Furthermore, split dhchap_status and
> nvme status so we don't accidentally mix these return values.

What is the implications of this on the host behavior? In other
words, why is this a part of the series?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
  2024-04-09 20:16   ` Sagi Grimberg
@ 2024-04-09 20:26   ` Sagi Grimberg
  2024-04-10  6:52     ` Daniel Wagner
  2 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:26 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Any authentication errors which are generated internally are always
> non-retryable, so use negative error codes to ensure they are not
> retried.

The patch title says that any authentication error is not retryable, and
the patch body says "authentication errors which are generated locally
are non-retryable" so which one is it?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09  9:35 ` [PATCH v5 4/6] nvme-rdma: " Daniel Wagner
  2024-04-09 14:00   ` Christoph Hellwig
@ 2024-04-09 20:28   ` Sagi Grimberg
  2024-04-12  2:50     ` Keith Busch
  1 sibling, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:28 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel, Chaitanya Kulkarni



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the
> association was established and we have received a status from the
> controller; consequently we should honour
honor ?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
  2024-04-09 13:59   ` Christoph Hellwig
  2024-04-09 20:20   ` Sagi Grimberg
@ 2024-04-09 20:40   ` Sagi Grimberg
  2024-04-10 10:06     ` Daniel Wagner
  2 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-09 20:40 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel



On 09/04/2024 12:35, Daniel Wagner wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> Returning an nvme status from nvme_tcp_setup_ctrl() indicates that the
> association was established and we have received a status from the
> controller; consequently we should honour the DNR bit. If not any future
> reconnect attempts will just return the same error, so we can
> short-circuit the reconnect attempts and fail the connection directly.
>
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> [dwagner: add helper to decide to reconnect]
> Signed-off-by: Daniel Wagner <dwagner@suse.de>
> ---
>   drivers/nvme/host/nvme.h | 24 ++++++++++++++++++++++++
>   drivers/nvme/host/tcp.c  | 23 +++++++++++++++--------
>   2 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 9b8904a476b8..dfe103283a3d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -701,6 +701,30 @@ static inline bool nvme_is_path_error(u16 status)
>   	return (status & 0x700) == 0x300;
>   }
>   
> +/*
> + * Evaluate the status information returned by the LLDD

We usually say transport, less so LLDD.

>   in order to
> + * decided if a reconnect attempt should be scheduled.
> + *
> + * There are two cases where no reconnect attempt should be attempted:
> + *
> + * 1) The LLDD reports an negative status. There was an error (e.g. no
> + *    memory) on the host side and thus abort the operation.
> + *    Note, there are exception such as ENOTCONN which is
> + *    not an internal driver error, thus we filter these errors
> + *    out and retry later.

What is this ENOTCONN here? From what I see its just a random
escape value to bypass the status check. I think that 0 would do the same
wouldn't it?

> + * 2) The DNR bit is set and the specification states no further
> + *    connect attempts with the same set of paramenters should be
> + *    attempted.
> + */
> +static inline bool nvme_ctrl_reconnect(int status)
> +{
> +	if (status < 0 && status != -ENOTCONN)
> +		return false;
> +	else if (status > 0 && (status & NVME_SC_DNR))
> +		return false;
> +	return true;
> +}
> +
>   /*
>    * Fill in the status and result information from the CQE, and then figure out
>    * if blk-mq will need to use IPI magic to complete the request, and if yes do
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index fdbcdcedcee9..7e25a96e9870 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2155,9 +2155,11 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>   	nvme_tcp_destroy_io_queues(ctrl, remove);
>   }
>   
> -static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
> +static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl,
> +		int status)
>   {
>   	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
> +	bool recon = nvme_ctrl_reconnect(status);
>   
>   	/* If we are resetting/deleting then do nothing */
>   	if (state != NVME_CTRL_CONNECTING) {
> @@ -2165,13 +2167,14 @@ static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
>   		return;
>   	}
>   
> -	if (nvmf_should_reconnect(ctrl)) {
> +	if (recon && nvmf_should_reconnect(ctrl)) {

its weird to have a condition that checks _and_ condition
of two different "should reconnect" conditions.

Why don't we simply pass the status to nvmf_should_reconnect()
and evaluate there?

>   		dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
>   			ctrl->opts->reconnect_delay);
>   		queue_delayed_work(nvme_wq, &to_tcp_ctrl(ctrl)->connect_work,
>   				ctrl->opts->reconnect_delay * HZ);
>   	} else {
> -		dev_info(ctrl->device, "Removing controller...\n");
> +		dev_info(ctrl->device, "Removing controller (%d)...\n",
> +			 status);
>   		nvme_delete_ctrl(ctrl);
>   	}
>   }
> @@ -2252,10 +2255,12 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
>   	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
>   			struct nvme_tcp_ctrl, connect_work);
>   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
> +	int ret;
>   
>   	++ctrl->nr_reconnects;
>   
> -	if (nvme_tcp_setup_ctrl(ctrl, false))
> +	ret = nvme_tcp_setup_ctrl(ctrl, false);
> +	if (ret)
>   		goto requeue;
>   
>   	dev_info(ctrl->device, "Successfully reconnected (%d attempt)\n",
> @@ -2268,7 +2273,7 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
>   requeue:
>   	dev_info(ctrl->device, "Failed reconnect attempt %d\n",
>   			ctrl->nr_reconnects);
> -	nvme_tcp_reconnect_or_remove(ctrl);
> +	nvme_tcp_reconnect_or_remove(ctrl, ret);
>   }
>   
>   static void nvme_tcp_error_recovery_work(struct work_struct *work)
> @@ -2295,7 +2300,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	nvme_tcp_reconnect_or_remove(ctrl);
> +	nvme_tcp_reconnect_or_remove(ctrl, -ENOTCONN);
>   }
>   
>   static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
> @@ -2315,6 +2320,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>   {
>   	struct nvme_ctrl *ctrl =
>   		container_of(work, struct nvme_ctrl, reset_work);
> +	int ret;
>   
>   	nvme_stop_ctrl(ctrl);
>   	nvme_tcp_teardown_ctrl(ctrl, false);
> @@ -2328,14 +2334,15 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>   		return;
>   	}
>   
> -	if (nvme_tcp_setup_ctrl(ctrl, false))
> +	ret = nvme_tcp_setup_ctrl(ctrl, false);
> +	if (ret)
>   		goto out_fail;
>   
>   	return;
>   
>   out_fail:
>   	++ctrl->nr_reconnects;
> -	nvme_tcp_reconnect_or_remove(ctrl);
> +	nvme_tcp_reconnect_or_remove(ctrl, ret);
>   }
>   
>   static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09 20:20   ` Sagi Grimberg
@ 2024-04-10  6:01     ` Hannes Reinecke
  0 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2024-04-10  6:01 UTC (permalink / raw)
  To: Sagi Grimberg, Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, linux-nvme, linux-kernel

On 4/9/24 22:20, Sagi Grimberg wrote:
> 
> 
> On 09/04/2024 12:35, Daniel Wagner wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> Returning an nvme status from nvme_tcp_setup_ctrl() indicates that the
>> association was established and we have received a status from the
>> controller; consequently we should honour the DNR bit. If not any future
>> reconnect attempts will just return the same error, so we can
>> short-circuit the reconnect attempts and fail the connection directly.
>>
>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> [dwagner: add helper to decide to reconnect]
>> Signed-off-by: Daniel Wagner <dwagner@suse.de>
>> ---
>>   drivers/nvme/host/nvme.h | 24 ++++++++++++++++++++++++
>>   drivers/nvme/host/tcp.c  | 23 +++++++++++++++--------
>>   2 files changed, 39 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index 9b8904a476b8..dfe103283a3d 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -701,6 +701,30 @@ static inline bool nvme_is_path_error(u16 status)
>>       return (status & 0x700) == 0x300;
>>   }
>> +/*
>> + * Evaluate the status information returned by the LLDD in order to
>> + * decided if a reconnect attempt should be scheduled.
>> + *
>> + * There are two cases where no reconnect attempt should be attempted:
>> + *
>> + * 1) The LLDD reports an negative status. There was an error (e.g. no
>> + *    memory) on the host side and thus abort the operation.
>> + *    Note, there are exception such as ENOTCONN which is
>> + *    not an internal driver error, thus we filter these errors
>> + *    out and retry later.
>> + * 2) The DNR bit is set and the specification states no further
>> + *    connect attempts with the same set of paramenters should be
>> + *    attempted.
>> + */
>> +static inline bool nvme_ctrl_reconnect(int status)
>> +{
>> +    if (status < 0 && status != -ENOTCONN)
>> +        return false;
> 
> So if the host failed to allocate a buffer it will never attempt
> another reconnect? doesn't sound right to me..,

Memory allocation errors are always tricky. If a system is under memory
pressure yours won't be the only process which will exhibit issues,
so it's questionable whether you should retry, or whether it wouldn't
be safer to just abort the operation, and retry manually once the
system has recovered.
Also we just have the interesting case where RDMA goes haywire and
reports a log buffer size of several TB. No amount of retries will
be fixing that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth()
  2024-04-09 20:23   ` Sagi Grimberg
@ 2024-04-10  6:45     ` Hannes Reinecke
  0 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2024-04-10  6:45 UTC (permalink / raw)
  To: Sagi Grimberg, Daniel Wagner, Christoph Hellwig
  Cc: Keith Busch, James Smart, linux-nvme, linux-kernel,
	Hannes Reinecke

On 4/9/24 22:23, Sagi Grimberg wrote:
> 
> 
> On 09/04/2024 12:35, Daniel Wagner wrote:
>> From: Hannes Reinecke <hare@kernel.org>
>>
>> A failure in nvmet_setup_auth() does not mean that the NVMe
>> authentication command failed, so we should rather return a protocol
>> error with a 'failure1' response than an NVMe status.
>>
>> Also update the type used for dhchap_step and dhchap_status to u8 to
>> avoid confusions with nvme status. Furthermore, split dhchap_status and
>> nvme status so we don't accidentally mix these return values.
> 
> What is the implications of this on the host behavior? In other
> words, why is this a part of the series?

This issue came up during testing the series, where we found that the 
target would cause connection failure (ie return the NVMe CQE with a
status code) rather than a protocol error (ie return the NVMe CQE
with a 'good' status, and set the 'failure1' status for DH-CHAP).
And with that a termination of the DH-CHAP protocol, and not
a connection reset.
So not directly related, but required to get the testcase for this
series working.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-09 20:26   ` Sagi Grimberg
@ 2024-04-10  6:52     ` Daniel Wagner
  2024-04-10 10:21       ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Wagner @ 2024-04-10  6:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
> 
> 
> On 09/04/2024 12:35, Daniel Wagner wrote:
> > From: Hannes Reinecke <hare@suse.de>
> > 
> > Any authentication errors which are generated internally are always
> > non-retryable, so use negative error codes to ensure they are not
> > retried.
> 
> The patch title says that any authentication error is not retryable, and
> the patch body says "authentication errors which are generated locally
> are non-retryable" so which one is it?

Forgot to update the commit message. What about:

  All authentication errors are non-retryable, so use negative error
  codes to ensure they are not retried.

?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries
  2024-04-09 20:40   ` Sagi Grimberg
@ 2024-04-10 10:06     ` Daniel Wagner
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Wagner @ 2024-04-10 10:06 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Keith Busch, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue, Apr 09, 2024 at 11:40:03PM +0300, Sagi Grimberg wrote:
> > +/*
> > + * Evaluate the status information returned by the LLDD
> 
> We usually say transport, less so LLDD.

Updated the documentation accordingly.

> > + * 1) The LLDD reports an negative status. There was an error (e.g. no
> > + *    memory) on the host side and thus abort the operation.
> > + *    Note, there are exception such as ENOTCONN which is
> > + *    not an internal driver error, thus we filter these errors
> > + *    out and retry later.
>
> What is this ENOTCONN here? From what I see its just a random
> escape value to bypass the status check. I think that 0 would do the same
> wouldn't it?

ENOTCONN is issued by the reset handler. Sure, 0 will work as well, so
if the consense is that the reset handler should do that. see below.

> > -	if (nvmf_should_reconnect(ctrl)) {
> > +	if (recon && nvmf_should_reconnect(ctrl)) {
> 
> its weird to have a condition that checks _and_ condition
> of two different "should reconnect" conditions.
> 
> Why don't we simply pass the status to nvmf_should_reconnect()
> and evaluate there?

Sure, makes sense.

> >   static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > @@ -2295,7 +2300,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >   		return;
> >   	}
> > -	nvme_tcp_reconnect_or_remove(ctrl);
> > +	nvme_tcp_reconnect_or_remove(ctrl, -ENOTCONN);

Here is the source of ENOTCONN.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-10  6:52     ` Daniel Wagner
@ 2024-04-10 10:21       ` Sagi Grimberg
  2024-04-10 12:05         ` Hannes Reinecke
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-10 10:21 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel



On 10/04/2024 9:52, Daniel Wagner wrote:
> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>
>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> Any authentication errors which are generated internally are always
>>> non-retryable, so use negative error codes to ensure they are not
>>> retried.
>> The patch title says that any authentication error is not retryable, and
>> the patch body says "authentication errors which are generated locally
>> are non-retryable" so which one is it?
> Forgot to update the commit message. What about:
>
>    All authentication errors are non-retryable, so use negative error
>    codes to ensure they are not retried.
>
> ?

I have a question, what happens if nvmet updated its credentials (by the 
admin) and in the period until
the host got his credentials updated, it happens to 
disconnect/reconnect. It will see an authentication
error, so it will not retry and remove the controller altogether?

Sounds like an issue to me.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-10 10:21       ` Sagi Grimberg
@ 2024-04-10 12:05         ` Hannes Reinecke
  2024-04-10 13:50           ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2024-04-10 12:05 UTC (permalink / raw)
  To: Sagi Grimberg, Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, James Smart, linux-nvme,
	linux-kernel

On 4/10/24 12:21, Sagi Grimberg wrote:
> 
> 
> On 10/04/2024 9:52, Daniel Wagner wrote:
>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>
>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>> From: Hannes Reinecke <hare@suse.de>
>>>>
>>>> Any authentication errors which are generated internally are always
>>>> non-retryable, so use negative error codes to ensure they are not
>>>> retried.
>>> The patch title says that any authentication error is not retryable, and
>>> the patch body says "authentication errors which are generated locally
>>> are non-retryable" so which one is it?
>> Forgot to update the commit message. What about:
>>
>>    All authentication errors are non-retryable, so use negative error
>>    codes to ensure they are not retried.
>>
>> ?
> 
> I have a question, what happens if nvmet updated its credentials (by the 
> admin) and in the period until the host got his credentials updated, it
> happens to disconnect/reconnect. It will see an authentication
> error, so it will not retry and remove the controller altogether?
> 
> Sounds like an issue to me.

Usual thing: we cannot differentiate (on the host side) whether the
current PSK is _about_ to be replaced; how should the kernel
know that the admin will replace the PSK in the next minutes?

But that really is an issue with the standard. Currently there is no
way how a target could inform the initiator that the credentials have
been updated.

We would need to define a new status code for this.
In the meantime the safe operations model is to set a lifetime
for each PSK, and ensure that the PSK is updated on both sides
during the lifetime. With that there is a timeframe during which
both PSKs are available (on the target), and the older will expire
automatically once the lifetime limit is reached.

Cheers,

Hannes

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-10 12:05         ` Hannes Reinecke
@ 2024-04-10 13:50           ` Sagi Grimberg
  2024-04-11  7:11             ` Hannes Reinecke
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-10 13:50 UTC (permalink / raw)
  To: Hannes Reinecke, Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, James Smart, linux-nvme,
	linux-kernel



On 10/04/2024 15:05, Hannes Reinecke wrote:
> On 4/10/24 12:21, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>
>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>> From: Hannes Reinecke <hare@suse.de>
>>>>>
>>>>> Any authentication errors which are generated internally are always
>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>> retried.
>>>> The patch title says that any authentication error is not 
>>>> retryable, and
>>>> the patch body says "authentication errors which are generated locally
>>>> are non-retryable" so which one is it?
>>> Forgot to update the commit message. What about:
>>>
>>>    All authentication errors are non-retryable, so use negative error
>>>    codes to ensure they are not retried.
>>>
>>> ?
>>
>> I have a question, what happens if nvmet updated its credentials (by 
>> the admin) and in the period until the host got his credentials 
>> updated, it
>> happens to disconnect/reconnect. It will see an authentication
>> error, so it will not retry and remove the controller altogether?
>>
>> Sounds like an issue to me.
>
> Usual thing: we cannot differentiate (on the host side) whether the
> current PSK is _about_ to be replaced; how should the kernel
> know that the admin will replace the PSK in the next minutes?
>
> But that really is an issue with the standard. Currently there is no
> way how a target could inform the initiator that the credentials have
> been updated.

I'd say that the sane thing for the host to do in this case is to reconnect
until giving up with hope that it may work. This seems like a better 
approach
than to abruptly remove the controller no?

>
> We would need to define a new status code for this.
> In the meantime the safe operations model is to set a lifetime
> for each PSK, and ensure that the PSK is updated on both sides
> during the lifetime. With that there is a timeframe during which
> both PSKs are available (on the target), and the older will expire
> automatically once the lifetime limit is reached.

That is a good solution, and will also prevent a loss of service until
the host credentials are updated as well.

But regardless I have a feeling that simply removing the controller upon
an authentication error is not the right thing to do here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-10 13:50           ` Sagi Grimberg
@ 2024-04-11  7:11             ` Hannes Reinecke
  2024-04-11  8:37               ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2024-04-11  7:11 UTC (permalink / raw)
  To: Sagi Grimberg, Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, James Smart, linux-nvme,
	linux-kernel

On 4/10/24 15:50, Sagi Grimberg wrote:
> 
> 
> On 10/04/2024 15:05, Hannes Reinecke wrote:
>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>
>>>
>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>
>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>> From: Hannes Reinecke <hare@suse.de>
>>>>>>
>>>>>> Any authentication errors which are generated internally are always
>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>> retried.
>>>>> The patch title says that any authentication error is not 
>>>>> retryable, and
>>>>> the patch body says "authentication errors which are generated locally
>>>>> are non-retryable" so which one is it?
>>>> Forgot to update the commit message. What about:
>>>>
>>>>    All authentication errors are non-retryable, so use negative error
>>>>    codes to ensure they are not retried.
>>>>
>>>> ?
>>>
>>> I have a question, what happens if nvmet updated its credentials (by 
>>> the admin) and in the period until the host got his credentials 
>>> updated, it
>>> happens to disconnect/reconnect. It will see an authentication
>>> error, so it will not retry and remove the controller altogether?
>>>
>>> Sounds like an issue to me.
>>
>> Usual thing: we cannot differentiate (on the host side) whether the
>> current PSK is _about_ to be replaced; how should the kernel
>> know that the admin will replace the PSK in the next minutes?
>>
>> But that really is an issue with the standard. Currently there is no
>> way how a target could inform the initiator that the credentials have
>> been updated.
> 
> I'd say that the sane thing for the host to do in this case is to reconnect
> until giving up with hope that it may work. This seems like a better 
> approach
> than to abruptly remove the controller no?
> 
>>
>> We would need to define a new status code for this.
>> In the meantime the safe operations model is to set a lifetime
>> for each PSK, and ensure that the PSK is updated on both sides
>> during the lifetime. With that there is a timeframe during which
>> both PSKs are available (on the target), and the older will expire
>> automatically once the lifetime limit is reached.
> 
> That is a good solution, and will also prevent a loss of service until
> the host credentials are updated as well.
> 
> But regardless I have a feeling that simply removing the controller upon
> an authentication error is not the right thing to do here.

Guess what; that's what I tried to do initially. But then Christoph 
objected that we shouldn't generate NVMe status codes internally.
But if we can't do that then we'll have to invent yet another way to 
return a retryable error, leading to even more band-aid.
So I am not quite sure how we could achieve that, short of making 
_every_ error retryable...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 1/6] nvme: authentication error are always non-retryable
  2024-04-11  7:11             ` Hannes Reinecke
@ 2024-04-11  8:37               ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-11  8:37 UTC (permalink / raw)
  To: Hannes Reinecke, Daniel Wagner
  Cc: Christoph Hellwig, Keith Busch, James Smart, linux-nvme,
	linux-kernel



On 11/04/2024 10:11, Hannes Reinecke wrote:
> On 4/10/24 15:50, Sagi Grimberg wrote:
>>
>>
>> On 10/04/2024 15:05, Hannes Reinecke wrote:
>>> On 4/10/24 12:21, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 10/04/2024 9:52, Daniel Wagner wrote:
>>>>> On Tue, Apr 09, 2024 at 11:26:00PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> On 09/04/2024 12:35, Daniel Wagner wrote:
>>>>>>> From: Hannes Reinecke <hare@suse.de>
>>>>>>>
>>>>>>> Any authentication errors which are generated internally are always
>>>>>>> non-retryable, so use negative error codes to ensure they are not
>>>>>>> retried.
>>>>>> The patch title says that any authentication error is not 
>>>>>> retryable, and
>>>>>> the patch body says "authentication errors which are generated 
>>>>>> locally
>>>>>> are non-retryable" so which one is it?
>>>>> Forgot to update the commit message. What about:
>>>>>
>>>>>    All authentication errors are non-retryable, so use negative error
>>>>>    codes to ensure they are not retried.
>>>>>
>>>>> ?
>>>>
>>>> I have a question, what happens if nvmet updated its credentials 
>>>> (by the admin) and in the period until the host got his credentials 
>>>> updated, it
>>>> happens to disconnect/reconnect. It will see an authentication
>>>> error, so it will not retry and remove the controller altogether?
>>>>
>>>> Sounds like an issue to me.
>>>
>>> Usual thing: we cannot differentiate (on the host side) whether the
>>> current PSK is _about_ to be replaced; how should the kernel
>>> know that the admin will replace the PSK in the next minutes?
>>>
>>> But that really is an issue with the standard. Currently there is no
>>> way how a target could inform the initiator that the credentials have
>>> been updated.
>>
>> I'd say that the sane thing for the host to do in this case is to 
>> reconnect
>> until giving up with hope that it may work. This seems like a better 
>> approach
>> than to abruptly remove the controller no?
>>
>>>
>>> We would need to define a new status code for this.
>>> In the meantime the safe operations model is to set a lifetime
>>> for each PSK, and ensure that the PSK is updated on both sides
>>> during the lifetime. With that there is a timeframe during which
>>> both PSKs are available (on the target), and the older will expire
>>> automatically once the lifetime limit is reached.
>>
>> That is a good solution, and will also prevent a loss of service until
>> the host credentials are updated as well.
>>
>> But regardless I have a feeling that simply removing the controller upon
>> an authentication error is not the right thing to do here.
>
> Guess what; that's what I tried to do initially. But then Christoph 
> objected that we shouldn't generate NVMe status codes internally.
> But if we can't do that then we'll have to invent yet another way to 
> return a retryable error, leading to even more band-aid.
> So I am not quite sure how we could achieve that, short of making 
> _every_ error retryable...

So this whole thing is that you want to make the host to not reconnect 
if the controller
sent a DNR and reconnect otherwise?

What are you returning today if the authentication failed? Am I reading 
it right that you are
returning -ECONNREFUSED? I think that for the specific case of 
credentials mismatch (that and only
that) you may want to return -EKEYREJECTED. That according to the 
documentation (/* Key was rejected by service */)
is specific enough that perhaps we can treat it specially when asking 
"should I reconnect?"

Thoughts?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/6]  nvme-fabrics: short-circuit connect retries
  2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
                   ` (5 preceding siblings ...)
  2024-04-09  9:35 ` [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth() Daniel Wagner
@ 2024-04-12  0:35 ` Keith Busch
  2024-04-12  7:24   ` Daniel Wagner
  6 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2024-04-12  0:35 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel

On Tue, Apr 09, 2024 at 11:35:04AM +0200, Daniel Wagner wrote:
> The first patch returns only kernel error codes now and avoids overwriting error
> codes later. Thje newly introduced helper for deciding if a reconnect should be
> attempted is the only place where we have the logic (and documentation).
> 
> On the target side I've separate the nvme status from the dhchap status handling
> which made it a bit clearer. I was tempted to refactor the code in
> nvmet_execute_auth_send to avoid hitting the 80 chars limit but didn't came up
> with something nice yet. So let's keep this change at a minimum before any
> refactoring attempts.
> 
> I've tested with blktests and also an real hardware for nvme-fc.

Thanks, series applied to nvme-6.9.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 4/6] nvme-rdma: short-circuit reconnect retries
  2024-04-09 20:28   ` Sagi Grimberg
@ 2024-04-12  2:50     ` Keith Busch
  0 siblings, 0 replies; 34+ messages in thread
From: Keith Busch @ 2024-04-12  2:50 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Daniel Wagner, Christoph Hellwig, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel, Chaitanya Kulkarni

On Tue, Apr 09, 2024 at 11:28:04PM +0300, Sagi Grimberg wrote:
> On 09/04/2024 12:35, Daniel Wagner wrote:
> > 
> > Returning an nvme status from nvme_rdma_setup_ctrl() indicates that the
> > association was established and we have received a status from the
> > controller; consequently we should honour
> honor ?

King's English vs. Freedom English? Whichever flavour or color you like,
both are fine here!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/6]  nvme-fabrics: short-circuit connect retries
  2024-04-12  0:35 ` [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Keith Busch
@ 2024-04-12  7:24   ` Daniel Wagner
  2024-04-12 15:24     ` Keith Busch
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Wagner @ 2024-04-12  7:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel

On Thu, Apr 11, 2024 at 06:35:25PM -0600, Keith Busch wrote:
> On Tue, Apr 09, 2024 at 11:35:04AM +0200, Daniel Wagner wrote:
> > The first patch returns only kernel error codes now and avoids overwriting error
> > codes later. Thje newly introduced helper for deciding if a reconnect should be
> > attempted is the only place where we have the logic (and documentation).
> > 
> > On the target side I've separate the nvme status from the dhchap status handling
> > which made it a bit clearer. I was tempted to refactor the code in
> > nvmet_execute_auth_send to avoid hitting the 80 chars limit but didn't came up
> > with something nice yet. So let's keep this change at a minimum before any
> > refactoring attempts.
> > 
> > I've tested with blktests and also an real hardware for nvme-fc.
> 
> Thanks, series applied to nvme-6.9.

Thanks! I have an updated version here which addresses some of Sagi's
feedback, e.g. using only one helper function. Sorry I didn't send out
it earlier, I got a bit side tracked in testing because of the 'funky'
results with RDMA.

Do you want me to send a complete fresh series or patches on top of this
series? I'm fine either way.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/6]  nvme-fabrics: short-circuit connect retries
  2024-04-12  7:24   ` Daniel Wagner
@ 2024-04-12 15:24     ` Keith Busch
  2024-04-14  8:34       ` Sagi Grimberg
  0 siblings, 1 reply; 34+ messages in thread
From: Keith Busch @ 2024-04-12 15:24 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Sagi Grimberg, James Smart, Hannes Reinecke,
	linux-nvme, linux-kernel

On Fri, Apr 12, 2024 at 09:24:04AM +0200, Daniel Wagner wrote:
> On Thu, Apr 11, 2024 at 06:35:25PM -0600, Keith Busch wrote:
> > On Tue, Apr 09, 2024 at 11:35:04AM +0200, Daniel Wagner wrote:
> > > The first patch returns only kernel error codes now and avoids overwriting error
> > > codes later. Thje newly introduced helper for deciding if a reconnect should be
> > > attempted is the only place where we have the logic (and documentation).
> > > 
> > > On the target side I've separate the nvme status from the dhchap status handling
> > > which made it a bit clearer. I was tempted to refactor the code in
> > > nvmet_execute_auth_send to avoid hitting the 80 chars limit but didn't came up
> > > with something nice yet. So let's keep this change at a minimum before any
> > > refactoring attempts.
> > > 
> > > I've tested with blktests and also an real hardware for nvme-fc.
> > 
> > Thanks, series applied to nvme-6.9.
> 
> Thanks! I have an updated version here which addresses some of Sagi's
> feedback, e.g. using only one helper function. Sorry I didn't send out
> it earlier, I got a bit side tracked in testing because of the 'funky'
> results with RDMA.
> 
> Do you want me to send a complete fresh series or patches on top of this
> series? I'm fine either way.

Oh sorry, I didn't notice the discussion carried on after the "review"
tag. Please send me the update, I'll force push.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries
  2024-04-12 15:24     ` Keith Busch
@ 2024-04-14  8:34       ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2024-04-14  8:34 UTC (permalink / raw)
  To: Keith Busch, Daniel Wagner
  Cc: Christoph Hellwig, James Smart, Hannes Reinecke, linux-nvme,
	linux-kernel



On 12/04/2024 18:24, Keith Busch wrote:
> On Fri, Apr 12, 2024 at 09:24:04AM +0200, Daniel Wagner wrote:
>> On Thu, Apr 11, 2024 at 06:35:25PM -0600, Keith Busch wrote:
>>> On Tue, Apr 09, 2024 at 11:35:04AM +0200, Daniel Wagner wrote:
>>>> The first patch returns only kernel error codes now and avoids overwriting error
>>>> codes later. Thje newly introduced helper for deciding if a reconnect should be
>>>> attempted is the only place where we have the logic (and documentation).
>>>>
>>>> On the target side I've separate the nvme status from the dhchap status handling
>>>> which made it a bit clearer. I was tempted to refactor the code in
>>>> nvmet_execute_auth_send to avoid hitting the 80 chars limit but didn't came up
>>>> with something nice yet. So let's keep this change at a minimum before any
>>>> refactoring attempts.
>>>>
>>>> I've tested with blktests and also an real hardware for nvme-fc.
>>> Thanks, series applied to nvme-6.9.
>> Thanks! I have an updated version here which addresses some of Sagi's
>> feedback, e.g. using only one helper function. Sorry I didn't send out
>> it earlier, I got a bit side tracked in testing because of the 'funky'
>> results with RDMA.
>>
>> Do you want me to send a complete fresh series or patches on top of this
>> series? I'm fine either way.
> Oh sorry, I didn't notice the discussion carried on after the "review"
> tag. Please send me the update, I'll force push.

I think that what we want is to have a special handling in the very specific
connect failure when the host/ctrl credentials mismatch, because out of all
the reasons to a connection failure, this _could_ be transient.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-04-14  8:35 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-09  9:35 [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Daniel Wagner
2024-04-09  9:35 ` [PATCH v5 1/6] nvme: authentication error are always non-retryable Daniel Wagner
2024-04-09 13:59   ` Christoph Hellwig
2024-04-09 20:16   ` Sagi Grimberg
2024-04-09 20:26   ` Sagi Grimberg
2024-04-10  6:52     ` Daniel Wagner
2024-04-10 10:21       ` Sagi Grimberg
2024-04-10 12:05         ` Hannes Reinecke
2024-04-10 13:50           ` Sagi Grimberg
2024-04-11  7:11             ` Hannes Reinecke
2024-04-11  8:37               ` Sagi Grimberg
2024-04-09  9:35 ` [PATCH v5 2/6] nvmet: lock config semaphore when accessing DH-HMAC-CHAP key Daniel Wagner
2024-04-09  9:35 ` [PATCH v5 3/6] nvme-tcp: short-circuit reconnect retries Daniel Wagner
2024-04-09 13:59   ` Christoph Hellwig
2024-04-09 20:20   ` Sagi Grimberg
2024-04-10  6:01     ` Hannes Reinecke
2024-04-09 20:40   ` Sagi Grimberg
2024-04-10 10:06     ` Daniel Wagner
2024-04-09  9:35 ` [PATCH v5 4/6] nvme-rdma: " Daniel Wagner
2024-04-09 14:00   ` Christoph Hellwig
2024-04-09 14:19     ` Keith Busch
2024-04-09 20:21       ` Sagi Grimberg
2024-04-09 20:28   ` Sagi Grimberg
2024-04-12  2:50     ` Keith Busch
2024-04-09  9:35 ` [PATCH v5 5/6] nvme-fc: use nvme_ctrl_reconnect to decide reconnect attempts Daniel Wagner
2024-04-09 14:01   ` Christoph Hellwig
2024-04-09  9:35 ` [PATCH v5 6/6] nvmet: return DHCHAP status codes from nvmet_setup_auth() Daniel Wagner
2024-04-09 14:03   ` Christoph Hellwig
2024-04-09 20:23   ` Sagi Grimberg
2024-04-10  6:45     ` Hannes Reinecke
2024-04-12  0:35 ` [PATCH v5 0/6] nvme-fabrics: short-circuit connect retries Keith Busch
2024-04-12  7:24   ` Daniel Wagner
2024-04-12 15:24     ` Keith Busch
2024-04-14  8:34       ` Sagi Grimberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox