public inbox for linux-nvme@lists.infradead.org
* [PATCH v3 0/3] nvme-fc: fix race with connectivity loss and nvme_fc_create_association
@ 2024-11-29  9:28 Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 1/3] nvme-fc: go straight to connecting state when initializing Daniel Wagner
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Daniel Wagner @ 2024-11-29  9:28 UTC (permalink / raw)
  To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel, Daniel Wagner

After a long hard stare at the keep alive machinery I am convinced we
need to trigger a reset when nvme_keep_alive_end_io is called with status
!= 0. There is a lengthy explanation in patch #3.

I've also tested this version with blktests and some manual tests, though
it's not that easy to reproduce the exact sequence reported by Paul.

Daniel

previous cover letter:

We got a bug report that a controller was stuck in the connected state
after an association dropped.

It turns out that nvme_fc_create_association can succeed even though some
operations fail. This is intentional, to handle the degraded controller
case, where the admin queue is up and running but the IO queues are not.
In this case the controller will still reach the LIVE state.

Unfortunately, this also ignores full connectivity loss for fabric
controllers. Let's address this by not filtering out all errors in
nvme_set_queue_count.
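For illustration, the intended filtering behavior can be sketched in plain
userspace C. The helper name is made up for this sketch and the constant
value is only illustrative; the real definitions live in the kernel tree:

```c
#include <assert.h>

/* Illustrative value; the real NVME_SC_HOST_PATH_ERROR is defined in
 * include/linux/nvme.h. */
#define NVME_SC_HOST_PATH_ERROR 0x370

/*
 * Sketch of the fixed filtering in nvme_set_queue_count: negative
 * values are kernel errors, and a host path error means connectivity
 * is gone, so both are propagated. Any other positive NVMe status is
 * treated as a degraded controller: report success, but with zero
 * IO queues.
 */
static int set_queue_count_sketch(int status, int *count)
{
	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
		return status;	/* enter the error code path */
	if (status > 0)
		*count = 0;	/* degraded controller, no IO queues */
	return 0;
}
```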

---
Changes in v3:
- collected reviewed tags
- added nvme_ctrl_reset to keep alive end io handler
- Link to v2: https://lore.kernel.org/r/20241029-nvme-fc-handle-com-lost-v2-0-5b0d137e2a0a@kernel.org

Changes in v2:
- handle connection lost in nvme_set_queue_count directly
- collected reviewed tags
- Link to v1: https://lore.kernel.org/r/20240611190647.11856-1-dwagner@suse.de

---
Daniel Wagner (3):
      nvme-fc: go straight to connecting state when initializing
      nvme: trigger reset when keep alive fails
      nvme: handle connectivity loss in nvme_set_queue_count

 drivers/nvme/host/core.c | 13 ++++++++++++-
 drivers/nvme/host/fc.c   |  3 +--
 2 files changed, 13 insertions(+), 3 deletions(-)
---
base-commit: 029cc98dec2eadb5d0978b5fea9ae6c427f2a020
change-id: 20241029-nvme-fc-handle-com-lost-9b241936809a

Best regards,
-- 
Daniel Wagner <wagi@kernel.org>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/3] nvme-fc: go straight to connecting state when initializing
  2024-11-29  9:28 [PATCH v3 0/3] nvme-fc: fix race with connectivity loss and nvme_fc_create_association Daniel Wagner
@ 2024-11-29  9:28 ` Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count Daniel Wagner
  2 siblings, 0 replies; 18+ messages in thread
From: Daniel Wagner @ 2024-11-29  9:28 UTC (permalink / raw)
  To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel, Daniel Wagner

The initial controller initialization mimics the reconnect loop
behavior by switching from NEW to RESETTING and then to CONNECTING.

The transition from NEW to CONNECTING is a valid transition, so there is
no point entering the RESETTING state. TCP and RDMA also transition
directly to the CONNECTING state.
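The transition rules involved can be modeled as a small table. This is a
simplified sketch of the checks nvme_change_ctrl_state performs in
drivers/nvme/host/core.c, reduced to the states discussed here (the real
table covers more states and more source states per target):

```c
#include <assert.h>
#include <stdbool.h>

enum ctrl_state_sketch { ST_NEW, ST_RESETTING, ST_CONNECTING, ST_LIVE };

/* Reduced model of the controller state machine: is the transition
 * from 'from' to 'to' permitted? */
static bool change_state_ok(enum ctrl_state_sketch from,
			    enum ctrl_state_sketch to)
{
	switch (to) {
	case ST_CONNECTING:
		/* NEW -> CONNECTING is valid, so RESETTING can be skipped */
		return from == ST_NEW || from == ST_RESETTING;
	case ST_RESETTING:
		return from == ST_NEW || from == ST_LIVE;
	case ST_LIVE:
		return from == ST_CONNECTING;
	default:
		return false;
	}
}
```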

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/fc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index b81af7919e94c421387033bf8361a9cf8a867486..d45ab530ff9b7bd03bc311474278fc840f8786d5 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3579,8 +3579,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	list_add_tail(&ctrl->ctrl_list, &rport->ctrl_list);
 	spin_unlock_irqrestore(&rport->lock, flags);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING) ||
-	    !nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
 		dev_err(ctrl->ctrl.device,
 			"NVME-FC{%d}: failed to init ctrl state\n", ctrl->cnum);
 		goto fail_ctrl;

-- 
2.47.0




* [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2024-11-29  9:28 [PATCH v3 0/3] nvme-fc: fix race with connectivity loss and nvme_fc_create_association Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 1/3] nvme-fc: go straight to connecting state when initializing Daniel Wagner
@ 2024-11-29  9:28 ` Daniel Wagner
  2024-11-29 11:09   ` Hannes Reinecke
                     ` (2 more replies)
  2024-11-29  9:28 ` [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count Daniel Wagner
  2 siblings, 3 replies; 18+ messages in thread
From: Daniel Wagner @ 2024-11-29  9:28 UTC (permalink / raw)
  To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel, Daniel Wagner

nvme_keep_alive_work sets up a keep alive command and uses
blk_execute_rq_nowait to send out the command in an asynchronous
manner. Eventually, nvme_keep_alive_end_io is called. If the status
argument is 0, a new keep alive is sent out. When the status argument is
not 0, only an error is logged. The keep alive machinery does not
trigger the error recovery.

The FC driver relies on the keep alive machinery to trigger recovery
when an error is detected. Whenever an error happens during the creation
of the association, the idea is that the operation is aborted and retried
later. Though there is a window where an error happens and
nvme_fc_create_association can't detect the error.

         1) nvme nvme10: NVME-FC{10}: create association : ...
         2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
            nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
         3) nvme nvme10: Could not set queue count (880)
            nvme nvme10: Failed to configure AEN (cfg 900)
         4) nvme nvme10: NVME-FC{10}: controller connect complete
         5) nvme nvme10: failed nvme_keep_alive_end_io error=4

A connection attempt starts 1) and the ctrl is in state CONNECTING.
Shortly after the LLDD driver detects a connection lost event and calls
nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
state, this event is ignored.

nvme_fc_create_association continues to run in parallel and tries to
communicate with the controller, and those commands fail. Though these
errors are filtered out, e.g. in 3) setting the I/O queue count fails,
which leads to an early exit in nvme_fc_create_io_queues. Because the
number of IO queues is 0 at this point, there is nothing left in
nvme_fc_create_association which could detect the connection drop.
Thus the ctrl enters LIVE state 4).

The keep alive timer fires and a keep alive command is sent off but
gets rejected by nvme_fc_queue_rq, and the rq status is set to
NVME_SC_HOST_PATH_ERROR. The nvme status is then mapped to the block
layer status BLK_STS_TRANSPORT/4 in nvme_end_req. Eventually,
nvme_keep_alive_end_io sees the status != 0 and just logs an error 5).
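That mapping chain can be modeled in a few lines of plain C. The numeric
values match what the log shows, but the function name and the reduced
switch below are illustrative; the real mapping is done by
nvme_error_status() in drivers/nvme/host/core.c:

```c
#include <assert.h>

/* Illustrative values; the real definitions live in
 * include/linux/blk_types.h and include/linux/nvme.h. */
#define BLK_STS_OK		0
#define BLK_STS_TRANSPORT	4
#define BLK_STS_IOERR		10
#define NVME_SC_HOST_PATH_ERROR	0x370

/* Reduced sketch of the nvme status -> block layer status mapping:
 * a host path error becomes BLK_STS_TRANSPORT, which is the
 * "error=4" seen in log line 5). */
static int nvme_error_status_sketch(int nvme_status)
{
	switch (nvme_status) {
	case 0:
		return BLK_STS_OK;
	case NVME_SC_HOST_PATH_ERROR:
		return BLK_STS_TRANSPORT;
	default:
		return BLK_STS_IOERR;
	}
}
```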

We should obviously detect the problem in 3) and abort there (this will
be addressed later), but that still leaves a race window open in
nvme_fc_create_association between starting the IO queues and setting
the ctrl state to LIVE.

Thus trigger a reset from the keep alive handler when an error is
reported.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
 		dev_err(ctrl->device,
 			"failed nvme_keep_alive_end_io error=%d\n",
 				status);
+		/*
+		 * The driver reports that we lost the connection;
+		 * trigger a recovery.
+		 */
+		if (status == BLK_STS_TRANSPORT)
+			nvme_reset_ctrl(ctrl);
 		return RQ_END_IO_NONE;
 	}
 

-- 
2.47.0




* [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-11-29  9:28 [PATCH v3 0/3] nvme-fc: fix race with connectivity loss and nvme_fc_create_association Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 1/3] nvme-fc: go straight to connecting state when initializing Daniel Wagner
  2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
@ 2024-11-29  9:28 ` Daniel Wagner
  2024-11-29 11:10   ` Hannes Reinecke
  2024-12-24 10:39   ` Sagi Grimberg
  2 siblings, 2 replies; 18+ messages in thread
From: Daniel Wagner @ 2024-11-29  9:28 UTC (permalink / raw)
  To: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel, Daniel Wagner

When the set features attempt fails with any NVMe status code in
nvme_set_queue_count, the function still reports success, though the
number of queues is set to 0. This is done to support controllers in a
degraded state (the admin queue is still up and running but there are
no IO queues).

There is an exception, though. When nvme_set_features reports a host
path error, nvme_set_queue_count should propagate this error because
connectivity is lost, which means the admin queue is not working
anymore either.

Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
---
 drivers/nvme/host/core.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2a07c2c540b26c8cbe886711abaf6f0afbe6c4df..aa2a7c7d4d0ae7bd1f7fb704e55c0a8d724b9d56 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1677,7 +1677,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 
 	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
-	if (status < 0)
+	/*
+	 * It's either a kernel error or the host observed a connection
+	 * loss. In either case it's not possible to communicate with the
+	 * controller and thus we enter the error code path.
+	 */
+	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
 		return status;
 
 	/*

-- 
2.47.0




* Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
@ 2024-11-29 11:09   ` Hannes Reinecke
  2024-12-09 13:36   ` Christoph Hellwig
  2024-12-24 10:31   ` Sagi Grimberg
  2 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2024-11-29 11:09 UTC (permalink / raw)
  To: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely
  Cc: linux-nvme, linux-kernel

On 11/29/24 10:28, Daniel Wagner wrote:
> nvme_keep_alive_work setups a keep alive command and uses
> blk_execute_rq_nowait to send out the command in an asynchronously
> manner. Eventually, nvme_keep_alive_end_io is called. If the status
> argument is 0, a new keep alive is send out. When the status argument is
> not 0, only an error is logged. The keep alive machinery does not
> trigger the error recovery.
> 
> The FC driver is relying on the keep alive machinery to trigger recovery
> when an error is detected. Whenever an error happens during the creation
> of the association the idea is that the operation is aborted and retried
> later. Though there is a window where an error happens and
> nvme_fc_create_assocation can't detect the error.
> 
>           1) nvme nvme10: NVME-FC{10}: create association : ...
>           2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
>              nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
>           3) nvme nvme10: Could not set queue count (880)
>              nvme nvme10: Failed to configure AEN (cfg 900)
>           4) nvme nvme10: NVME-FC{10}: controller connect complete
>           5) nvme nvme10: failed nvme_keep_alive_end_io error=4
> 
> A connection attempt starts 1) and the ctrl is in state CONNECTING.
> Shortly after the LLDD driver detects a connection lost event and calls
> nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
> state, this event is ignored.
> 
> nvme_fc_create_association continues to run in parallel and tries to
> communicate with the controller and those commands fail. Though these
> errors are filtered out, e.g in 3) setting the I/O queues numbers fails
> which leads to an early exit in nvme_fc_create_io_queues. Because the
> number of IO queues is 0 at this point, there is nothing left in
> nvme_fc_create_association which could detected the connection drop.
> Thus the ctrl enters LIVE state 4).
> 
> The keep alive timer fires and a keep alive command is send off but
> gets rejected by nvme_fc_queue_rq and the rq status is set to
> NVME_SC_HOST_PATH_ERROR. The nvme status is then mapped to a block layer
> status BLK_STS_TRANSPORT/4 in nvme_end_req. Eventually,
> nvme_keep_alive_end_io sees the status != 0 and just logs an error 5).
> 
> We should obviously detect the problem in 3) and abort there (will
> address this later), but that still leaves a race window open. There is
> a race window open in nvme_fc_create_association after starting the IO
> queues and setting the ctrl state to LIVE.
> 
> Thus trigger a reset from the keep alive handler when an error is
> reported.
> 
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/core.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>   		dev_err(ctrl->device,
>   			"failed nvme_keep_alive_end_io error=%d\n",
>   				status);
> +		/*
> +		 * The driver reports that we lost the connection,
> +		 * trigger a recovery.
> +		 */
> +		if (status == BLK_STS_TRANSPORT)
> +			nvme_reset_ctrl(ctrl);
>   		return RQ_END_IO_NONE;
>   	}
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-11-29  9:28 ` [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count Daniel Wagner
@ 2024-11-29 11:10   ` Hannes Reinecke
  2024-12-17  8:35     ` Daniel Wagner
  2024-12-24 10:39   ` Sagi Grimberg
  1 sibling, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2024-11-29 11:10 UTC (permalink / raw)
  To: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely
  Cc: linux-nvme, linux-kernel

On 11/29/24 10:28, Daniel Wagner wrote:
> When the set feature attempts fails with any NVME status code set in
> nvme_set_queue_count, the function still report success. Though the
> numbers of queues set to 0. This is done to support controllers in
> degraded state (the admin queue is still up and running but no IO
> queues).
> 
> Though there is an exception. When nvme_set_features reports an host
> path error, nvme_set_queue_count should propagate this error as the
> connectivity is lost, which means also the admin queue is not working
> anymore.
> 
> Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/core.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 2a07c2c540b26c8cbe886711abaf6f0afbe6c4df..aa2a7c7d4d0ae7bd1f7fb704e55c0a8d724b9d56 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1677,7 +1677,12 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   
>   	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
>   			&result);
> -	if (status < 0)
> +	/*
> +	 * It's either a kernel error or the host observed a connection
> +	 * lost. In either case it's not possible communicate with the
> +	 * controller and thus enter the error code path.
> +	 */
> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>   		return status;
>   
>   	/*
> 
Hmm. Maybe checking for NVME_SC_DNR, too?

Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
  2024-11-29 11:09   ` Hannes Reinecke
@ 2024-12-09 13:36   ` Christoph Hellwig
  2024-12-24 10:31   ` Sagi Grimberg
  2 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2024-12-09 13:36 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: James Smart, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	Hannes Reinecke, Paul Ely, linux-nvme, linux-kernel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>




* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-11-29 11:10   ` Hannes Reinecke
@ 2024-12-17  8:35     ` Daniel Wagner
  2024-12-17  9:45       ` Hannes Reinecke
  2024-12-24 10:35       ` Sagi Grimberg
  0 siblings, 2 replies; 18+ messages in thread
From: Daniel Wagner @ 2024-12-17  8:35 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely, linux-nvme, linux-kernel

On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > +	/*
> > +	 * It's either a kernel error or the host observed a connection
> > +	 * lost. In either case it's not possible communicate with the
> > +	 * controller and thus enter the error code path.
> > +	 */
> > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> >   		return status;
> >   	/*
> > 
> Hmm. Maybe checking for NVME_SC_DNR, too?

if no one complains I'll update the check to:

	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
	    status == NVME_SC_HOST_PATH_ERROR)
		return status;

okay?



* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-12-17  8:35     ` Daniel Wagner
@ 2024-12-17  9:45       ` Hannes Reinecke
  2024-12-17 14:01         ` Daniel Wagner
  2024-12-24 10:35       ` Sagi Grimberg
  1 sibling, 1 reply; 18+ messages in thread
From: Hannes Reinecke @ 2024-12-17  9:45 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely, linux-nvme, linux-kernel

On 12/17/24 09:35, Daniel Wagner wrote:
> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>> +	/*
>>> +	 * It's either a kernel error or the host observed a connection
>>> +	 * lost. In either case it's not possible communicate with the
>>> +	 * controller and thus enter the error code path.
>>> +	 */
>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>    		return status;
>>>    	/*
>>>
>> Hmm. Maybe checking for NVME_SC_DNR, too?
> 
> if no one complains I'll update the check to:
> 
> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> 	    status == NVME_SC_HOST_PATH_ERROR)
> 		return status;
> 
> okay?

Which really cries out for a helper. But otherwise okay.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-12-17  9:45       ` Hannes Reinecke
@ 2024-12-17 14:01         ` Daniel Wagner
  2024-12-20  8:32           ` Hannes Reinecke
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2024-12-17 14:01 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely, linux-nvme, linux-kernel

On Tue, Dec 17, 2024 at 10:45:45AM +0100, Hannes Reinecke wrote:
> On 12/17/24 09:35, Daniel Wagner wrote:
> > On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > > > +	/*
> > > > +	 * It's either a kernel error or the host observed a connection
> > > > +	 * lost. In either case it's not possible communicate with the
> > > > +	 * controller and thus enter the error code path.
> > > > +	 */
> > > > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> > > >    		return status;
> > > >    	/*
> > > > 
> > > Hmm. Maybe checking for NVME_SC_DNR, too?
> > 
> > if no one complains I'll update the check to:
> > 
> > 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> > 	    status == NVME_SC_HOST_PATH_ERROR)
> > 		return status;
> > 
> > okay?
> 
> Which really cries out for a helper. But otherwise okay.

As far as I remember, Christoph is not a big fan of single-use tiny
helper functions. Anyway, what about this:

+static bool nvme_num_queue_updated(int status)
+{
+       /*
+        * It's either a kernel error or the host observed a connection
+        * loss. In either case it's not possible to communicate with the
+        * controller and thus we enter the error code path.
+        */
+
+       if (status < 0 || status & NVME_STATUS_DNR ||
+           status == NVME_SC_HOST_PATH_ERROR)
+               return false;
+
+       return true;
+}
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
        u32 q_count = (*count - 1) | ((*count - 1) << 16);
@@ -1701,13 +1716,8 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)

        status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
                        &result);
-       /*
-        * It's either a kernel error or the host observed a connection
-        * lost. In either case it's not possible communicate with the
-        * controller and thus enter the error code path.
-        */
-       if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
-           status == NVME_SC_HOST_PATH_ERROR)
+
+       if (!nvme_num_queue_updated(status))
                return status;

        /*



* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-12-17 14:01         ` Daniel Wagner
@ 2024-12-20  8:32           ` Hannes Reinecke
  0 siblings, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2024-12-20  8:32 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Paul Ely, linux-nvme, linux-kernel

On 12/17/24 15:01, Daniel Wagner wrote:
> On Tue, Dec 17, 2024 at 10:45:45AM +0100, Hannes Reinecke wrote:
>> On 12/17/24 09:35, Daniel Wagner wrote:
>>> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>>>> +	/*
>>>>> +	 * It's either a kernel error or the host observed a connection
>>>>> +	 * lost. In either case it's not possible communicate with the
>>>>> +	 * controller and thus enter the error code path.
>>>>> +	 */
>>>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>>>     		return status;
>>>>>     	/*
>>>>>
>>>> Hmm. Maybe checking for NVME_SC_DNR, too?
>>>
>>> if no one complains I'll update the check to:
>>>
>>> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
>>> 	    status == NVME_SC_HOST_PATH_ERROR)
>>> 		return status;
>>>
>>> okay?
>>
>> Which really cries out for a helper. But otherwise okay.
> 
> As far I remember, Christoph is not a big fan of single users of tiny
> helper functions. Anyway, what about this:
> 
> +static bool nvme_num_queue_updated(status)
> +{
> +       /*
> +        * It's either a kernel error or the host observed a connection
> +        * lost. In either case it's not possible communicate with the
> +        * controller and thus enter the error code path.
> +        */
> +
> +       if (status < 0 || status & NVME_STATUS_DNR ||
> +           status == NVME_SC_HOST_PATH_ERROR)
> +               return false;
> +
> +       return true;
> +}
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>          u32 q_count = (*count - 1) | ((*count - 1) << 16);
> @@ -1701,13 +1716,8 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
> 
>          status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
>                          &result);
> -       /*
> -        * It's either a kernel error or the host observed a connection
> -        * lost. In either case it's not possible communicate with the
> -        * controller and thus enter the error code path.
> -        */
> -       if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> -           status == NVME_SC_HOST_PATH_ERROR)
> +
> +       if (!nvme_num_queue_updated(status))
>                  return status;
> 
>          /*

Naming is terrible. The helper does not check whether the number of
queues was updated, so why the name?

Better keep it inline, then.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
  2024-11-29 11:09   ` Hannes Reinecke
  2024-12-09 13:36   ` Christoph Hellwig
@ 2024-12-24 10:31   ` Sagi Grimberg
  2025-01-07 14:38     ` Daniel Wagner
  2 siblings, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2024-12-24 10:31 UTC (permalink / raw)
  To: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel




On 29/11/2024 11:28, Daniel Wagner wrote:
> nvme_keep_alive_work setups a keep alive command and uses
> blk_execute_rq_nowait to send out the command in an asynchronously
> manner. Eventually, nvme_keep_alive_end_io is called. If the status
> argument is 0, a new keep alive is send out. When the status argument is
> not 0, only an error is logged. The keep alive machinery does not
> trigger the error recovery.
>
> The FC driver is relying on the keep alive machinery to trigger recovery
> when an error is detected. Whenever an error happens during the creation
> of the association the idea is that the operation is aborted and retried
> later. Though there is a window where an error happens and
> nvme_fc_create_assocation can't detect the error.
>
>           1) nvme nvme10: NVME-FC{10}: create association : ...
>           2) nvme nvme10: NVME-FC{10}: controller connectivity lost. Awaiting Reconnect
>              nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
>           3) nvme nvme10: Could not set queue count (880)
>              nvme nvme10: Failed to configure AEN (cfg 900)
>           4) nvme nvme10: NVME-FC{10}: controller connect complete
>           5) nvme nvme10: failed nvme_keep_alive_end_io error=4
>
> A connection attempt starts 1) and the ctrl is in state CONNECTING.
> Shortly after the LLDD driver detects a connection lost event and calls
> nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
> state, this event is ignored.
>
> nvme_fc_create_association continues to run in parallel and tries to
> communicate with the controller and those commands fail. Though these
> errors are filtered out, e.g in 3) setting the I/O queues numbers fails
> which leads to an early exit in nvme_fc_create_io_queues. Because the
> number of IO queues is 0 at this point, there is nothing left in
> nvme_fc_create_association which could detected the connection drop.
> Thus the ctrl enters LIVE state 4).
>
> The keep alive timer fires and a keep alive command is send off but
> gets rejected by nvme_fc_queue_rq and the rq status is set to
> NVME_SC_HOST_PATH_ERROR. The nvme status is then mapped to a block layer
> status BLK_STS_TRANSPORT/4 in nvme_end_req. Eventually,
> nvme_keep_alive_end_io sees the status != 0 and just logs an error 5).
>
> We should obviously detect the problem in 3) and abort there (will
> address this later), but that still leaves a race window open. There is
> a race window open in nvme_fc_create_association after starting the IO
> queues and setting the ctrl state to LIVE.
>
> Thus trigger a reset from the keep alive handler when an error is
> reported.
>
> Signed-off-by: Daniel Wagner <wagi@kernel.org>
> ---
>   drivers/nvme/host/core.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>   		dev_err(ctrl->device,
>   			"failed nvme_keep_alive_end_io error=%d\n",
>   				status);
> +		/*
> +		 * The driver reports that we lost the connection,
> +		 * trigger a recovery.
> +		 */
> +		if (status == BLK_STS_TRANSPORT)
> +			nvme_reset_ctrl(ctrl);
>   		return RQ_END_IO_NONE;
>   	}
>   
>

A lengthy explanation that results in nvme core behavior that assumes a 
very specific driver
behavior. Isn't the root of the problem that FC is willing to live 
peacefully with a controller
without any queues/connectivity to it without periodically reconnecting?



* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-12-17  8:35     ` Daniel Wagner
  2024-12-17  9:45       ` Hannes Reinecke
@ 2024-12-24 10:35       ` Sagi Grimberg
  2025-01-07 14:40         ` Daniel Wagner
  1 sibling, 1 reply; 18+ messages in thread
From: Sagi Grimberg @ 2024-12-24 10:35 UTC (permalink / raw)
  To: Daniel Wagner, Hannes Reinecke
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Paul Ely, linux-nvme, linux-kernel




On 17/12/2024 10:35, Daniel Wagner wrote:
> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>> +	/*
>>> +	 * It's either a kernel error or the host observed a connection
>>> +	 * lost. In either case it's not possible communicate with the
>>> +	 * controller and thus enter the error code path.
>>> +	 */
>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>    		return status;
>>>    	/*
>>>
>> Hmm. Maybe checking for NVME_SC_DNR, too?
> if no one complains I'll update the check to:
>
> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> 	    status == NVME_SC_HOST_PATH_ERROR)
> 		return status;
>
> okay?
Why do we care about the DNR? are you going to retry based on it?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-11-29  9:28 ` [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count Daniel Wagner
  2024-11-29 11:10   ` Hannes Reinecke
@ 2024-12-24 10:39   ` Sagi Grimberg
  1 sibling, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2024-12-24 10:39 UTC (permalink / raw)
  To: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Hannes Reinecke, Paul Ely
  Cc: linux-nvme, linux-kernel

Makes sense to me,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2024-12-24 10:31   ` Sagi Grimberg
@ 2025-01-07 14:38     ` Daniel Wagner
  2025-01-08 10:50       ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-07 14:38 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Hannes Reinecke, Paul Ely, linux-nvme, linux-kernel

On Tue, Dec 24, 2024 at 12:31:35PM +0200, Sagi Grimberg wrote:
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
> >   		dev_err(ctrl->device,
> >   			"failed nvme_keep_alive_end_io error=%d\n",
> >   				status);
> > +		/*
> > +		 * The driver reports that we lost the connection;
> > +		 * trigger a recovery.
> > +		 */
> > +		if (status == BLK_STS_TRANSPORT)
> > +			nvme_reset_ctrl(ctrl);
> >   		return RQ_END_IO_NONE;
> >   	}
> > 
> 
> A lengthy explanation that results in nvme core behavior that assumes a very
> specific driver behavior.

I tried to explain exactly what's going on, so we can discuss possible
solutions without talking past each other.

In the meantime I started on a patch set for the TP4129 related changes
in the spec (KATO Corrections and Clarifications). These changes would
also depend on the KATO timeout handler triggering a reset.

I am fine with dropping this change for now and discussing it in light
of TP4129, if that is what you prefer.

> Isn't the root of the problem that FC is willing to live peacefully
> with a controller without any queues/connectivity to it without
> periodically reconnecting?

The root problem is that the connectivity loss event gets ignored in the
CONNECTING state during the first connection attempt. Everything works
fine in the RECONNECTING state.

Maybe something like this instead? (untested)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index c4cbe3ce81f7..1f1d1d62a978 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -148,6 +148,7 @@ struct nvme_fc_rport {
 #define ASSOC_ACTIVE		0
 #define ASSOC_FAILED		1
 #define FCCTRL_TERMIO		2
+#define CONNECTIVITY_LOST	3

 struct nvme_fc_ctrl {
 	spinlock_t		lock;
@@ -785,6 +786,8 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 		"NVME-FC{%d}: controller connectivity lost. Awaiting "
 		"Reconnect", ctrl->cnum);

+	set_bit(CONNECTIVITY_LOST, &ctrl->flags);
+
 	switch (nvme_ctrl_state(&ctrl->ctrl)) {
 	case NVME_CTRL_NEW:
 	case NVME_CTRL_LIVE:
@@ -3071,6 +3074,8 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
 	if (nvme_fc_ctlr_active_on_rport(ctrl))
 		return -ENOTUNIQ;

+	clear_bit(CONNECTIVITY_LOST, &ctrl->flags);
+
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: create association : host wwpn 0x%016llx "
 		" rport wwpn 0x%016llx: NQN \"%s\"\n",
@@ -3174,6 +3179,11 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)

 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);

+	if (test_bit(CONNECTIVITY_LOST, &ctrl->flags)) {
+		ret = -EIO;
+		goto out_term_aeo_ops;
+	}
+
 	ctrl->ctrl.nr_reconnects = 0;

 	if (changed)


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2024-12-24 10:35       ` Sagi Grimberg
@ 2025-01-07 14:40         ` Daniel Wagner
  2025-01-08 10:51           ` Sagi Grimberg
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Wagner @ 2025-01-07 14:40 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Hannes Reinecke, James Smart, Keith Busch, Christoph Hellwig,
	Paul Ely, linux-nvme, linux-kernel

On Tue, Dec 24, 2024 at 12:35:23PM +0200, Sagi Grimberg wrote:
> On 17/12/2024 10:35, Daniel Wagner wrote:
> > On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
> > > > +	/*
> > > > +	 * It's either a kernel error or the host observed a connection
> > > > +	 * loss. In either case it's not possible to communicate with the
> > > > +	 * controller and thus enter the error code path.
> > > > +	 */
> > > > +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
> > > >    		return status;
> > > >    	/*
> > > > 
> > > Hmm. Maybe checking for NVME_SC_DNR, too?
> > if no one complains I'll update the check to:
> > 
> > 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
> > 	    status == NVME_SC_HOST_PATH_ERROR)
> > 		return status;
> > 
> > okay?
> Why do we care about the DNR? are you going to retry based on it?

I don't know if we should care. Hannes brought this up.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/3] nvme: trigger reset when keep alive fails
  2025-01-07 14:38     ` Daniel Wagner
@ 2025-01-08 10:50       ` Sagi Grimberg
  0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2025-01-08 10:50 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Daniel Wagner, James Smart, Keith Busch, Christoph Hellwig,
	Hannes Reinecke, Paul Ely, linux-nvme, linux-kernel




On 07/01/2025 16:38, Daniel Wagner wrote:
> On Tue, Dec 24, 2024 at 12:31:35PM +0200, Sagi Grimberg wrote:
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index bfd71511c85f8b1a9508c6ea062475ff51bf27fe..2a07c2c540b26c8cbe886711abaf6f0afbe6c4df 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -1320,6 +1320,12 @@ static enum rq_end_io_ret nvme_keep_alive_end_io(struct request *rq,
>>>    		dev_err(ctrl->device,
>>>    			"failed nvme_keep_alive_end_io error=%d\n",
>>>    				status);
>>> +		/*
>>> +		 * The driver reports that we lost the connection;
>>> +		 * trigger a recovery.
>>> +		 */
>>> +		if (status == BLK_STS_TRANSPORT)
>>> +			nvme_reset_ctrl(ctrl);
>>>    		return RQ_END_IO_NONE;
>>>    	}
>>>
>> A lengthy explanation that results in nvme core behavior that assumes a very
>> specific driver behavior.
> I tried to explain exactly what's going on, so we can discuss possible
> solutions without talking past each other.
>
> In the meantime I started on a patch set for the TP4129 related changes
> in the spec (KATO Corrections and Clarifications). These changes would
> also depend on the KATO timeout handler triggering a reset.
>
> I am fine with dropping this change for now and discussing it in light
> of TP4129, if that is what you prefer.
>
>> Isn't the root of the problem that FC is willing to live peacefully
>> with a controller without any queues/connectivity to it without
>> periodically reconnecting?
> The root problem is that the connectivity loss event gets ignored in the
> CONNECTING state during the first connection attempt. Everything works
> fine in the RECONNECTING state.
>
> Maybe something like this instead? (untested)
>
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index c4cbe3ce81f7..1f1d1d62a978 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -148,6 +148,7 @@ struct nvme_fc_rport {
>   #define ASSOC_ACTIVE		0
>   #define ASSOC_FAILED		1
>   #define FCCTRL_TERMIO		2
> +#define CONNECTIVITY_LOST	3
>
>   struct nvme_fc_ctrl {
>   	spinlock_t		lock;
> @@ -785,6 +786,8 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
>   		"NVME-FC{%d}: controller connectivity lost. Awaiting "
>   		"Reconnect", ctrl->cnum);
>
> +	set_bit(CONNECTIVITY_LOST, &ctrl->flags);
> +
>   	switch (nvme_ctrl_state(&ctrl->ctrl)) {
>   	case NVME_CTRL_NEW:
>   	case NVME_CTRL_LIVE:
> @@ -3071,6 +3074,8 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>   	if (nvme_fc_ctlr_active_on_rport(ctrl))
>   		return -ENOTUNIQ;
>
> +	clear_bit(CONNECTIVITY_LOST, &ctrl->flags);
> +
>   	dev_info(ctrl->ctrl.device,
>   		"NVME-FC{%d}: create association : host wwpn 0x%016llx "
>   		" rport wwpn 0x%016llx: NQN \"%s\"\n",
> @@ -3174,6 +3179,11 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
>
>   	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
>
> +	if (test_bit(CONNECTIVITY_LOST, &ctrl->flags)) {
> +		ret = -EIO;
> +		goto out_term_aeo_ops;
> +	}
> +
>   	ctrl->ctrl.nr_reconnects = 0;
>
>   	if (changed)

This looks a lot better to me.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count
  2025-01-07 14:40         ` Daniel Wagner
@ 2025-01-08 10:51           ` Sagi Grimberg
  0 siblings, 0 replies; 18+ messages in thread
From: Sagi Grimberg @ 2025-01-08 10:51 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Hannes Reinecke, James Smart, Keith Busch, Christoph Hellwig,
	Paul Ely, linux-nvme, linux-kernel




On 07/01/2025 16:40, Daniel Wagner wrote:
> On Tue, Dec 24, 2024 at 12:35:23PM +0200, Sagi Grimberg wrote:
>> On 17/12/2024 10:35, Daniel Wagner wrote:
>>> On Fri, Nov 29, 2024 at 12:10:33PM +0100, Hannes Reinecke wrote:
>>>>> +	/*
>>>>> +	 * It's either a kernel error or the host observed a connection
>>>>> +	 * loss. In either case it's not possible to communicate with the
>>>>> +	 * controller and thus enter the error code path.
>>>>> +	 */
>>>>> +	if (status < 0 || status == NVME_SC_HOST_PATH_ERROR)
>>>>>     		return status;
>>>>>     	/*
>>>>>
>>>> Hmm. Maybe checking for NVME_SC_DNR, too?
>>> if no one complains I'll update the check to:
>>>
>>> 	if (status < 0 || (status > 0 && (status & NVME_STATUS_DNR)) ||
>>> 	    status == NVME_SC_HOST_PATH_ERROR)
>>> 		return status;
>>>
>>> okay?
>> Why do we care about the DNR? are you going to retry based on it?
> I don't know if we should care. Hannes brought this up.

I don't see why we should care.


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-01-08 10:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-29  9:28 [PATCH v3 0/3] nvme-fc: fix race with connectivity loss and nvme_fc_create_association Daniel Wagner
2024-11-29  9:28 ` [PATCH v3 1/3] nvme-fc: go straight to connecting state when initializing Daniel Wagner
2024-11-29  9:28 ` [PATCH v3 2/3] nvme: trigger reset when keep alive fails Daniel Wagner
2024-11-29 11:09   ` Hannes Reinecke
2024-12-09 13:36   ` Christoph Hellwig
2024-12-24 10:31   ` Sagi Grimberg
2025-01-07 14:38     ` Daniel Wagner
2025-01-08 10:50       ` Sagi Grimberg
2024-11-29  9:28 ` [PATCH v3 3/3] nvme: handle connectivity loss in nvme_set_queue_count Daniel Wagner
2024-11-29 11:10   ` Hannes Reinecke
2024-12-17  8:35     ` Daniel Wagner
2024-12-17  9:45       ` Hannes Reinecke
2024-12-17 14:01         ` Daniel Wagner
2024-12-20  8:32           ` Hannes Reinecke
2024-12-24 10:35       ` Sagi Grimberg
2025-01-07 14:40         ` Daniel Wagner
2025-01-08 10:51           ` Sagi Grimberg
2024-12-24 10:39   ` Sagi Grimberg
