[PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING
@ 2024-01-12 10:09 hare
  2024-01-14  2:53 ` Max Gurtovoy
  2024-01-15 12:36 ` Sagi Grimberg
  0 siblings, 2 replies; 4+ messages in thread
From: hare @ 2024-01-12 10:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

Terminating commands from the timeout handler might lead
to a data corruption as the timeout might trigger before
KATO expired.
When several commands have been sent in a batch and
the command timeouts trigger just after the keep-alive
command has been sent then the first command will trigger
the error recovery. But all other commands will timeout
directly afterwards and will hit the timeout handler
before the err_work workqueue handler has started.
This results in these commands being aborted and
immediately retried without waiting for KATO.
So return BLK_EH_RESET_TIMER for I/O commands when
the controller is in 'RESETTING' or 'DELETING'
state to ensure that the commands will
be retried only after the KATO interval.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 08805f027810..9dcb2d3b123c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2431,17 +2431,27 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
 	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
 	u8 opc = pdu->cmd.common.opcode, fctype = pdu->cmd.fabrics.fctype;
 	int qid = nvme_tcp_queue_id(req->queue);
+	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
 
 	dev_warn(ctrl->device,
-		"queue %d: timeout cid %#x type %d opcode %#x (%s)\n",
+		"queue %d: timeout cid %#x type %d opcode %#x (%s) state %d\n",
 		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
-		opc, nvme_opcode_str(qid, opc, fctype));
+		opc, nvme_opcode_str(qid, opc, fctype), state);
+
+	/*
+	 * If the controller is in state RESETTING or DELETING all
+	 * inflight commands will be terminated soon which in turn
+	 * may failover to a different path.
+	 */
+	if ((state == NVME_CTRL_RESETTING ||
+	     state == NVME_CTRL_DELETING) && qid > 0)
+		return BLK_EH_RESET_TIMER;
 
-	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
+	if (state != NVME_CTRL_LIVE) {
 		/*
-		 * If we are resetting, connecting or deleting we should
-		 * complete immediately because we may block controller
-		 * teardown or setup sequence
+		 * If the controller is not live we should complete
+		 * immediately because we may block controller teardown
+		 * or setup sequence
 		 * - ctrl disable/shutdown fabrics requests
 		 * - connect requests
 		 * - initialization admin requests
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING
  2024-01-12 10:09 [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING hare
@ 2024-01-14  2:53 ` Max Gurtovoy
  2024-01-15  6:22   ` Hannes Reinecke
  2024-01-15 12:36 ` Sagi Grimberg
  1 sibling, 1 reply; 4+ messages in thread
From: Max Gurtovoy @ 2024-01-14  2:53 UTC (permalink / raw)
  To: hare, Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme, Hannes Reinecke

Hi Hannes,

On 12/01/2024 12:09, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> Terminating commands from the timeout handler might lead
> to a data corruption as the timeout might trigger before
> KATO expired.

Can you please explain the data corruption and how this patch is fixing it ?

> When several commands have been sent in a batch and
> the command timeouts trigger just after the keep-alive
> command has been sent then the first command will trigger
> the error recovery. But all other commands will timeout
> directly afterwards and will hit the timeout handler
> before the err_work workqueue handler has started.
> This results in these commands being aborted and
> immediately retried without waiting for KATO.
> So return BLK_EH_RESET_TIMER for I/O commands when
> the controller is in 'RESETTING' or 'DELETING'
> state to ensure that the commands will
> be retried only after the KATO interval.

I'm not sure I understand how does KATO and reconnect_delay are related ?

> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 22 ++++++++++++++++------
>   1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 08805f027810..9dcb2d3b123c 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2431,17 +2431,27 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>   	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
>   	u8 opc = pdu->cmd.common.opcode, fctype = pdu->cmd.fabrics.fctype;
>   	int qid = nvme_tcp_queue_id(req->queue);
> +	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>   
>   	dev_warn(ctrl->device,
> -		"queue %d: timeout cid %#x type %d opcode %#x (%s)\n",
> +		"queue %d: timeout cid %#x type %d opcode %#x (%s) state %d\n",
>   		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
> -		opc, nvme_opcode_str(qid, opc, fctype));
> +		opc, nvme_opcode_str(qid, opc, fctype), state);
> +
> +	/*
> +	 * If the controller is in state RESETTING or DELETING all
> +	 * inflight commands will be terminated soon which in turn
> +	 * may failover to a different path.
> +	 */
> +	if ((state == NVME_CTRL_RESETTING ||
> +	     state == NVME_CTRL_DELETING) && qid > 0)
> +		return BLK_EH_RESET_TIMER;
>   
> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> +	if (state != NVME_CTRL_LIVE) {
>   		/*
> -		 * If we are resetting, connecting or deleting we should
> -		 * complete immediately because we may block controller
> -		 * teardown or setup sequence
> +		 * If the controller is not live we should complete
> +		 * immediately because we may block controller teardown
> +		 * or setup sequence
>   		 * - ctrl disable/shutdown fabrics requests
>   		 * - connect requests
>   		 * - initialization admin requests


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING
  2024-01-14  2:53 ` Max Gurtovoy
@ 2024-01-15  6:22   ` Hannes Reinecke
  0 siblings, 0 replies; 4+ messages in thread
From: Hannes Reinecke @ 2024-01-15  6:22 UTC (permalink / raw)
  To: Max Gurtovoy, hare, Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, linux-nvme

On 1/14/24 03:53, Max Gurtovoy wrote:
> Hi Hannes,
> 
> On 12/01/2024 12:09, hare@kernel.org wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> Terminating commands from the timeout handler might lead
>> to a data corruption as the timeout might trigger before
>> KATO expired.
> 
> Can you please explain the data corruption and how this patch is fixing 
> it ?
> 
It's like this:
n: connection breaks
n + 1: send command 1,2,3, tmo 30s
n + 2: send keep-alive, tmo 30s
n + 31: command timeout
  - command 1 timeout:
    queue error recovery workqueue,
    return BLK_EH_RESET_TIMER
  - command 2 timeout:
    abort command
    retry command
  - start error recovery workqueue, abort command 1&3
n + 32: KATO expires, commands are retried

Now command 2 was aborted directly, and will be retried
on a different path without waiting for KATO.
So command 2 will be sent to the controller while the controller
is not aware that a KATO timeout had happened.
This not only violates the spec (which states that we should only
retry commands after a KATO timeout happened), but there are
some controller implementations which need to clear up internal
state once KATO triggered. And on those implementations we see
a data corruption.

As discussed elsewhere these controllers really should be implement
CQAT, but that is no excuse for us violating the spec.

>> When several commands have been sent in a batch and
>> the command timeouts trigger just after the keep-alive
>> command has been sent then the first command will trigger
>> the error recovery. But all other commands will timeout
>> directly afterwards and will hit the timeout handler
>> before the err_work workqueue handler has started.
>> This results in these commands being aborted and
>> immediately retried without waiting for KATO.
>> So return BLK_EH_RESET_TIMER for I/O commands when
>> the controller is in 'RESETTING' or 'DELETING'
>> state to ensure that the commands will
>> be retried only after the KATO interval.
> 
> I'm not sure I understand how does KATO and reconnect_delay are related ?
> 
Not sure I'm following; reconnection delay doesn't come into it...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING
  2024-01-12 10:09 [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING hare
  2024-01-14  2:53 ` Max Gurtovoy
@ 2024-01-15 12:36 ` Sagi Grimberg
  1 sibling, 0 replies; 4+ messages in thread
From: Sagi Grimberg @ 2024-01-15 12:36 UTC (permalink / raw)
  To: hare, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Hannes Reinecke



On 1/12/24 12:09, hare@kernel.org wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> Terminating commands from the timeout handler might lead
> to a data corruption as the timeout might trigger before
> KATO expired.
> When several commands have been sent in a batch and
> the command timeouts trigger just after the keep-alive
> command has been sent then the first command will trigger
> the error recovery. But all other commands will timeout
> directly afterwards and will hit the timeout handler
> before the err_work workqueue handler has started.
> This results in these commands being aborted and
> immediately retried without waiting for KATO.
> So return BLK_EH_RESET_TIMER for I/O commands when
> the controller is in 'RESETTING' or 'DELETING'
> state to ensure that the commands will
> be retried only after the KATO interval.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 22 ++++++++++++++++------
>   1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 08805f027810..9dcb2d3b123c 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2431,17 +2431,27 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>   	struct nvme_tcp_cmd_pdu *pdu = nvme_tcp_req_cmd_pdu(req);
>   	u8 opc = pdu->cmd.common.opcode, fctype = pdu->cmd.fabrics.fctype;
>   	int qid = nvme_tcp_queue_id(req->queue);
> +	enum nvme_ctrl_state state = nvme_ctrl_state(ctrl);
>   
>   	dev_warn(ctrl->device,
> -		"queue %d: timeout cid %#x type %d opcode %#x (%s)\n",
> +		"queue %d: timeout cid %#x type %d opcode %#x (%s) state %d\n",
>   		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
> -		opc, nvme_opcode_str(qid, opc, fctype));
> +		opc, nvme_opcode_str(qid, opc, fctype), state);
> +
> +	/*
> +	 * If the controller is in state RESETTING or DELETING all
> +	 * inflight commands will be terminated soon which in turn
> +	 * may failover to a different path.
> +	 */
> +	if ((state == NVME_CTRL_RESETTING ||
> +	     state == NVME_CTRL_DELETING) && qid > 0)
> +		return BLK_EH_RESET_TIMER;
>   
> -	if (nvme_ctrl_state(ctrl) != NVME_CTRL_LIVE) {
> +	if (state != NVME_CTRL_LIVE) {

The remaining states are:
NVME_CTRL_NEW
NVME_CTRL_CONNECTING
NVME_CTRL_DELETING_NOIO
NVME_CTRL_DEAD

The comment perhaps is becoming outdated?

I'm wandering if it would make sense to turn this
into a switch-case statement:
--
	switch (state) {
	case NVME_CTRL_LIVE:
		/*
		 * LIVE state should trigger the normal error recovery
		 * which will handle completing this request.
          	 */
		break;
	case NVME_CTRL_RESETTING:
		/* fallthru */
	case NVME_CTRL_DELETING:
		if (!qid)
			break;
		/*
		 * If the controller is in state RESETTING or DELETING
		 * all inflight I/O commands will be terminated soon
		 * which in turn may failover to a different path.
		 */
		return BLK_EH_RESET_TIMER;
	case NVME_CTRL_DELETING_NOIO;
	case NVME_CTRL_DEAD;
		/* fallthru */
	case NVME_CTRL_CONNECTING;
		/*
		 * Fail immediately to avoid a deadlock if the request
		 * is blocking as part of the ctrl setup or teardown
		 * sequence (e.g. connect, property_set).
		 */
		nvme_tcp_complete_timed_out(rq);
		return BLK_EH_DONE;
	case NVME_CTRL_NEW:
		/* fallthru */
	default:
		WARN_ONCE();
		return BLK_EH_RESET_TIMER;
	}

	nvme_tcp_error_recovery(ctrl);
	return BLK_EH_RESET_TIMER;
--

Thoughts?


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-01-15 12:36 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-12 10:09 [PATCHv2] nvme-tcp: Do not terminate i/o commands during RESETTING hare
2024-01-14  2:53 ` Max Gurtovoy
2024-01-15  6:22   ` Hannes Reinecke
2024-01-15 12:36 ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.