public inbox for linux-nvme@lists.infradead.org
* [PATCH 0/3] nvme-tcp: start error recovery after KATO
@ 2023-09-08 10:00 Hannes Reinecke
  2023-09-08 10:00 ` [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-08 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

there have been some very insistent reports of data corruption
with certain target implementations due to command retries.
Problem here is that for TCP we're starting error recovery
immediately after either a command timeout or a (local) link loss.
That is contrary to the NVMe base spec, which states in
section 3.9:

 If a Keep Alive Timer expires:
  a) the controller shall ...
  and
  b) the host assumes all outstanding commands are not completed
     and re-issues commands as appropriate.

I.e. we should retry commands only after KATO has expired.
With this patchset we always wait until KATO has expired before
starting error recovery. This causes a longer delay until
failed commands are retried, but that's kinda the point
of this patchset :-)

As usual, comments and reviews are welcome.

Hannes Reinecke (3):
  nvme-tcp: Do not terminate commands when in RESETTING
  nvme-tcp: make 'err_work' a delayed work
  nvme-tcp: delay error recovery until the next KATO interval

 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/tcp.c  | 29 +++++++++++++++++++++++------
 3 files changed, 26 insertions(+), 7 deletions(-)

-- 
2.35.3



^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-08 10:00 [PATCH 0/3] nvme-tcp: start error recovery after KATO Hannes Reinecke
@ 2023-09-08 10:00 ` Hannes Reinecke
  2023-09-12 11:54   ` Sagi Grimberg
  2023-09-08 10:00 ` [PATCH 2/3] nvme-tcp: make 'err_work' a delayed work Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-08 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Once error recovery has started it will terminate all commands
anyway, so we should just return BLK_EH_RESET_TIMER on a
command timeout.
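For illustration, the intended decision can be modelled in plain, self-contained C. The type and function names below are mocks (the real driver checks the nvme_ctrl state and work_pending() on err_work), so treat this as a sketch of the logic, not driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock controller states; the driver uses enum nvme_ctrl_state. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING };

/* Mock of the blk_eh_timer_return outcomes relevant here. */
enum eh_ret { EH_RESET_TIMER, EH_ABORT, EH_START_RECOVERY };

/*
 * If error recovery is already pending while we are RESETTING,
 * rearm the timer and let the recovery work terminate the command;
 * otherwise keep the existing behaviour.
 */
static enum eh_ret tcp_timeout(enum ctrl_state state, bool err_work_pending)
{
	if (state == CTRL_RESETTING && err_work_pending)
		return EH_RESET_TIMER;
	if (state != CTRL_LIVE)
		return EH_ABORT;	/* complete with a host-path error */
	return EH_START_RECOVERY;	/* kick off error recovery */
}
```

With this, only the first timed-out command starts recovery; subsequent timeouts merely rearm their timers.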

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 5b332d9f87fc..1faef1cf5c94 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
 		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
 		opc, nvme_opcode_str(qid, opc, fctype));
 
+	/*
+	 * If the error recovery is started all commands will be
+	 * aborted anyway, and nothing is to be done here.
+	 */
+	if (ctrl->state == NVME_CTRL_RESETTING &&
+	    work_pending(&to_tcp_ctrl(ctrl)->err_work))
+		return BLK_EH_RESET_TIMER;
+
 	if (ctrl->state != NVME_CTRL_LIVE) {
 		/*
 		 * If we are resetting, connecting or deleting we should
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/3] nvme-tcp: make 'err_work' a delayed work
  2023-09-08 10:00 [PATCH 0/3] nvme-tcp: start error recovery after KATO Hannes Reinecke
  2023-09-08 10:00 ` [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING Hannes Reinecke
@ 2023-09-08 10:00 ` Hannes Reinecke
  2023-09-08 10:00 ` [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval Hannes Reinecke
  2023-09-12 11:51 ` [PATCH 0/3] nvme-tcp: start error recovery after KATO Sagi Grimberg
  3 siblings, 0 replies; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-08 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Move 'err_work' to a delayed work such that the error recovery
can be invoked at a later time.
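Why this conversion works: delayed_work embeds a work_struct, so a work callback (which still receives a work_struct pointer) recovers the enclosing controller via to_delayed_work() plus container_of(). A self-contained userspace model with mock types (hypothetical names, not the actual driver structs):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mocks of the kernel work types. */
struct work_struct { int pending; };
struct delayed_work { struct work_struct work; /* timer omitted */ };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct delayed_work *to_delayed_work(struct work_struct *work)
{
	return container_of(work, struct delayed_work, work);
}

/* Stand-in for struct nvme_tcp_ctrl with the converted err_work. */
struct tcp_ctrl { int id; struct delayed_work err_work; };

/* The callback gets a work_struct *, hence the double hop back. */
static struct tcp_ctrl *ctrl_from_work(struct work_struct *work)
{
	return container_of(to_delayed_work(work), struct tcp_ctrl, err_work);
}
```

This mirrors the container_of(to_delayed_work(work), ...) change in nvme_tcp_error_recovery_work() below.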

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/tcp.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1faef1cf5c94..990bd863ae34 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -166,7 +166,7 @@ struct nvme_tcp_ctrl {
 	struct sockaddr_storage src_addr;
 	struct nvme_ctrl	ctrl;
 
-	struct work_struct	err_work;
+	struct delayed_work	err_work;
 	struct delayed_work	connect_work;
 	struct nvme_tcp_request async_req;
 	u32			io_queues[HCTX_MAX_TYPES];
@@ -539,7 +539,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 		return;
 
 	dev_warn(ctrl->device, "starting error recovery\n");
-	queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
+	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
 }
 
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
@@ -2109,7 +2109,7 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
 
 static void nvme_tcp_error_recovery_work(struct work_struct *work)
 {
-	struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
+	struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
 				struct nvme_tcp_ctrl, err_work);
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
 
@@ -2172,7 +2172,7 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
 
 static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
 {
-	flush_work(&to_tcp_ctrl(ctrl)->err_work);
+	flush_delayed_work(&to_tcp_ctrl(ctrl)->err_work);
 	cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
 }
 
@@ -2279,7 +2279,7 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
 	 * aborted anyway, and nothing is to be done here.
 	 */
 	if (ctrl->state == NVME_CTRL_RESETTING &&
-	    work_pending(&to_tcp_ctrl(ctrl)->err_work))
+	    delayed_work_pending(&to_tcp_ctrl(ctrl)->err_work))
 		return BLK_EH_RESET_TIMER;
 
 	if (ctrl->state != NVME_CTRL_LIVE) {
@@ -2533,7 +2533,8 @@ static struct nvme_ctrl *nvme_tcp_create_ctrl(struct device *dev,
 
 	INIT_DELAYED_WORK(&ctrl->connect_work,
 			nvme_tcp_reconnect_ctrl_work);
-	INIT_WORK(&ctrl->err_work, nvme_tcp_error_recovery_work);
+	INIT_DELAYED_WORK(&ctrl->err_work,
+			  nvme_tcp_error_recovery_work);
 	INIT_WORK(&ctrl->ctrl.reset_work, nvme_reset_ctrl_work);
 
 	if (!(opts->mask & NVMF_OPT_TRSVCID)) {
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval
  2023-09-08 10:00 [PATCH 0/3] nvme-tcp: start error recovery after KATO Hannes Reinecke
  2023-09-08 10:00 ` [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING Hannes Reinecke
  2023-09-08 10:00 ` [PATCH 2/3] nvme-tcp: make 'err_work' a delayed work Hannes Reinecke
@ 2023-09-08 10:00 ` Hannes Reinecke
  2023-09-12 12:17   ` Sagi Grimberg
  2023-09-12 11:51 ` [PATCH 0/3] nvme-tcp: start error recovery after KATO Sagi Grimberg
  3 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-08 10:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Section 3.9 of the NVMe base spec states that:

 If a Keep Alive Timer expires:
 a) the controller shall ...
 and
 b) the host assumes all outstanding commands are not completed
   and re-issues commands as appropriate.

which means that we should _not_ retry any commands until KATO
has expired (or the equivalent of the default KATO timeout if
KATO is not active).
And there are some targets which report data corruption if we
retry commands immediately after a command timeout or link loss.
So always delay error recovery until the next KATO interval
to give controllers enough time to detect a KATO failure.
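The delay reused here is nvme_keep_alive_work_period() from core.c: half of KATO, halved again when (to my understanding of the elided condition in the hunk below) Traffic Based Keep Alive is in use. A standalone model, assuming HZ of 1000 purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define HZ 1000UL	/* assumed tick rate for this example */

/*
 * Model of nvme_keep_alive_work_period(): the host sends keep-alives
 * at half the Keep Alive Timeout; with traffic based keep alive the
 * interval is halved again so command activity is sampled often
 * enough.
 */
static unsigned long keep_alive_work_period(unsigned int kato_sec, bool tbkas)
{
	unsigned long delay = kato_sec * HZ / 2;

	if (tbkas)
		delay /= 2;
	return delay;
}
```

So with the default 5 second KATO, error recovery is delayed by roughly 2.5 seconds (1.25 with traffic based keep alive) worth of jiffies.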

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/tcp.c  | 12 ++++++++++--
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3a01b79148c..2d546612b20a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1175,7 +1175,7 @@ EXPORT_SYMBOL_NS_GPL(nvme_passthru_end, NVME_TARGET_PASSTHRU);
  *   The host should send Keep Alive commands at half of the Keep Alive Timeout
  *   accounting for transport roundtrip times [..].
  */
-static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
+unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
 {
 	unsigned long delay = ctrl->kato * HZ / 2;
 
@@ -1189,6 +1189,7 @@ static unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl)
 		delay /= 2;
 	return delay;
 }
+EXPORT_SYMBOL_GPL(nvme_keep_alive_work_period);
 
 static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..38845e5fa5a3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -777,6 +777,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 int nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
 void nvme_start_freeze(struct nvme_ctrl *ctrl);
+unsigned long nvme_keep_alive_work_period(struct nvme_ctrl *ctrl);
 
 static inline enum req_op nvme_req_op(struct nvme_command *cmd)
 {
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 990bd863ae34..7bfdb05b2f17 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -533,13 +533,21 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
 	queue->ddgst_remaining = 0;
 }
 
+/*
+ * Error recovery needs to be started after KATO has expired,
+ * so always delay until the next KATO interval before
+ * starting error recovery.
+ */
 static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
 {
+	unsigned long delay = nvme_keep_alive_work_period(ctrl);
+
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
 		return;
 
-	dev_warn(ctrl->device, "starting error recovery\n");
-	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
+	dev_warn(ctrl->device, "starting error recovery in %lu seconds\n",
+		 delay / HZ);
+	queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, delay);
 }
 
 static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-08 10:00 [PATCH 0/3] nvme-tcp: start error recovery after KATO Hannes Reinecke
                   ` (2 preceding siblings ...)
  2023-09-08 10:00 ` [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval Hannes Reinecke
@ 2023-09-12 11:51 ` Sagi Grimberg
  2023-09-12 11:56   ` Hannes Reinecke
  3 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 11:51 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

> Hi all,
> 
> there have been some very insistent reports of data corruption
> with certain target implementations due to command retries.

None of which were reported on this list...

> Problem here is that for TCP we're starting error recovery
> immediately after either a command timeout or a (local) link loss.

It does so only on one occasion: when the user triggered a
reset_controller. A command timeout is greater than the default
kato (6 times in fact); was this the case where the issue was
observed? If so, the timeout handler should probably just wait
for the remaining kato time.

BTW, the same happens for rdma as well. Nothing should be
tcp specific here afaict.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-08 10:00 ` [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING Hannes Reinecke
@ 2023-09-12 11:54   ` Sagi Grimberg
  2023-09-12 12:06     ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 11:54 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


> Once error recovery has started it will terminate all commands
> anyway, so we should just return BLK_EH_RESET_TIMER on a
> command timeout.

Does this fix a real issue? the timeout handler and the error
recovery should be mutually exclusive.

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/nvme/host/tcp.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 5b332d9f87fc..1faef1cf5c94 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq)
>   		nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
>   		opc, nvme_opcode_str(qid, opc, fctype));
>   
> +	/*
> +	 * If the error recovery is started all commands will be
> +	 * aborted anyway, and nothing is to be done here.
> +	 */
> +	if (ctrl->state == NVME_CTRL_RESETTING &&
> +	    work_pending(&to_tcp_ctrl(ctrl)->err_work))
> +		return BLK_EH_RESET_TIMER;

What does the work_pending() check give you here?

> +
>   	if (ctrl->state != NVME_CTRL_LIVE) {
>   		/*
>   		 * If we are resetting, connecting or deleting we should


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-12 11:51 ` [PATCH 0/3] nvme-tcp: start error recovery after KATO Sagi Grimberg
@ 2023-09-12 11:56   ` Hannes Reinecke
  2023-09-12 12:12     ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-12 11:56 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/12/23 13:51, Sagi Grimberg wrote:
>> Hi all,
>>
>> there have been some very insistent reports of data corruption
>> with certain target implementations due to command retries.
> 
> None of which were reported on this list...
> 
Correct. I can ask them to post their finding here if that would make a 
difference.

>> Problem here is that for TCP we're starting error recovery
>> immediately after either a command timeout or a (local) link loss.
> 
> It does so only on one occasion: when the user triggered a
> reset_controller. A command timeout is greater than the default
> kato (6 times in fact); was this the case where the issue was
> observed? If so, the timeout handler should probably just wait
> for the remaining kato time.
> 
Nothing to do with reset_controller.
The problem really is in nvme_tcp_state_change(), which will blindly 
start error recovery (and a subsequent command retry) whenever the local 
link drops. And that error recovery does _not_ wait for KATO, but
rather causes an immediate retry.

> BTW, the same happens for rdma as well. Nothing should be
> tcp specific here afaict.

Oh, sure. Will be modifying the patch to include that.

Cheers,

Hannes



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-12 11:54   ` Sagi Grimberg
@ 2023-09-12 12:06     ` Hannes Reinecke
  2023-09-12 12:15       ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-12 12:06 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/12/23 13:54, Sagi Grimberg wrote:
> 
>> Once error recovery has started it will terminate all commands
>> anyway, so we should just return BLK_EH_RESET_TIMER on a
>> command timeout.
> 
> Does this fix a real issue? the timeout handler and the error
> recovery should be mutually exclusive.
> 
They are not. When several timeouts are triggering at the same time,
only the first will start error recovery; the remaining ones will
still hit the timeout function until cancel_tagset() is called.

>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 5b332d9f87fc..1faef1cf5c94 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return 
>> nvme_tcp_timeout(struct request *rq)
>>           nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
>>           opc, nvme_opcode_str(qid, opc, fctype));
>> +    /*
>> +     * If the error recovery is started all commands will be
>> +     * aborted anyway, and nothing is to be done here.
>> +     */
>> +    if (ctrl->state == NVME_CTRL_RESETTING &&
>> +        work_pending(&to_tcp_ctrl(ctrl)->err_work))
>> +        return BLK_EH_RESET_TIMER;
> 
> What does the work_pending() check give you here?
> 
'RESETTING' is also entered for firmware download, so we need to check
for work_pending() to find out if the error recovery engaged.

Cheers,

Hannes



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-12 11:56   ` Hannes Reinecke
@ 2023-09-12 12:12     ` Sagi Grimberg
  2023-09-12 13:25       ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 12:12 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> Hi all,
>>>
>>> there have been some very insistent reports of data corruption
>>> with certain target implementations due to command retries.
>>
>> None of which were reported on this list...
>>
> Correct. I can ask them to post their finding here if that would make a 
> difference.
> 
>>> Problem here is that for TCP we're starting error recovery
>>> immediately after either a command timeout or a (local) link loss.
>>
>> It does so only on one occasion: when the user triggered a
>> reset_controller. A command timeout is greater than the default
>> kato (6 times in fact); was this the case where the issue was
>> observed? If so, the timeout handler should probably just wait
>> for the remaining kato time.
>>
> Nothing to do with reset_controller.
> The problem really is in nvme_tcp_state_change(), which will blindly 
> start error recovery (and a subsequent command retry) whenever the local 
> link drops.

That is incorrect. A link drop does not trigger a socket state change.
A state change is triggered for specific socket states
(see net/ipv4/tcp.c for a description of socket states).

These are a set of states that describes local and remote sides
shutting down (or a TIME_WAIT). The cases where the host
triggers error recovery without the controller acknowledging it
(in case the link is down) are FIN_WAIT1/2, and these happen
when the user triggers a controller reset/disconnect.

In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
which should indicate the controller acknowledged the shutdown.

> And that error recovery does _not_ wait for KATO, but rather
> causes an immediate retry.

We can't blindly wait for KATO. If the controller saw the socket
closure, there is no point waiting kato.

If the user sets KATO to be 3 hours, and does a controller reset,
what happens now?

Only when the host triggers the shutdown, and that is not acknowledged
by the controller, then yes, the host needs to wait the remaining of
kato.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-12 12:06     ` Hannes Reinecke
@ 2023-09-12 12:15       ` Sagi Grimberg
  2023-09-12 12:59         ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 12:15 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> Once error recovery has started it will terminate all commands
>>> anyway, so we should just return BLK_EH_RESET_TIMER on a
>>> command timeout.
>>
>> Does this fix a real issue? the timeout handler and the error
>> recovery should be mutually exclusive.
>>
> They are not. When several timeouts are triggering at the same time,
> only the first will start error recovery; the remaining ones will
> still hit the timeout function until cancel_tagset() is called.

Yes, but they do not access any resources unsynchronized. Meaning
is this a use-after-free issue?

> 
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>> ---
>>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>>   1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 5b332d9f87fc..1faef1cf5c94 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return 
>>> nvme_tcp_timeout(struct request *rq)
>>>           nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
>>>           opc, nvme_opcode_str(qid, opc, fctype));
>>> +    /*
>>> +     * If the error recovery is started all commands will be
>>> +     * aborted anyway, and nothing is to be done here.
>>> +     */
>>> +    if (ctrl->state == NVME_CTRL_RESETTING &&
>>> +        work_pending(&to_tcp_ctrl(ctrl)->err_work))
>>> +        return BLK_EH_RESET_TIMER;
>>
>> What does the work_pending() check give you here?
>>
> 'RESETTING' is also entered for firmware download, so we need to check
> for work_pending() to find out if the error recovery engaged.

firmware download? what happens if err_work is yet to run when
this request times out?

Not sure what this is protecting against.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval
  2023-09-08 10:00 ` [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval Hannes Reinecke
@ 2023-09-12 12:17   ` Sagi Grimberg
  0 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 12:17 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

Please see my questions elsewhere in the thread.
More information is needed on what exactly is happening.
It is not clear to me at all.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-12 12:15       ` Sagi Grimberg
@ 2023-09-12 12:59         ` Hannes Reinecke
  2023-09-12 13:02           ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-12 12:59 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/12/23 14:15, Sagi Grimberg wrote:
> 
>>>> Once error recovery has started it will terminate all commands
>>>> anyway, so we should just return BLK_EH_RESET_TIMER on a
>>>> command timeout.
>>>
>>> Does this fix a real issue? the timeout handler and the error
>>> recovery should be mutually exclusive.
>>>
>> They are not. When several timeouts are triggering at the same time,
>> only the first will start error recovery; the remaining ones will
>> still hit the timeout function until cancel_tagset() is called.
> 
> Yes, but they do not access any resources unsynchronized. Meaning
> is this a use-after-free issue?
> 
This patch itself is really just an optimization. Sure, the current
version works, but the point is that we don't need to cancel the
requests manually as they'll be handled by cancel_tagset() anyway.

>>
>>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>>>> ---
>>>>   drivers/nvme/host/tcp.c | 8 ++++++++
>>>>   1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 5b332d9f87fc..1faef1cf5c94 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -2274,6 +2274,14 @@ static enum blk_eh_timer_return 
>>>> nvme_tcp_timeout(struct request *rq)
>>>>           nvme_tcp_queue_id(req->queue), nvme_cid(rq), pdu->hdr.type,
>>>>           opc, nvme_opcode_str(qid, opc, fctype));
>>>> +    /*
>>>> +     * If the error recovery is started all commands will be
>>>> +     * aborted anyway, and nothing is to be done here.
>>>> +     */
>>>> +    if (ctrl->state == NVME_CTRL_RESETTING &&
>>>> +        work_pending(&to_tcp_ctrl(ctrl)->err_work))
>>>> +        return BLK_EH_RESET_TIMER;
>>>
>>> What does the work_pending() check give you here?
>>>
>> 'RESETTING' is also entered for firmware download, so we need to check
>> for work_pending() to find out if the error recovery engaged.
> 
> firmware download? what happens if err_work is yet to run when
> this request times out?
> 
But that's fine, as it's queued, and as such will be run eventually.
It'll have a delay until it's completed.

> Not sure what this is protecting against.

At this time, against nothing. It's really a prep patch for moving the
error recovery onto a delayed workqueue.

Cheers,

Hannes



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING
  2023-09-12 12:59         ` Hannes Reinecke
@ 2023-09-12 13:02           ` Sagi Grimberg
  0 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 13:02 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>> firmware download? what happens if err_work is yet to run when
>> this request times out?
>>
> But that's fine, as it's queued, and as such will be run eventually.
> It'll have a delay until it's completed.

But there is a window between the ctrl state RESETTING and the
work being queued, you are fine with failing the commands in
this window?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-12 12:12     ` Sagi Grimberg
@ 2023-09-12 13:25       ` Hannes Reinecke
  2023-09-12 13:43         ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-09-12 13:25 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 9/12/23 14:12, Sagi Grimberg wrote:
> 
>>>> Hi all,
>>>>
>>>> there have been some very insistent reports of data corruption
>>>> with certain target implementations due to command retries.
>>>
>>> None of which were reported on this list...
>>>
>> Correct. I can ask them to post their finding here if that would make 
>> a difference.
>>
>>>> Problem here is that for TCP we're starting error recovery
>>>> immediately after either a command timeout or a (local) link loss.
>>>
>>> It does so only on one occasion: when the user triggered a
>>> reset_controller. A command timeout is greater than the default
>>> kato (6 times in fact); was this the case where the issue was
>>> observed? If so, the timeout handler should probably just wait
>>> for the remaining kato time.
>>>
>> Nothing to do with reset_controller.
>> The problem really is in nvme_tcp_state_change(), which will blindly 
>> start error recovery (and a subsequent command retry) whenever the 
>> local link drops.
> 
> That is incorrect. A link drop does not trigger a socket state change.
>> A state change is triggered for specific socket states
>> (see net/ipv4/tcp.c for a description of socket states).
> 
> These are a set of states that describes local and remote sides
> shutting down (or a TIME_WAIT). The cases where the host
> triggers error recovery without the controller acknowledging it
> (in case the link is down) are FIN_WAIT1/2, and these happen
> when the user triggers a controller reset/disconnect.
> 
> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
> which should indicate the controller acknowledged the shutdown.
> 

Ah, no. It's really neither.
Complaint from the customer was that we are retrying commands without 
waiting for KATO. Actual claim was that we're retrying more-or-less 
immediately.

So, the problem here is not the first command running into a timeout;
the problem really is with the _subsequent_ commands running into a 
timeout (cf my comments to patch 1/3).
The first command will set the status to RESETTING and start error 
recovery, all fine. But all _other_ commands will be aborted directly,
causing them to be retried immediately without any delay.

>>> And that error recovery does _not_ wait for KATO, but rather
>>> causes an immediate retry.
> 
> We can't blindly wait for KATO. If the controller saw the socket
> closure, there is no point waiting kato.
> 
Agreed. So we should mark out FIN_WAIT1 as the one state requiring us
to wait for KATO.
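To make the proposal concrete, a userspace model (the socket state names mirror include/net/tcp_states.h; the policy itself is just what is being proposed in this thread, not merged code):

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of TCP socket states, as in include/net/tcp_states.h. */
enum tcp_state {
	TCP_ESTABLISHED, TCP_FIN_WAIT1, TCP_FIN_WAIT2,
	TCP_CLOSE, TCP_CLOSE_WAIT, TCP_CLOSING, TCP_TIME_WAIT,
};

/*
 * Proposed policy: only FIN_WAIT1 means we sent the shutdown and the
 * controller has not acknowledged it, so only then must error
 * recovery wait out KATO before commands may be retried.
 */
static bool must_wait_kato(enum tcp_state state)
{
	return state == TCP_FIN_WAIT1;
}
```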

> If the user sets KATO to be 3 hours, and does a controller reset,
> what happens now?
> 
See above. I'd be fine with delaying error recovery for FIN_WAIT1 only.

> Only when the host triggers the shutdown, and that is not acknowledged
> by the controller, then yes, the host needs to wait the remaining of
> kato.

I'll be modifying the patchset to updating the description for the first
patch and only delaying error recovery for FIN_WAIT1.

Cheers,

Hannes



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-12 13:25       ` Hannes Reinecke
@ 2023-09-12 13:43         ` Sagi Grimberg
  2023-12-04 10:30           ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-09-12 13:43 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>>> Hi all,
>>>>>
>>>>> there have been some very insistent reports of data corruption
>>>>> with certain target implementations due to command retries.
>>>>
>>>> None of which were reported on this list...
>>>>
>>> Correct. I can ask them to post their finding here if that would make 
>>> a difference.
>>>
>>>>> Problem here is that for TCP we're starting error recovery
>>>>> immediately after either a command timeout or a (local) link loss.
>>>>
>>>> It does so only on one occasion: when the user triggered a
>>>> reset_controller. A command timeout is greater than the default
>>>> kato (6 times in fact); was this the case where the issue was
>>>> observed? If so, the timeout handler should probably just wait
>>>> for the remaining kato time.
>>>>
>>> Nothing to do with reset_controller.
>>> The problem really is in nvme_tcp_state_change(), which will blindly 
>>> start error recovery (and a subsequent command retry) whenever the 
>>> local link drops.
>>
>> That is incorrect. A link drop does not trigger a socket state change.
>> A state change is triggered for specific socket states
>> (see net/ipv4/tcp.c for a description of socket states).
>>
>> These are a set of states that describes local and remote sides
>> shutting down (or a TIME_WAIT). The cases where the host
>> triggers error recovery without the controller acknowledging it
>> (in case the link is down) are FIN_WAIT1/2, and these happen
>> when the user triggers a controller reset/disconnect.
>>
>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>> which should indicate the controller acknowledged the shutdown.
>>
> 
> Ah, no. It's really neither.
> Complaint from the customer was that we are retrying commands without 
> waiting for KATO. Actual claim was that we're retrying more-or-less 
> immediately.
> 
> So, the problem here is not the first command running into a timeout;
> the problem really is with the _subsequent_ commands running into a 
> timeout (cf my comments to patch 1/3).
> The first command will set the status to RESETTING and start error 
> recovery, all fine. But all _other_ commands will be aborted directly,
> causing them to be retried immediately without any delay.

This is still unclear to me.
The default io timeout is 30s, which is 6 times larger than the
default kato. So effectively we are hitting a case where we waited
6 kato periods, and with your patch we are waiting 7 kato periods?

There is a disconnect for me here.

Can you please clarify:
1. what is your kato?
2. what is your io timeout?
3. was the controller RESETTING state triggered from kato? or
    from io timeout?
    if the latter, it indicates that kato is higher than io timeout,
    otherwise the transport is perfectly fine, and this is essentially
    a controller io that is taking longer than the host io timeout?

Do you have some trace that points out how a command is sent down to
the controller (successfully arrives to the controller) the network goes
down, and the io is retried before kato expires?

What is unclear to me here, is how kato expires, but the controller
continues to process commands.

>>> And that error recovery does _not_ wait for KATO, but rather
>>> causes an immediate retry.
>>
>> We can't blindly wait for KATO. If the controller saw the socket
>> closure, there is no point waiting kato.
>>
> Agreed. So we should be marking out FIN_WAIT1 as the one requiring us
> to wait for KATO.

You can't do that, that will wait on FIN_WAIT1 for every queue.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-09-12 13:43         ` Sagi Grimberg
@ 2023-12-04 10:30           ` Sagi Grimberg
  2023-12-04 11:37             ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-12-04 10:30 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>>>>> Hi all,
>>>>>>
>>>>>> there have been some very insistent reports of data corruption
>>>>>> with certain target implementations due to command retries.
>>>>>
>>>>> None of which were reported on this list...
>>>>>
>>>> Correct. I can ask them to post their finding here if that would 
>>>> make a difference.
>>>>
>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>
>>>>> It does so only in one occasion, when the user triggered a
>>>>> reset_controller. a command timeout is greater than the default
>>>>> kato (6 times in fact), was this the case where the issue was
>>>>> observed? If so, the timeout handler should probably just wait
>>>>> the kato remaining time.
>>>>>
>>>> Nothing to do with reset_controller.
>>>> The problem really is in nvme_tcp_state_change(), which will blindly 
>>>> start error recovery (and a subsequent command retry) whenever the 
>>>> local link drops.
>>>
>>> That is incorrect. A link drop does not trigger a socket state change.
>>> a state change is triggered for specific socket states
>>> (see net/ipv4/tcp.c for description of socket states).
>>>
>>> These are a set of states that describes local and remote sides
>>> shutting down (or a TIME_WAIT). The cases where the host
>>> triggers error recovery without the controller acknowledging it
>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>> when the user triggers a controller reset/disconnect.
>>>
>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>> which should indicate the controller acknowledged the shutdown.
>>>
>>
>> Ah, no. It's really neither.
>> Complaint from the customer was that we are retrying commands without 
>> waiting for KATO. Actual claim was that we're retrying more-or-less 
>> immediately.
>>
>> So, the problem here is not the first command running into a timeout;
>> the problem really is with the _subsequent_ commands running into a 
>> timeout (cf my comments to patch 1/3).
>> The first command will set the status to RESETTING and start error 
>> recovery, all fine. But all _other_ commands will be aborted directly,
>> causing them to be retried immediately without any delay.
> 
> This is still unclear to me.
> The default io timeout is 30s, which is 6 times larger than the
> default kato. So effectively we are hitting a case where we waited
> 6 kato periods, and with your patch we are waiting 7 kato periods?
> 
> There is a disconnect for me here.
> 
> Can you please clarify:
> 1. what is your kato?
> 2. what is your io timeout?
> 3. was the controller RESETTING state triggered from kato? or
>     from io timeout?
>     if the latter, it indicates that kato is higher than io timeout,
>     otherwise the transport is perfectly fine, and this is essentially
>     a controller io that is taking longer than the host io timeout?
> 
> Do you have some trace that points out how a command is sent down to
> the controller (successfully arrives to the controller) the network goes
> down, and the io is retried before kato expires?
> 
> What is unclear to me here, is how kato expires, but the controller
> continues to process commands.
> 
>>>> And that error recovery does _not_ wait for KATO, but rather
>>>> causes an immediate retry.
>>>
>>> We can't blindly wait for KATO. If the controller saw the socket
>>> closure, there is no point waiting kato.
>>>
>> Agreed. So we should be marking out FIN_WAIT1 as the one requiring us
>> to wait for KATO.
> 
> You can't do that, that will wait on FIN_WAIT1 for every queue.

Hey Hannes,

Can you comment where this went?

I'd like us to see how to progress to resolve this (assuming this
is still an issue).



* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-12-04 10:30           ` Sagi Grimberg
@ 2023-12-04 11:37             ` Hannes Reinecke
  2023-12-04 13:27               ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-12-04 11:37 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 12/4/23 11:30, Sagi Grimberg wrote:
> 
>>>>>>> Hi all,
>>>>>>>
>>>>>>> there have been some very insistent reports of data corruption
>>>>>>> with certain target implementations due to command retries.
>>>>>>
>>>>>> None of which were reported on this list...
>>>>>>
>>>>> Correct. I can ask them to post their finding here if that would 
>>>>> make a difference.
>>>>>
>>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>>
>>>>>> It does so only in one occasion, when the user triggered a
>>>>>> reset_controller. a command timeout is greater than the default
>>>>>> kato (6 times in fact), was this the case where the issue was
>>>>>> observed? If so, the timeout handler should probably just wait
>>>>>> the kato remaining time.
>>>>>>
>>>>> Nothing to do with reset_controller.
>>>>> The problem really is in nvme_tcp_state_change(), which will 
>>>>> blindly start error recovery (and a subsequent command retry) 
>>>>> whenever the local link drops.
>>>>
>>>> That is incorrect. A link drop does not trigger a socket state change.
>>>> a state change is triggered for specific socket states
>>>> (see net/ipv4/tcp.c for description of socket states).
>>>>
>>>> These are a set of states that describes local and remote sides
>>>> shutting down (or a TIME_WAIT). The cases where the host
>>>> triggers error recovery without the controller acknowledging it
>>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>>> when the user triggers a controller reset/disconnect.
>>>>
>>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>>> which should indicate the controller acknowledged the shutdown.
>>>>
>>>
>>> Ah, no. It's really neither.
>>> Complaint from the customer was that we are retrying commands without 
>>> waiting for KATO. Actual claim was that we're retrying more-or-less 
>>> immediately.
>>>
>>> So, the problem here is not the first command running into a timeout;
>>> the problem really is with the _subsequent_ commands running into a 
>>> timeout (cf my comments to patch 1/3).
>>> The first command will set the status to RESETTING and start error 
>>> recovery, all fine. But all _other_ commands will be aborted directly,
>>> causing them to be retried immediately without any delay.
>>
>> This is still unclear to me.
>> The default io timeout is 30s, which is 6 times larger than the
>> default kato. So effectively we are hitting a case where we waited
>> 6 kato periods, and with your patch we are waiting 7 kato periods?
>>
>> There is a disconnect for me here.
>>
>> Can you please clarify:
>> 1. what is your kato?
>> 2. what is your io timeout?
>> 3. was the controller RESETTING state triggered from kato? or
>>     from io timeout?
>>     if the latter, it indicates that kato is higher than io timeout,
>>     otherwise the transport is perfectly fine, and this is essentially
>>     a controller io that is taking longer than the host io timeout?
>>
>> Do you have some trace that points out how a command is sent down to
>> the controller (successfully arrives to the controller) the network goes
>> down, and the io is retried before kato expires?
>>
>> What is unclear to me here, is how kato expires, but the controller
>> continues to process commands.
>>

Argument here is that there is a race window between the command timeout
triggering (ie the call to nvme_tcp_timeout()) and the scheduling of the
'err_work' workqueue function. If you have a burst of commands all
expiring at roughly the same time, the first command will set the
controller state to 'RESETTING' and schedule the 'err_work' workqueue
function.
The second command will time out _before_ err_work is scheduled,
see the RESETTING state in nvme_tcp_timeout(), and directly
complete and retry (via nvme_tcp_complete_timed_out()) without
waiting for KATO.
That is the issue the customer had been seeing, and that is what is
causing data corruption on their end.

This patchset is an attempt to fix this issue.
We'd be happy to pick up this thread, it just had been pushed back
internally as we put in a band-aid for that customer in our kernels...


Cheers,

Hannes




* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-12-04 11:37             ` Hannes Reinecke
@ 2023-12-04 13:27               ` Sagi Grimberg
  2023-12-04 14:15                 ` Hannes Reinecke
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2023-12-04 13:27 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 12/4/23 13:37, Hannes Reinecke wrote:
> On 12/4/23 11:30, Sagi Grimberg wrote:
>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> there have been some very insistent reports of data corruption
>>>>>>>> with certain target implementations due to command retries.
>>>>>>>
>>>>>>> None of which were reported on this list...
>>>>>>>
>>>>>> Correct. I can ask them to post their finding here if that would 
>>>>>> make a difference.
>>>>>>
>>>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>>>
>>>>>>> It does so only in one occasion, when the user triggered a
>>>>>>> reset_controller. a command timeout is greater than the default
>>>>>>> kato (6 times in fact), was this the case where the issue was
>>>>>>> observed? If so, the timeout handler should probably just wait
>>>>>>> the kato remaining time.
>>>>>>>
>>>>>> Nothing to do with reset_controller.
>>>>>> The problem really is in nvme_tcp_state_change(), which will 
>>>>>> blindly start error recovery (and a subsequent command retry) 
>>>>>> whenever the local link drops.
>>>>>
>>>>> That is incorrect. A link drop does not trigger a socket state change.
>>>>> a state change is triggered for specific socket states
>>>>> (see net/ipv4/tcp.c for description of socket states).
>>>>>
>>>>> These are a set of states that describes local and remote sides
>>>>> shutting down (or a TIME_WAIT). The cases where the host
>>>>> triggers error recovery without the controller acknowledging it
>>>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>>>> when the user triggers a controller reset/disconnect.
>>>>>
>>>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>>>> which should indicate the controller acknowledged the shutdown.
>>>>>
>>>>
>>>> Ah, no. It's really neither.
>>>> Complaint from the customer was that we are retrying commands 
>>>> without waiting for KATO. Actual claim was that we're retrying 
>>>> more-or-less immediately.
>>>>
>>>> So, the problem here is not the first command running into a timeout;
>>>> the problem really is with the _subsequent_ commands running into a 
>>>> timeout (cf my comments to patch 1/3).
>>>> The first command will set the status to RESETTING and start error 
>>>> recovery, all fine. But all _other_ commands will be aborted directly,
>>>> causing them to be retried immediately without any delay.
>>>
>>> This is still unclear to me.
>>> The default io timeout is 30s, which is 6 times larger than the
>>> default kato. So effectively we are hitting a case where we waited
>>> 6 kato periods, and with your patch we are waiting 7 kato periods?
>>>
>>> There is a disconnect for me here.
>>>
>>> Can you please clarify:
>>> 1. what is your kato?
>>> 2. what is your io timeout?
>>> 3. was the controller RESETTING state triggered from kato? or
>>>     from io timeout?
>>>     if the latter, it indicates that kato is higher than io timeout,
>>>     otherwise the transport is perfectly fine, and this is essentially
>>>     a controller io that is taking longer than the host io timeout?
>>>
>>> Do you have some trace that points out how a command is sent down to
>>> the controller (successfully arrives to the controller) the network goes
>>> down, and the io is retried before kato expires?
>>>
>>> What is unclear to me here, is how kato expires, but the controller
>>> continues to process commands.
>>>
> 
> Argument here is that there is a race window between command timeout
> triggering (ie the call to nvme_tcp_timeout()) and scheduling of the
> 'err_work' workqueue function. If you have a burst of commands all
> expiring at roughly the same time the first command will set the 
> controller state to 'RESETTING', and schedule the 'err_work' workqueue
> function.

Yes, that is correct. It is also the keep-alive itself that can
trigger the timeout handler. In fact, it is the more likely one
because it will by default have a lower timeout than IO commands.

This is a part of where I lose the trail here, because I'd expect
the keep-alive to expire far before any of the IO commands, especially
with the latest changes to expedite the keep-alive execution pace.

> The second command will time out _before_ err_work is scheduled,
> will see the RESETTING state in nvme_tcp_timeout(), and directly
> complete and retry (via nvme_tcp_complete_timed_out()) without
> waiting for KATO.

That is something we probably can address by analyzing the controller
state a bit deeper. We've always failed IO commands quickly in the
timeout handler because these commands can be part of the ctrl setup
sequence, which means the error recovery will not run and fail the 
commands on their behalf reliably. I also think there were issues
in the teardown stage (shutdown or disable).

> That is the issue the customer had been seeing, and that is what is
> causing data corruption on their end.
> 
> This patchset is an attempt to fix this issue.

In your patch 3, you take nvme_keep_alive_work_period(), which is
either kato/2 or kato/4. It is not clear to me why these values
were chosen other than "the issue went away".

Also, what happens if kato (which is user configurable) is arbitrarily
long? This means that IO failover can take say 60 minutes? Do you think
its reasonable? This is where I'm concerned about the unconditional
delay.

I think that the idea was that an appropriate delay would be indicated
by the controller, rather than being determined by a user controlled
variable.

Regardless, we should limit this to exactly when such a delay is
actually needed.

Plus, I haven't seen anything for nvme-rdma, and it is unclear to me
why this is specific to the tcp host driver.

> We'd be happy to pick up this thread, it just had been pushed back
> internally as we put in a band-aid for that customer in our kernels...

Well, my assumption is that we want this addressed upstream. Just a
matter of how.



* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-12-04 13:27               ` Sagi Grimberg
@ 2023-12-04 14:15                 ` Hannes Reinecke
  2023-12-04 16:11                   ` Sagi Grimberg
  0 siblings, 1 reply; 20+ messages in thread
From: Hannes Reinecke @ 2023-12-04 14:15 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 12/4/23 14:27, Sagi Grimberg wrote:
> 
> 
> On 12/4/23 13:37, Hannes Reinecke wrote:
>> On 12/4/23 11:30, Sagi Grimberg wrote:
>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> there have been some very insistent reports of data corruption
>>>>>>>>> with certain target implementations due to command retries.
>>>>>>>>
>>>>>>>> None of which were reported on this list...
>>>>>>>>
>>>>>>> Correct. I can ask them to post their finding here if that would 
>>>>>>> make a difference.
>>>>>>>
>>>>>>>>> Problem here is that for TCP we're starting error recovery
>>>>>>>>> immediately after either a command timeout or a (local) link loss.
>>>>>>>>
>>>>>>>> It does so only in one occasion, when the user triggered a
>>>>>>>> reset_controller. a command timeout is greater than the default
>>>>>>>> kato (6 times in fact), was this the case where the issue was
>>>>>>>> observed? If so, the timeout handler should probably just wait
>>>>>>>> the kato remaining time.
>>>>>>>>
>>>>>>> Nothing to do with reset_controller.
>>>>>>> The problem really is in nvme_tcp_state_change(), which will 
>>>>>>> blindly start error recovery (and a subsequent command retry) 
>>>>>>> whenever the local link drops.
>>>>>>
>>>>>> That is incorrect. A link drop does not trigger a socket state 
>>>>>> change.
>>>>>> a state change is triggered for specific socket states
>>>>>> (see net/ipv4/tcp.c for description of socket states).
>>>>>>
>>>>>> These are a set of states that describes local and remote sides
>>>>>> shutting down (or a TIME_WAIT). The cases where the host
>>>>>> triggers error recovery without the controller acknowledging it
>>>>>> (in case the link is down) are FIN_WAIT1/2, and these happen
>>>>>> when the user triggers a controller reset/disconnect.
>>>>>>
>>>>>> In the other circumstances, we'll see a TCP_CLOSE/CLOSE_WAIT/CLOSING
>>>>>> which should indicate the controller acknowledged the shutdown.
>>>>>>
>>>>>
>>>>> Ah, no. It's really neither.
>>>>> Complaint from the customer was that we are retrying commands 
>>>>> without waiting for KATO. Actual claim was that we're retrying 
>>>>> more-or-less immediately.
>>>>>
>>>>> So, the problem here is not the first command running into a timeout;
>>>>> the problem really is with the _subsequent_ commands running into a 
>>>>> timeout (cf my comments to patch 1/3).
>>>>> The first command will set the status to RESETTING and start error 
>>>>> recovery, all fine. But all _other_ commands will be aborted directly,
>>>>> causing them to be retried immediately without any delay.
>>>>
>>>> This is still unclear to me.
>>>> The default io timeout is 30s, which is 6 times larger than the
>>>> default kato. So effectively we are hitting a case where we waited
>>>> 6 kato periods, and with your patch we are waiting 7 kato periods?
>>>>
>>>> There is a disconnect for me here.
>>>>
>>>> Can you please clarify:
>>>> 1. what is your kato?
>>>> 2. what is your io timeout?
>>>> 3. was the controller RESETTING state triggered from kato? or
>>>>     from io timeout?
>>>>     if the latter, it indicates that kato is higher than io timeout,
>>>>     otherwise the transport is perfectly fine, and this is essentially
>>>>     a controller io that is taking longer than the host io timeout?
>>>>
>>>> Do you have some trace that points out how a command is sent down to
>>>> the controller (successfully arrives to the controller) the network 
>>>> goes
>>>> down, and the io is retried before kato expires?
>>>>
>>>> What is unclear to me here, is how kato expires, but the controller
>>>> continues to process commands.
>>>>
>>
>> Argument here is that there is a race window between command timeout
>> triggering (ie the call to nvme_tcp_timeout()) and scheduling of the
>> 'err_work' workqueue function. If you have a burst of commands all
>> expiring at roughly the same time the first command will set the 
>> controller state to 'RESETTING', and schedule the 'err_work' workqueue
>> function.
> 
> Yes, that is correct. It is also the keep-alive itself that can
> trigger the timeout handler. In fact, it is the more likely one
> because it will by default have a lower timeout than IO commands.
> 
> This is a part of where I lose the trail here, because I'd expect
> the keep-alive to expire far before any of the IO commands, especially
> with the latest changes to expedite the keep-alive execution pace.
> 
That's not necessarily the case. We don't have any checks to align 
command timeouts to KATO, and we don't mandate anywhere that command
timeouts need to be larger than KATO.
And even then we can get pathological cases where commands take up
to the timeout for processing, and the last keep-alive was just sent
before that.
So I'd rather prefer to fix the race condition.

>> The second command will time out _before_ err_work is scheduled,
>> will see the RESETTING state in nvme_tcp_timeout(), and directly
>> complete and retry (via nvme_tcp_complete_timed_out()) without
>> waiting for KATO.
> 
> That is something we probably can address by analyzing the controller
> state a bit deeper. We've always failed IO commands quickly in the
> timeout handler because these commands can be part of the ctrl setup
> sequence, which means the error recovery will not run and fail the 
> commands on their behalf reliably. I also think there were issues
> in the teardown stage (shutdown or disable).
> 
Sure. That's what I tried to address with the first patch.

>> That is the issue the customer had been seeing, and that is what is
>> causing data corruption on their end.
>>
>> This patchset is an attempt to fix this issue.
> 
> In your patch 3, you take nvme_keep_alive_work_period(), which is
> either kato/2 or kato/4. It is not clear to me why these values
> were chosen other than "the issue went away".
> 
Yeah, that is wrong. The idea is to ensure that a command timeout 
_always_ starts error recovery only once the KATO period has expired, 
but that an actual KATO timeout schedules the error handling directly 
(ie flushing the scheduled error recovery workqueue item).
That way the KATO timeout is the upper limit, but recovery will be 
started earlier depending on the timing between command and keep-alive 
timeout.

> Also, what happens if kato (which is user configurable) is arbitrarily
> long? This means that IO failover can take say 60 minutes? Do you think
> its reasonable? This is where I'm concerned about the unconditional
> delay.
> 
Yes, unfortunately that's true. But that's what's mandated by the spec.
We could add an option to stick to the original behaviour, though, as 
one could argue that this is what users are expecting now.

> I think that the idea was that an appropriate delay would be indicated
> by the controller, rather than being determined by a user controlled
> variable.
> 
Yep. The standard will be updated to cover that, but we're not there yet.

> Regardless, we should limit this to exactly when such a delay is
> actually needed.
> 
> Plus, I haven't seen anything for nvme-rdma, and it is unclear to me
> why this is specific to the tcp host driver.
> 
That's because the testing had been done on TCP only, and I wanted to
resolve the discussion before posting a full patchset.

>> We'd be happy to pick up this thread, it just had been pushed back
>> internally as we put in a band-aid for that customer in our kernels...
> 
> Well, my assumption is that we want this addressed upstream. Just a
> matter of how.

Agreed.

Cheers,

Hannes




* Re: [PATCH 0/3] nvme-tcp: start error recovery after KATO
  2023-12-04 14:15                 ` Hannes Reinecke
@ 2023-12-04 16:11                   ` Sagi Grimberg
  0 siblings, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2023-12-04 16:11 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme


>>> Argument here is that there is a race window between command timeout
>>> triggering (ie the call to nvme_tcp_timeout()) and scheduling of the
>>> 'err_work' workqueue function. If you have a burst of commands all
>>> expiring at roughly the same time the first command will set the 
>>> controller state to 'RESETTING', and schedule the 'err_work' workqueue
>>> function.
>>
>> Yes, that is correct. It is also the keep-alive itself that can
>> trigger the timeout handler. In fact, it is the more likely one
>> because it will by default have a lower timeout than IO commands.
>>
>> This is a part of where I lose the trail here, because I'd expect
>> the keep-alive to expire far before any of the IO commands, especially
>> with the latest changes to expedite the keep-alive execution pace.
>>
> That's not necessarily the case. We don't have any checks to align 
> command timeouts to KATO, and we don't mandate anywhere that command
> timeouts need to be larger than KATO.

Was this the case, where the io timeout is lower than kato?
What were these values in the reproducer?

> And even then we can get pathological cases where commands take up
> to the timeout for processing, and the last keep-alive was just sent
> before that.
> So I'd rather prefer to fix the race condition.

My assumption was that the connection (transport) was torn down, not
that the controller processing time exceeded the command timeout.

>>> The second command will time out _before_ err_work is scheduled,
>>> will see the RESETTING state in nvme_tcp_timeout(), and directly
>>> complete and retry (via nvme_tcp_complete_timed_out()) without
>>> waiting for KATO.
>>
>> That is something we probably can address by analyzing the controller
>> state a bit deeper. We've always failed IO commands quickly in the
>> timeout handler because these commands can be part of the ctrl setup
>> sequence, which means the error recovery will not run and fail the 
>> commands on their behalf reliably. I also think there were issues
>> in the teardown stage (shutdown or disable).
>>
> Sure. That's what I tried to address with the first patch.

I don't think we need to look at the running state of err_work (looks
racy to me). Perhaps the ctrl state alone can suffice, but we also
need to check that it is an IO command, so that setup/teardown
admin commands can still fail without relying on the
error handler.

But overall I'm fine with this patch (1/3).

>>> That is the issue the customer had been seeing, and that is what is
>>> causing data corruption on their end.
>>>
>>> This patchset is an attempt to fix this issue.
>>
>> In your patch 3, you take nvme_keep_alive_work_period(), which is
>> either kato/2 or kato/4. It is not clear to me why these values
>> were chosen other than "the issue went away".
>>
> Yeah, that is wrong. The idea is to ensure that a command timeout 
> _always_ starts error recovery only once the KATO period has expired, 
> but that an actual KATO timeout schedules the error handling directly 
> (ie flushing the scheduled error recovery workqueue item).
> That way the KATO timeout is the upper limit, but recovery will be 
> started earlier depending on the timing between command and keep-alive 
> timeout.

So the correct patch was to delay the error handler by kato?

>> Also, what happens if kato (which is user configurable) is arbitrarily
>> long? This means that IO failover can take say 60 minutes? Do you think
>> its reasonable? This is where I'm concerned about the unconditional
>> delay.
>>
> Yes, unfortunately that's true. But that's what's mandated by the spec.
> We could add an option to stick to the original behaviour, though, as 
> one could argue that this is what users are expecting now.

The only way kato is an option for adding a delay is if we prevent or
limit what users can set it to. That is a discussion others should
comment on.

>> I think that the idea was that an appropriate delay would be indicated
>> by the controller, rather than being determined by a user controlled
>> variable.
>>
> Yep. The standard will be updated to cover that, but we're not there yet.
> 
>> Regardless, we should limit this to exactly when such a delay is
>> actually needed.
>>
>> Plus, I haven't seen anything for nvme-rdma, and it is unclear to me
>> why this is specific to the tcp host driver.
>>
> That's because the testing had been done on TCP only, and I wanted to
> resolve the discussion before posting a full patchset.

I see.

>>> We'd be happy to pick up this thread, it just had been pushed back
>>> internally as we put in a band-aid for that customer in our kernels...
>>
>> Well, my assumption is that we want this addressed upstream. Just a
>> matter of how.
> 
> Agreed.
> 
> Cheers,
> 
> Hannes
> 



end of thread, other threads:[~2023-12-04 16:12 UTC | newest]

Thread overview: 20+ messages
2023-09-08 10:00 [PATCH 0/3] nvme-tcp: start error recovery after KATO Hannes Reinecke
2023-09-08 10:00 ` [PATCH 1/3] nvme-tcp: Do not terminate commands when in RESETTING Hannes Reinecke
2023-09-12 11:54   ` Sagi Grimberg
2023-09-12 12:06     ` Hannes Reinecke
2023-09-12 12:15       ` Sagi Grimberg
2023-09-12 12:59         ` Hannes Reinecke
2023-09-12 13:02           ` Sagi Grimberg
2023-09-08 10:00 ` [PATCH 2/3] nvme-tcp: make 'err_work' a delayed work Hannes Reinecke
2023-09-08 10:00 ` [PATCH 3/3] nvme-tcp: delay error recovery until the next KATO interval Hannes Reinecke
2023-09-12 12:17   ` Sagi Grimberg
2023-09-12 11:51 ` [PATCH 0/3] nvme-tcp: start error recovery after KATO Sagi Grimberg
2023-09-12 11:56   ` Hannes Reinecke
2023-09-12 12:12     ` Sagi Grimberg
2023-09-12 13:25       ` Hannes Reinecke
2023-09-12 13:43         ` Sagi Grimberg
2023-12-04 10:30           ` Sagi Grimberg
2023-12-04 11:37             ` Hannes Reinecke
2023-12-04 13:27               ` Sagi Grimberg
2023-12-04 14:15                 ` Hannes Reinecke
2023-12-04 16:11                   ` Sagi Grimberg
