* [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall " Hannes Reinecke
@ 2025-03-07 13:28 ` Hannes Reinecke
2025-03-11 18:53 ` Chris Leech
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-03-07 13:28 UTC (permalink / raw)
To: Sagi Grimberg; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, Hannes Reinecke
Validate the request in nvme_tcp_handle_r2t() to ensure it's not
part of any list; otherwise a malicious R2T PDU might inject a
loop into request list processing.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 034edf852878..073c8c3538fd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -441,6 +441,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
}
list_del(&req->entry);
+ init_llist_node(&req->lentry);
return req;
}
@@ -548,6 +549,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
req->queue = queue;
nvme_req(rq)->ctrl = &ctrl->ctrl;
nvme_req(rq)->cmd = &pdu->cmd;
+ init_llist_node(&req->lentry);
+ INIT_LIST_HEAD(&req->entry);
return 0;
}
@@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
nvme_tcp_setup_h2c_data_pdu(req);
+ WARN_ON(queue->request == req);
+ WARN_ON(llist_on_list(&req->lentry));
+ WARN_ON(!list_empty(&req->entry));
llist_add(&req->lentry, &queue->req_list);
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ if (list_empty(&queue->send_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
return 0;
}
@@ -2532,6 +2539,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.offset = 0;
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
+ init_llist_node(&ctrl->async_req.lentry);
+ INIT_LIST_HEAD(&ctrl->async_req.entry);
nvme_tcp_queue_request(&ctrl->async_req, true);
}
--
2.35.3
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-03-11 18:53 ` Chris Leech
2025-03-12 8:16 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Chris Leech @ 2025-03-11 18:53 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme
On Fri, Mar 07, 2025 at 02:28:01PM +0100, Hannes Reinecke wrote:
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
...
> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>
> nvme_tcp_setup_h2c_data_pdu(req);
>
> + WARN_ON(queue->request == req);
> + WARN_ON(llist_on_list(&req->lentry));
> + WARN_ON(!list_empty(&req->entry));
> llist_add(&req->lentry, &queue->req_list);
Are we happy with a WARN here, or should this be handled as an error?
The idea of duplicate R2Ts creating a loop in req_list is
frightening.
- Chris
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-03-11 18:53 ` Chris Leech
@ 2025-03-12 8:16 ` Hannes Reinecke
2025-03-12 12:27 ` Maurizio Lombardi
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-03-12 8:16 UTC (permalink / raw)
To: Chris Leech, Hannes Reinecke
Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme
On 3/11/25 19:53, Chris Leech wrote:
> On Fri, Mar 07, 2025 at 02:28:01PM +0100, Hannes Reinecke wrote:
>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>> part of any list, otherwise a malicious R2T PDU might inject a
>> loop in request list processing.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>
> ...
>
>> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>
>> nvme_tcp_setup_h2c_data_pdu(req);
>>
>> + WARN_ON(queue->request == req);
>> + WARN_ON(llist_on_list(&req->lentry));
>> + WARN_ON(!list_empty(&req->entry));
>> llist_add(&req->lentry, &queue->req_list);
>
> Are we happy with a WARN here, or should this be handled as an error?
> The idea of an duplicate R2Ts creating a loop in req_list is
> frightening.
>
It is, but I wanted to check if others see this as an issue, too.
We actually should bail out and reset the connection; BUG_ON()
is pretty harsh, and not really appropriate as this isn't an error
on our side.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-03-12 8:16 ` Hannes Reinecke
@ 2025-03-12 12:27 ` Maurizio Lombardi
0 siblings, 0 replies; 48+ messages in thread
From: Maurizio Lombardi @ 2025-03-12 12:27 UTC (permalink / raw)
To: Hannes Reinecke, Chris Leech, Hannes Reinecke
Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme
On Wed Mar 12, 2025 at 9:16 AM CET, Hannes Reinecke wrote:
> On 3/11/25 19:53, Chris Leech wrote:
>>> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>>
>>> nvme_tcp_setup_h2c_data_pdu(req);
>>>
>>> + WARN_ON(queue->request == req);
>>> + WARN_ON(llist_on_list(&req->lentry));
>>> + WARN_ON(!list_empty(&req->entry));
>>> llist_add(&req->lentry, &queue->req_list);
>>
>> Are we happy with a WARN here, or should this be handled as an error?
>> The idea of an duplicate R2Ts creating a loop in req_list is
>> frightening.
>>
> It is, but wanted to check if others see this as an issue, too.
> We actually should bail out and reset the connection; BUG_ON()
> is pretty harsh, and not really appropriate as this isn't an error
> on our side.
So a malicious or buggy target could trigger WARNs host-side? Did I
understand that correctly?
This doesn't sound OK to me; I think the host should print an error
message and reset the connection.
Maurizio
* [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets
@ 2025-04-03 6:55 Hannes Reinecke
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
` (2 more replies)
0 siblings, 3 replies; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-03 6:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Hi all,
I have been chasing keep-alive timeouts with TLS enabled over the last few
days (weeks, even :-( ). On larger setups (e.g. with 32 queues) the connection
never got established properly, as I've been hitting keep-alive timeouts before
the last queue got connected.
It turns out that occasionally we simply do not send the keep-alive request; it's
been added to the request list, but the io_work workqueue function is never
restarted as it bails out after nvme_tcp_try_recv() returns -EAGAIN.
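For reference, the receive-side handling in nvme_tcp_io_work() before this
series is roughly (trimmed to the relevant lines, see the diffs below):

    result = nvme_tcp_try_recv(queue);
    if (result > 0)
            pending = true;
    else if (unlikely(result < 0))
            return;

    if (!pending || !queue->rd_enabled)
            return;

so a -EAGAIN from the receive side aborts io_work even when a request (such
as the keep-alive) is still queued on the driver side waiting to be sent.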
During debugging I also found that we're quite lazy with the list
handling of requests, so I've added two preliminary patches to ensure
that all list elements are properly terminated.
As usual, comments and reviews are welcome.
Changes to v2:
- Removed AEN patches again
Changes to the original submission:
- Include reviews from Chris Leech
- Add patch to requeue namespace scan
- Add patch to re-read ANA log page
Hannes Reinecke (3):
nvme-tcp: open-code nvme_tcp_queue_request() for R2T
nvme-tcp: sanitize request list handling
nvme-tcp: fix I/O stalls on congested sockets
drivers/nvme/host/tcp.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
--
2.35.3
* [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T
2025-04-03 6:55 [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
@ 2025-04-03 6:55 ` Hannes Reinecke
2025-04-13 22:56 ` Sagi Grimberg
2025-04-22 8:09 ` Christoph Hellwig
2025-04-03 6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
2025-04-03 6:55 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
2 siblings, 2 replies; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-03 6:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
When handling an R2T PDU we short-circuit nvme_tcp_queue_request()
as we should not attempt to send consecutive PDUs. So open-code
nvme_tcp_queue_request() for R2T and drop the now-redundant 'sync' argument.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 327f3f2f5399..b2f2aef2e2f2 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -404,7 +404,7 @@ static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
}
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
- bool sync, bool last)
+ bool last)
{
struct nvme_tcp_queue *queue = req->queue;
bool empty;
@@ -418,7 +418,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
* are on the same cpu, so we don't introduce contention.
*/
if (queue->io_cpu == raw_smp_processor_id() &&
- sync && empty && mutex_trylock(&queue->send_mutex)) {
+ empty && mutex_trylock(&queue->send_mutex)) {
nvme_tcp_send_all(queue);
mutex_unlock(&queue->send_mutex);
}
@@ -771,7 +771,9 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
req->ttag = pdu->ttag;
nvme_tcp_setup_h2c_data_pdu(req);
- nvme_tcp_queue_request(req, false, true);
+
+ llist_add(&req->lentry, &queue->req_list);
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
return 0;
}
@@ -2557,7 +2559,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
- nvme_tcp_queue_request(&ctrl->async_req, true, true);
+ nvme_tcp_queue_request(&ctrl->async_req, true);
}
static void nvme_tcp_complete_timed_out(struct request *rq)
@@ -2709,7 +2711,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_start_request(rq);
- nvme_tcp_queue_request(req, true, bd->last);
+ nvme_tcp_queue_request(req, bd->last);
return BLK_STS_OK;
}
--
2.35.3
* [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-04-03 6:55 [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
@ 2025-04-03 6:55 ` Hannes Reinecke
2025-04-13 23:00 ` Sagi Grimberg
2025-04-03 6:55 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
2 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-03 6:55 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke
Validate the request in nvme_tcp_handle_r2t() to ensure it's not
part of any list; otherwise a malicious R2T PDU might inject a
loop into request list processing.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b2f2aef2e2f2..1a319cb86453 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -454,6 +454,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
}
list_del(&req->entry);
+ init_llist_node(&req->lentry);
return req;
}
@@ -561,6 +562,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
req->queue = queue;
nvme_req(rq)->ctrl = &ctrl->ctrl;
nvme_req(rq)->cmd = &pdu->cmd;
+ init_llist_node(&req->lentry);
+ INIT_LIST_HEAD(&req->entry);
return 0;
}
@@ -765,6 +768,15 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
return -EPROTO;
}
+ if (queue->request == req ||
+ llist_on_list(&req->lentry) ||
+ !list_empty(&req->entry)) {
+ dev_err(queue->ctrl->ctrl.device,
+ "req %d unexpected r2t while processing request\n",
+ rq->tag);
+ return -EPROTO;
+ }
+
req->pdu_len = 0;
req->h2cdata_left = r2t_length;
req->h2cdata_offset = r2t_offset;
@@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
nvme_tcp_setup_h2c_data_pdu(req);
llist_add(&req->lentry, &queue->req_list);
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ if (list_empty(&queue->send_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
return 0;
}
@@ -2558,6 +2571,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.offset = 0;
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
+ init_llist_node(&ctrl->async_req.lentry);
+ INIT_LIST_HEAD(&ctrl->async_req.entry);
nvme_tcp_queue_request(&ctrl->async_req, true);
}
--
2.35.3
* [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-03 6:55 [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
2025-04-03 6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-04-03 6:55 ` Hannes Reinecke
2025-04-13 23:09 ` Sagi Grimberg
2025-04-15 7:07 ` Hannes Reinecke
2 siblings, 2 replies; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-03 6:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke,
Chris Leech
When the socket is busy processing, nvme_tcp_try_recv() might
return -EAGAIN, but this doesn't automatically imply that
the sending side is blocked, too.
So check if there are pending requests once nvme_tcp_try_recv()
returns -EAGAIN and continue with the sending loop to avoid
I/O stalls.
Acked-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
drivers/nvme/host/tcp.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1a319cb86453..87f1d7a4ea06 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
result = nvme_tcp_try_recv(queue);
if (result > 0)
pending = true;
- else if (unlikely(result < 0))
+ else if (unlikely(result < 0) && result != -EAGAIN)
return;
+ if (nvme_tcp_queue_has_pending(queue))
+ pending = true;
+
if (!pending || !queue->rd_enabled)
return;
--
2.35.3
* Re: [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
@ 2025-04-13 22:56 ` Sagi Grimberg
2025-04-22 8:09 ` Christoph Hellwig
1 sibling, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-13 22:56 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
This is better indeed,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-04-03 6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-04-13 23:00 ` Sagi Grimberg
2025-04-14 12:35 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-13 23:00 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 03/04/2025 9:55, Hannes Reinecke wrote:
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.
It's not clear what you mean by "malicious R2T PDU".
Can you please clarify what you have seen/observed in the commit msg
(i.e. what led you to this patch)?
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index b2f2aef2e2f2..1a319cb86453 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -454,6 +454,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
> }
>
> list_del(&req->entry);
> + init_llist_node(&req->lentry);
> return req;
> }
>
> @@ -561,6 +562,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
> req->queue = queue;
> nvme_req(rq)->ctrl = &ctrl->ctrl;
> nvme_req(rq)->cmd = &pdu->cmd;
> + init_llist_node(&req->lentry);
> + INIT_LIST_HEAD(&req->entry);
>
> return 0;
> }
> @@ -765,6 +768,15 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
> return -EPROTO;
> }
>
> + if (queue->request == req ||
> + llist_on_list(&req->lentry) ||
> + !list_empty(&req->entry)) {
> + dev_err(queue->ctrl->ctrl.device,
> + "req %d unexpected r2t while processing request\n",
> + rq->tag);
> + return -EPROTO;
> + }
> +
> req->pdu_len = 0;
> req->h2cdata_left = r2t_length;
> req->h2cdata_offset = r2t_offset;
> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
> nvme_tcp_setup_h2c_data_pdu(req);
>
> llist_add(&req->lentry, &queue->req_list);
> - queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> + if (list_empty(&queue->send_list))
> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
Is this change mandatory? It looks out of place.
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-03 6:55 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
@ 2025-04-13 23:09 ` Sagi Grimberg
2025-04-14 6:21 ` Hannes Reinecke
2025-04-15 7:07 ` Hannes Reinecke
1 sibling, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-13 23:09 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme, Chris Leech
On 03/04/2025 9:55, Hannes Reinecke wrote:
> When the socket is busy processing nvme_tcp_try_recv() might
> return -EAGAIN, but this doesn't automatically imply that
> the sending side is blocked, too.
> So check if there are pending requests once nvme_tcp_try_recv()
> returns -EAGAIN and continue with the sending loop to avoid
> I/O stalls.
>
> Acked-by: Chris Leech <cleech@redhat.com>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 1a319cb86453..87f1d7a4ea06 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
> result = nvme_tcp_try_recv(queue);
> if (result > 0)
> pending = true;
> - else if (unlikely(result < 0))
> + else if (unlikely(result < 0) && result != -EAGAIN)
> return;
The way the send path was done is that EAGAIN returns 0 (success
returns >0, failure returns <0).
Perhaps we can make recv do the same?
>
> + if (nvme_tcp_queue_has_pending(queue))
> + pending = true;
> +
Something is not clear to me: this suggests that try_send was not able to
send data on the socket. Shouldn't a .write_space() callback wake you when
the socket send buffer gets some space? Why do you immediately try more
even if your sendmsg is returning EAGAIN?
This is specific to TLS, I assume?
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-13 23:09 ` Sagi Grimberg
@ 2025-04-14 6:21 ` Hannes Reinecke
2025-04-14 20:24 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-14 6:21 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 4/14/25 01:09, Sagi Grimberg wrote:
>
>
> On 03/04/2025 9:55, Hannes Reinecke wrote:
>> When the socket is busy processing nvme_tcp_try_recv() might
>> return -EAGAIN, but this doesn't automatically imply that
>> the sending side is blocked, too.
>> So check if there are pending requests once nvme_tcp_try_recv()
>> returns -EAGAIN and continue with the sending loop to avoid
>> I/O stalls.
>>
>> Acked-by: Chris Leech <cleech@redhat.com>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 1a319cb86453..87f1d7a4ea06 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct
>> *w)
>> result = nvme_tcp_try_recv(queue);
>> if (result > 0)
>> pending = true;
>> - else if (unlikely(result < 0))
>> + else if (unlikely(result < 0) && result != -EAGAIN)
>> return;
>
> The way that the send path was done - is that EAGAIN returns 0 (success
> returns >0, failure returns <0)
> Perhaps we can make recv do the same?
>
I guess we can.
>> + if (nvme_tcp_queue_has_pending(queue))
>> + pending = true;
>> +
>
> Something is not clear to me, this suggest that try_send was not able to
> send data on the socket, shouldn't a .write_space() callback wake you when
> the socket send buffer gets some space? Why do you immediately try
> more even if you're sendmsg is returning EAGAIN?
>
But that's precisely the point: sendmsg() did _not_ return -EAGAIN.
recvmsg() did.
We just assume that nvme_tcp_try_send() will return -EAGAIN once
nvme_tcp_try_recv() did.
That is wrong; -EAGAIN on reception can be due to a number of factors,
and does _not_ imply in any shape or form that sending will exhibit
the same error.
> This is specific to TLS I assume here?
Not necessarily; I've seen it with my performance testing on 10GigE
with normal TCP, too. TLS is just an easier way to reproduce it.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-04-13 23:00 ` Sagi Grimberg
@ 2025-04-14 12:35 ` Hannes Reinecke
2025-04-14 20:29 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-14 12:35 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme
On 4/14/25 01:00, Sagi Grimberg wrote:
>
>
> On 03/04/2025 9:55, Hannes Reinecke wrote:
>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>> part of any list, otherwise a malicious R2T PDU might inject a
>> loop in request list processing.
>
> Not clear what do you mean by "malicious R2T PDU".
> Can you please clarify what you have seen/observed in the commit msg
> (i.e. what led you to this patch)?
>
This is coming from code inspection only.
In nvme_tcp_handle_r2t() we look up a request by the 'command_id'
value in the PDU and then add it to 'queue->req_list' without further
checking.
So if a malicious R2T packet is received containing the command_id of
a command currently on 'queue->req_list', we'll end up with duplicate
entries on the list.
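To illustrate the failure mode, here is a minimal userspace sketch (not
driver code; it mimics llist_add() semantics without the cmpxchg loop):

    #include <stdio.h>

    struct llist_node { struct llist_node *next; };
    struct llist_head { struct llist_node *first; };

    static void llist_add(struct llist_node *new, struct llist_head *head)
    {
            new->next = head->first;        /* kernel does this via cmpxchg */
            head->first = new;
    }

    int main(void)
    {
            struct llist_head req_list = { NULL };
            struct llist_node lentry = { NULL };

            llist_add(&lentry, &req_list);  /* legitimate queueing of the request */
            llist_add(&lentry, &req_list);  /* duplicate R2T queues it again */

            /*
             * lentry.next now points at lentry itself, so any walk of
             * req_list (e.g. in nvme_tcp_fetch_request()) spins forever.
             */
            printf("self-loop: %s\n", lentry.next == &lentry ? "yes" : "no");
            return 0;
    }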
[ .. ]
>> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct
>> nvme_tcp_queue *queue,
>> nvme_tcp_setup_h2c_data_pdu(req);
>> llist_add(&req->lentry, &queue->req_list);
>> - queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> + if (list_empty(&queue->send_list))
>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>
> Is this change mandatory? looks out of place.
>
Arguably an optimisation. I can leave it out.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-14 6:21 ` Hannes Reinecke
@ 2025-04-14 20:24 ` Sagi Grimberg
0 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-14 20:24 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 14/04/2025 9:21, Hannes Reinecke wrote:
> On 4/14/25 01:09, Sagi Grimberg wrote:
>>
>>
>> On 03/04/2025 9:55, Hannes Reinecke wrote:
>>> When the socket is busy processing nvme_tcp_try_recv() might
>>> return -EAGAIN, but this doesn't automatically imply that
>>> the sending side is blocked, too.
>>> So check if there are pending requests once nvme_tcp_try_recv()
>>> returns -EAGAIN and continue with the sending loop to avoid
>>> I/O stalls.
>>>
>>> Acked-by: Chris Leech <cleech@redhat.com>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 1a319cb86453..87f1d7a4ea06 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>> work_struct *w)
>>> result = nvme_tcp_try_recv(queue);
>>> if (result > 0)
>>> pending = true;
>>> - else if (unlikely(result < 0))
>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>> return;
>>
>> The way that the send path was done - is that EAGAIN returns 0
>> (success returns >0, failure returns <0)
>> Perhaps we can make recv do the same?
>>
> I guess we can.
>
>>> + if (nvme_tcp_queue_has_pending(queue))
>>> + pending = true;
>>> +
>>
>> Something is not clear to me, this suggest that try_send was not able
>> to send data on the socket, shouldn't a .write_space() callback wake
>> you when
> > the socket send buffer gets some space? Why do you immediately try
> > more even if you're sendmsg is returning EAGAIN?
>>
> But that's precisely the point: sendmsg() did _not_ return -EAGAIN.
> recvmsg() did.
What did try_send return? Why didn't it set pending?
>
>
> We just imply that nvme_tcp_try_send() will return -EAGAIN once
> nvme_tcp_try_recv() did.
>
> Which is wrong; -EAGAIN on reception can be due to a number of factors,
> and does _not_ imply in any shape or form that sending will exhibit
> the same error.
I get that when recv returns EAGAIN, we should not return from io_work().
The second check is not clear to me. If try_recv got stuff, we set
pending=1; if try_send was able to send, we set pending=1. They should be
independent. Why is this check at the end of the loop correct?
It can be that the queue has pending requests but there is no room in
the socket send buffer; otherwise, it would have set pending=1, no?
I am missing something.
* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
2025-04-14 12:35 ` Hannes Reinecke
@ 2025-04-14 20:29 ` Sagi Grimberg
0 siblings, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-14 20:29 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme
On 14/04/2025 15:35, Hannes Reinecke wrote:
> On 4/14/25 01:00, Sagi Grimberg wrote:
>>
>>
>> On 03/04/2025 9:55, Hannes Reinecke wrote:
>>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>>> part of any list, otherwise a malicious R2T PDU might inject a
>>> loop in request list processing.
>>
>> Not clear what do you mean by "malicious R2T PDU".
>> Can you please clarify what you have seen/observed in the commit msg
>> (i.e. what led you to this patch)?
>>
> This is coming from code inspection only.
> In nvme_tcp_handle_r2t() we are looking up a request by the 'command_id'
> value in the pdu, and then add it to 'queue->req_list' without further
> checking.
> So if a malicious R2T packet is received containing the command_id of
> a command currently on 'queue->req_list' we'll end up with duplicate
> entries on the list.
I think this falls into a bucket of a bunch of checks that the host does
not do. But OK.
>
> [ .. ]
>
>>> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct
>>> nvme_tcp_queue *queue,
>>> nvme_tcp_setup_h2c_data_pdu(req);
>>> llist_add(&req->lentry, &queue->req_list);
>>> - queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> + if (list_empty(&queue->send_list))
>>> + queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>
>> Is this change mandatory? looks out of place.
>>
> Arguably an optimisation. I can leave it out.
When it comes in this context it is not clear what it is designed to do.
I don't mind having it, but let's separate it from this patch.
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-03 6:55 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
2025-04-13 23:09 ` Sagi Grimberg
@ 2025-04-15 7:07 ` Hannes Reinecke
2025-04-15 21:35 ` Sagi Grimberg
1 sibling, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-15 7:07 UTC (permalink / raw)
To: Hannes Reinecke, Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, linux-nvme, Chris Leech
On 4/3/25 08:55, Hannes Reinecke wrote:
> When the socket is busy processing nvme_tcp_try_recv() might
> return -EAGAIN, but this doesn't automatically imply that
> the sending side is blocked, too.
> So check if there are pending requests once nvme_tcp_try_recv()
> returns -EAGAIN and continue with the sending loop to avoid
> I/O stalls.
>
> Acked-by: Chris Leech <cleech@redhat.com>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
> drivers/nvme/host/tcp.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 1a319cb86453..87f1d7a4ea06 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
> result = nvme_tcp_try_recv(queue);
> if (result > 0)
> pending = true;
> - else if (unlikely(result < 0))
> + else if (unlikely(result < 0) && result != -EAGAIN)
> return;
>
> + if (nvme_tcp_queue_has_pending(queue))
> + pending = true;
> +
> if (!pending || !queue->rd_enabled)
> return;
>
The various 'try_send' functions will return -EAGAIN for a partial send.
But that doesn't indicate a blocked Tx; rather, we should retry directly.
Hence this check.
Unless you tell me differently and even a partial send will cause
->write_space() to be invoked, in which case we wouldn't _need_ it. It would
still be an optimisation, as we're saving the round-trip via socket
callbacks.
We could aim for a different error here, to differentiate between a
'real' EAGAIN and a partial send.
Whatever you prefer.
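Purely as an illustration of that last idea (all names made up, not a
concrete proposal): the send helpers could return a tri-state instead of
overloading -EAGAIN, so io_work can retry a partial send immediately but
back off and wait for ->write_space() when the socket is genuinely full:

    /* hypothetical sketch only */
    enum nvme_tcp_send_result {
            NVME_TCP_SEND_DONE,     /* request fully sent */
            NVME_TCP_SEND_PARTIAL,  /* made progress; retry in this io_work pass */
            NVME_TCP_SEND_FULL,     /* no socket space; wait for ->write_space() */
    };

io_work would then set 'pending' only for _PARTIAL and rely on the
->write_space() callback to requeue the work for _FULL.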
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-15 7:07 ` Hannes Reinecke
@ 2025-04-15 21:35 ` Sagi Grimberg
2025-04-16 7:56 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-15 21:35 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 15/04/2025 10:07, Hannes Reinecke wrote:
> On 4/3/25 08:55, Hannes Reinecke wrote:
>> When the socket is busy processing nvme_tcp_try_recv() might
>> return -EAGAIN, but this doesn't automatically imply that
>> the sending side is blocked, too.
>> So check if there are pending requests once nvme_tcp_try_recv()
>> returns -EAGAIN and continue with the sending loop to avoid
>> I/O stalls.
>>
>> Acked-by: Chris Leech <cleech@redhat.com>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>> drivers/nvme/host/tcp.c | 5 ++++-
>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 1a319cb86453..87f1d7a4ea06 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>> work_struct *w)
>> result = nvme_tcp_try_recv(queue);
>> if (result > 0)
>> pending = true;
>> - else if (unlikely(result < 0))
>> + else if (unlikely(result < 0) && result != -EAGAIN)
>> return;
>> + if (nvme_tcp_queue_has_pending(queue))
>> + pending = true;
>> +
>> if (!pending || !queue->rd_enabled)
>> return;
>
> The various 'try_send' function will return -EAGAIN for a partial send.
> But it doesn't indicate a blocked Tx, rather we should retry directly.
> Hence this check.
>
> Unless you tell me differently and even a partial send will cause
> ->write_space() to be invoked, then we wouldn't _need_ it.
Umm, that is my understanding. If you tried to send X and were able to
send Y where Y < X, you shouldn't have to keep trying in a busy loop;
the stack should tell you when you can send again.
> It would
> still be an optimisation as we're saving the round-trip via socket
> callbacks.
But you are doing a busy loop on a socket that cannot accept new data;
there are other sockets that the kthread can be working on.
>
> We could aim for a different error here, to differentiate between a
> 'real' EAGAIN and a partial send.
> Whatever you prefer.
I still don't understand why a partial send warrants a busy loop call to
sock_sendmsg...
My assumption is that the call right after the partial send will see an
EAGAIN error. But I may be missing something here... I just never expected
that a partial write means that we must busy loop sending to the socket.
What does a blocking sendmsg do under the hood? Does it also follow this
practice?
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-15 21:35 ` Sagi Grimberg
@ 2025-04-16 7:56 ` Hannes Reinecke
2025-04-16 22:09 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-16 7:56 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 4/15/25 23:35, Sagi Grimberg wrote:
>
>
> On 15/04/2025 10:07, Hannes Reinecke wrote:
>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>> When the socket is busy processing nvme_tcp_try_recv() might
>>> return -EAGAIN, but this doesn't automatically imply that
>>> the sending side is blocked, too.
>>> So check if there are pending requests once nvme_tcp_try_recv()
>>> returns -EAGAIN and continue with the sending loop to avoid
>>> I/O stalls.
>>>
>>> Acked-by: Chris Leech <cleech@redhat.com>
>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 1a319cb86453..87f1d7a4ea06 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>> work_struct *w)
>>> result = nvme_tcp_try_recv(queue);
>>> if (result > 0)
>>> pending = true;
>>> - else if (unlikely(result < 0))
>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>> return;
>>> + if (nvme_tcp_queue_has_pending(queue))
>>> + pending = true;
>>> +
>>> if (!pending || !queue->rd_enabled)
>>> return;
>>
>> The various 'try_send' function will return -EAGAIN for a partial send.
>> But it doesn't indicate a blocked Tx, rather we should retry directly.
>> Hence this check.
>>
>> Unless you tell me differently and even a partial send will cause
>> ->write_space() to be invoked, then we wouldn't _need_ it.
>
> Umm, that is my understanding. If you tried to send X and were able to
> send Y where Y < X, you shouldn't have to keep trying in a busy loop,
> the stack should tell you when you can send again.
>
Okay, I could try that.
>> It would
>> still be an optimisation as we're saving the round-trip via socket
>> callbacks.
>
> But you are doing a busy loop on a socket that cannot accept new data,
> there are other sockets that the kthread can be working on.
>
But we might be _sending_ data, right?
>>
>> We could aim for a different error here, to differentiate between a
>> 'real' EAGAIN and a partial send.
>> Whatever you prefer.
>
> I still don't understand why a partial send warrants a busy loop call to
> sock_sendmsg...
>
The thing is, it's not just sendmsg; it's the combination of send and recv.
In my tests I have seen sendmsg return with a partial/incomplete status;
consequently recvmsg has nothing to receive, and io_work stops.
And that is what the patch fixes.
I haven't checked the ->write_space() callback (which should've been
triggered), but my feeling is that the ->write_space() callback
hit when we were still busy processing, so the queue_work() went
nowhere.
Maybe we can fix it with setting a flag when ->write_space()
triggers ('hey, more data to process'), to be evaluated during
the io_work() loop. But that would be pretty close to the
check 'nvme_tcp_queue_has_pending', so I'm not sure if we
gain anything.
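(For context, and going from memory of the driver, so please double-check
against the source: the helper only looks at the driver-side lists, not at
socket space, roughly:

    static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
    {
            return !list_empty(&queue->send_list) ||
                    !llist_empty(&queue->req_list);
    }

so it can return true even when there is no room in the socket send buffer.)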
And I really haven't seen any detrimental performance effects
with this patch; on the contrary, performance was on par,
and if anything the standard deviation went down.
> My assumption is that the call right after the partial send, will see
> EAGAIN error. But I may
> be missing something here... I just never expected that a partial write
> means that we must busy loop sending to the socket.
>
Well ... we do it in nvme_tcp_try_send_data().
> What does a blocking sendmsg do under the hood? does it also follow this
> practice?
How would I know? But I haven't seen any negative effects
during performance testing with this patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-16 7:56 ` Hannes Reinecke
@ 2025-04-16 22:09 ` Sagi Grimberg
2025-04-17 23:03 ` Kamaljit Singh
2025-04-24 11:26 ` Hannes Reinecke
0 siblings, 2 replies; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-16 22:09 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 16/04/2025 10:56, Hannes Reinecke wrote:
> On 4/15/25 23:35, Sagi Grimberg wrote:
>>
>>
>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>> When the socket is busy processing nvme_tcp_try_recv() might
>>>> return -EAGAIN, but this doesn't automatically imply that
>>>> the sending side is blocked, too.
>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>> I/O stalls.
>>>>
>>>> Acked-by: Chris Leech <cleech@redhat.com>
>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>> ---
>>>> drivers/nvme/host/tcp.c | 5 ++++-
>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>>> work_struct *w)
>>>> result = nvme_tcp_try_recv(queue);
>>>> if (result > 0)
>>>> pending = true;
>>>> - else if (unlikely(result < 0))
>>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>>> return;
>>>> + if (nvme_tcp_queue_has_pending(queue))
>>>> + pending = true;
>>>> +
>>>> if (!pending || !queue->rd_enabled)
>>>> return;
>>>
>>> The various 'try_send' function will return -EAGAIN for a partial send.
>>> But it doesn't indicate a blocked Tx, rather we should retry directly.
>>> Hence this check.
>>>
>>> Unless you tell me differently and even a partial send will cause
>>> ->write_space() to be invoked, then we wouldn't _need_ it.
>>
>> Umm, that is my understanding. If you tried to send X and were able to
>> send Y where Y < X, you shouldn't have to keep trying in a busy loop,
>> the stack should tell you when you can send again.
>>
> Okay, I could try that.
>
>>> It would
>>> still be an optimisation as we're saving the round-trip via socket
>>> callbacks.
>>
>> But you are doing a busy loop on a socket that cannot accept new
>> data, there are other sockets that the kthread can be working on.
>>
> But we might be _sending_ data, right?
I'd say that odds are you're not in the next attempt...
>
>>>
>>> We could aim for a different error here, to differentiate between a
>>> 'real' EAGAIN and a partial send.
>>> Whatever you prefer.
>>
>> I still don't understand why a partial send warrants a busy loop call
>> to sock_sendmsg...
>>
> This is, it's not just sendmsg. It's the combination of send and recv.
I agree with you that the recv fix is correct.
> In my tests I have seen sendmsg return with a partial/incomplete status,
> consequently recvmsg has nothing to receive, and io_work stops.
> And that is what the patch fixes.
The question is whether the recv fix is sufficient.
>
> I haven't checked the ->write_space() callback (which should've been
> triggered), but my feeling is that the ->write_space() callback
> hit when we were still busy processing, so the queue_work() went
> nowhere.
That makes sense.
>
> Maybe we can fix it with setting a flag when ->write_space()
> triggers ('hey, more data to process'), to be evaluated during
> the io_work() loop. But that would be pretty close to the
> check 'nvme_tcp_queue_has_pending', so I'm not sure if we
> gain anything.
How about checking sk_stream_is_writeable() instead?
I think we also need to flag the queue for the write_space during IO work...
>
> And I really haven't seen any detrimental performance effects
> with this patch; in the contrary, performance was on par,
> and if anything standard deviation went down.
Still, you agree that busy-looping on a socket that may not have space is
less efficient? When there are other queues waiting to execute io_work?
I agree there is a problem here, but I am trying to figure out what we
want to do here.
How about these two (untested) patches:
[1 based on your recv-side fix]:
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 72d260201d8c..4eb9a2dec07e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
queue->nr_cqe = 0;
consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
release_sock(sk);
- return consumed;
+ return consumed == -EAGAIN ? 0 : consumed;
}
static void nvme_tcp_io_work(struct work_struct *w)
--
[2 based on your partial write fix]:
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4eb9a2dec07e..daf59e75cf15 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
NVME_TCP_Q_LIVE = 1,
NVME_TCP_Q_POLLING = 2,
NVME_TCP_Q_IO_CPU_SET = 3,
+ NVME_TCP_Q_WAKE_SENDER = 4,
};
enum nvme_tcp_recv_state {
@@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
queue = sk->sk_user_data;
if (likely(queue && sk_stream_is_writeable(sk))) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+ set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
read_unlock_bh(&sk->sk_callback_lock);
@@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
container_of(w, struct nvme_tcp_queue, io_work);
unsigned long deadline = jiffies + msecs_to_jiffies(1);
+ clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
do {
bool pending = false;
int result;
@@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0))
return;
- if (!pending || !queue->rd_enabled)
+ /* did we get some space after spending time in recv ? */
+ if (nvme_tcp_queue_has_pending(queue) &&
+ sk_stream_is_writeable(queue->sock->sk))
+ pending = true;
+
+ if (!queue->rd_enabled)
+ return;
+
+ if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
return;
} while (!time_after(jiffies, deadline)); /* quota is exhausted */
--
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-16 22:09 ` Sagi Grimberg
@ 2025-04-17 23:03 ` Kamaljit Singh
2025-04-17 23:06 ` Kamaljit Singh
2025-04-18 10:51 ` Sagi Grimberg
2025-04-24 11:26 ` Hannes Reinecke
1 sibling, 2 replies; 48+ messages in thread
From: Kamaljit Singh @ 2025-04-17 23:03 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Sagi,
I tried both of these patches, but it looks like #1 causes an infinite loop. dmesg was full of panics.
I had tried just #1 and later with #1+#2. Both failed the same way.
>How about these two (untested) patches:
>[1 based on your recv-side fix]:
>--diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>index 72d260201d8c..4eb9a2dec07e 100644
>--- a/drivers/nvme/host/tcp.c
>+++ b/drivers/nvme/host/tcp.c
>@@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue
>*queue)
> queue->nr_cqe = 0;
> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
> release_sock(sk);
>- return consumed;
>+ return consumed == -EAGAIN ? 0 : consumed;
> }
>
> static void nvme_tcp_io_work(struct work_struct *w)
>--
>
>[2 based on your partial write fix]:
>diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>index 4eb9a2dec07e..daf59e75cf15 100644
>--- a/drivers/nvme/host/tcp.c
>+++ b/drivers/nvme/host/tcp.c
>@@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
> NVME_TCP_Q_LIVE = 1,
> NVME_TCP_Q_POLLING = 2,
> NVME_TCP_Q_IO_CPU_SET = 3,
>+ NVME_TCP_Q_WAKE_SENDER = 4,
> };
>
> enum nvme_tcp_recv_state {
>@@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
> queue = sk->sk_user_data;
> if (likely(queue && sk_stream_is_writeable(sk))) {
> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>+ set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> }
> read_unlock_bh(&sk->sk_callback_lock);
>@@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
> container_of(w, struct nvme_tcp_queue, io_work);
> unsigned long deadline = jiffies + msecs_to_jiffies(1);
>
>+ clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
> do {
> bool pending = false;
> int result;
>@@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
> else if (unlikely(result < 0))
> return;
>
>- if (!pending || !queue->rd_enabled)
>+ /* did we get some space after spending time in recv ? */
>+ if (nvme_tcp_queue_has_pending(queue) &&
>+ sk_stream_is_writeable(queue->sock->sk))
>+ pending = true;
>+
>+ if (!queue->rd_enabled)
>+ return;
>+
>+ if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER,
>&queue->flags))
> return;
>
> } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>--
>
-Kamaljit
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-17 23:03 ` Kamaljit Singh
2025-04-17 23:06 ` Kamaljit Singh
@ 2025-04-18 10:51 ` Sagi Grimberg
2025-04-18 18:55 ` Kamaljit Singh
1 sibling, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-18 10:51 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 4/18/25 02:03, Kamaljit Singh wrote:
> Sagi,
> I tried both of these patches but looks like #1 causes an infinite loop. dmesg was full of panics.
> I had tried just #1 and later with #1+#2. Both failed the same way.
That's no good :)
Can you share the panics? This version should be identical to what
Hannes introduced in:
@@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
result = nvme_tcp_try_recv(queue);
if (result > 0)
pending = true;
- else if (unlikely(result < 0))
+ else if (unlikely(result < 0) && result != -EAGAIN)
return;
>
>> How about these two (untested) patches:
>> [1 based on your recv-side fix]:
>> --diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 72d260201d8c..4eb9a2dec07e 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue
>> *queue)
>> queue->nr_cqe = 0;
>> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>> release_sock(sk);
>> - return consumed;
>> + return consumed == -EAGAIN ? 0 : consumed;
>> }
>>
>> static void nvme_tcp_io_work(struct work_struct *w)
>> --
>>
>> [2 based on your partial write fix]:
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 4eb9a2dec07e..daf59e75cf15 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>> NVME_TCP_Q_LIVE = 1,
>> NVME_TCP_Q_POLLING = 2,
>> NVME_TCP_Q_IO_CPU_SET = 3,
>> + NVME_TCP_Q_WAKE_SENDER = 4,
>> };
>>
>> enum nvme_tcp_recv_state {
>> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>> queue = sk->sk_user_data;
>> if (likely(queue && sk_stream_is_writeable(sk))) {
>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>> + set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> }
>> read_unlock_bh(&sk->sk_callback_lock);
>> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> container_of(w, struct nvme_tcp_queue, io_work);
>> unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>
>> + clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>> do {
>> bool pending = false;
>> int result;
>> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> else if (unlikely(result < 0))
>> return;
>>
>> - if (!pending || !queue->rd_enabled)
>> + /* did we get some space after spending time in recv ? */
>> + if (nvme_tcp_queue_has_pending(queue) &&
>> + sk_stream_is_writeable(queue->sock->sk))
>> + pending = true;
>> +
>> + if (!queue->rd_enabled)
>> + return;
>> +
>> + if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER,
>> &queue->flags))
>> return;
>>
>> } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>> --
>>
> -Kamaljit
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-18 10:51 ` Sagi Grimberg
@ 2025-04-18 18:55 ` Kamaljit Singh
2025-04-19 16:10 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-04-18 18:55 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Sagi,
From: Sagi Grimberg <sagi@grimberg.me>
Date: Friday, April 18, 2025 at 03:51
>> Sagi,
>> I tried both of these patches but looks like #1 causes an infinite loop. dmesg was full of panics.
>> I had tried just #1 and later with #1+#2. Both failed the same way.
>That's no good :)
>Can you share the panics? This version should be identical to what
>Hannes introduced in:
Sorry, I'm posting ~200 lines at the bottom, starting from the beginning of the panic until the next "---cut here---". I can send the whole dmesg as an attachment as well. Please let me know.
The patches I applied are these:
1: nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
2: nvme-tcp: recv-side EAGAIN untested fix by Sagi
3: nvme-tcp: fix I/O stalls on congested sockets
4: nvme-tcp: sanitize request list handling
5: nvme-tcp: avoid inline sending when handling R2T PDUs
6: tls: use independent record sz for data sends <--- this is my internal/temp fix for a TLS send issue we found
You can probably ignore #6.
[1] nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 1839d2110c0d..12a790b534aa 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
NVME_TCP_Q_LIVE = 1,
NVME_TCP_Q_POLLING = 2,
NVME_TCP_Q_IO_CPU_SET = 3,
+ NVME_TCP_Q_WAKE_SENDER = 4,
};
enum nvme_tcp_recv_state {
@@ -1074,6 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
queue = sk->sk_user_data;
if (likely(queue && sk_stream_is_writeable(sk))) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+ set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
read_unlock_bh(&sk->sk_callback_lock);
@@ -1368,6 +1370,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
container_of(w, struct nvme_tcp_queue, io_work);
unsigned long deadline = jiffies + msecs_to_jiffies(1);
+ clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
do {
bool pending = false;
int result;
@@ -1387,10 +1390,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0) && result != -EAGAIN)
return;
- if (nvme_tcp_queue_has_pending(queue))
+ /* did we get some space after spending time in recv ? */
+ if (nvme_tcp_queue_has_pending(queue) &&
+ sk_stream_is_writeable(queue->sock->sk))
pending = true;
- if (!pending || !queue->rd_enabled)
+ if (!queue->rd_enabled)
+ return;
+
+ if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
return;
} while (!time_after(jiffies, deadline)); /* quota is exhausted */
[2] nvme-tcp: recv-side EAGAIN untested fix by Sagi
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index d3f12b17e68e..1839d2110c0d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1359,7 +1359,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
queue->nr_cqe = 0;
consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
release_sock(sk);
- return consumed;
+ return consumed == -EAGAIN ? 0 : consumed;
}
static void nvme_tcp_io_work(struct work_struct *w)
[3] nvme-tcp: fix I/O stalls on congested sockets (by Hannes R.)
When the socket is busy processing nvme_tcp_try_recv() might
return -EAGAIN, but this doesn't automatically imply that
the sending side is blocked, too.
So check if there are pending requests once nvme_tcp_try_recv()
returns -EAGAIN and continue with the sending loop to avoid
I/O stalls.
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4647d5d547f8..d3f12b17e68e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1384,9 +1384,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
result = nvme_tcp_try_recv(queue);
if (result > 0)
pending = true;
- else if (unlikely(result < 0))
+ else if (unlikely(result < 0) && result != -EAGAIN)
return;
+ if (nvme_tcp_queue_has_pending(queue))
+ pending = true;
+
if (!pending || !queue->rd_enabled)
return;
[4] nvme-tcp: sanitize request list handling
Validate the request in nvme_tcp_handle_r2t() to ensure it's not
part of any list, otherwise a malicious R2T PDU might inject a
loop in request list processing.
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 40dce5dc3ff5..4647d5d547f8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -455,6 +455,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
}
list_del(&req->entry);
+ init_llist_node(&req->lentry);
return req;
}
@@ -562,6 +563,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
req->queue = queue;
nvme_req(rq)->ctrl = &ctrl->ctrl;
nvme_req(rq)->cmd = &pdu->cmd;
+ init_llist_node(&req->lentry);
+ INIT_LIST_HEAD(&req->entry);
return 0;
}
@@ -773,8 +776,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
nvme_tcp_setup_h2c_data_pdu(req);
+ WARN_ON(queue->request == req);
+ WARN_ON(llist_on_list(&req->lentry));
+ WARN_ON(!list_empty(&req->entry));
llist_add(&req->lentry, &queue->req_list);
- queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+ if (list_empty(&queue->send_list))
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
return 0;
}
@@ -2632,6 +2639,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.offset = 0;
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
+ init_llist_node(&ctrl->async_req.lentry);
+ INIT_LIST_HEAD(&ctrl->async_req.entry);
nvme_tcp_queue_request(&ctrl->async_req, true);
}
[5] nvme-tcp: avoid inline sending when handling R2T PDUs (by Hannes R.)
When handling an R2T PDU we should not attempt to send consecutive
PDUs as we are running from an softirq context, and sending PDUs
from the receive context will mess up latencies.
So just queue it and let the io_work workqueue function do the work.
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 14047b23a559..40dce5dc3ff5 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -405,7 +405,7 @@ static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
}
static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
- bool sync, bool last)
+ bool last)
{
struct nvme_tcp_queue *queue = req->queue;
bool empty;
@@ -419,7 +419,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
* are on the same cpu, so we don't introduce contention.
*/
if (queue->io_cpu == raw_smp_processor_id() &&
- sync && empty && mutex_trylock(&queue->send_mutex)) {
+ empty && mutex_trylock(&queue->send_mutex)) {
nvme_tcp_send_all(queue);
mutex_unlock(&queue->send_mutex);
}
@@ -772,7 +772,9 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
req->ttag = pdu->ttag;
nvme_tcp_setup_h2c_data_pdu(req);
- nvme_tcp_queue_request(req, false, true);
+
+ llist_add(&req->lentry, &queue->req_list);
+ queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
return 0;
}
@@ -2631,7 +2633,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
ctrl->async_req.curr_bio = NULL;
ctrl->async_req.data_len = 0;
- nvme_tcp_queue_request(&ctrl->async_req, true, true);
+ nvme_tcp_queue_request(&ctrl->async_req, true);
}
static void nvme_tcp_complete_timed_out(struct request *rq)
@@ -2783,7 +2785,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
nvme_start_request(rq);
- nvme_tcp_queue_request(req, true, bd->last);
+ nvme_tcp_queue_request(req, bd->last);
return BLK_STS_OK;
}
[6] tls: use independent record sz for data sends (by Kamaljit)
Author: Kamaljit Singh <kamaljit.singh1@wdc.com>
Date: Wed Apr 9 15:11:54 2025 -0700
tls: use independent record sz for data sends
[Hack] For data sends, use a smaller default record size of 4096 bytes instead
of the common macro TLS_MAX_PAYLOAD_SIZE, which applies to both directions.
Add a new user-configurable module parameter, max_send_record_bytes, to allow
testing with values other than the 4096-byte default.
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index f672a62a9a52..968dde6848ad 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -55,6 +55,8 @@ static DEFINE_SPINLOCK(tls_device_lock);
static struct page *dummy_page;
+extern unsigned int max_send_record_bytes;
+
static void tls_device_free_ctx(struct tls_context *ctx)
{
if (ctx->tx_conf == TLS_HW)
@@ -459,7 +461,7 @@ static int tls_push_data(struct sock *sk,
/* TLS_HEADER_SIZE is not counted as part of the TLS record, and
* we need to leave room for an authentication tag.
*/
- max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
+ max_open_record_len = max_send_record_bytes +
prot->prepend_size;
do {
rc = tls_do_allocation(sk, ctx, pfrag, prot->prepend_size);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index cb86b0bf9a53..6f2c42e5b603 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -52,6 +52,11 @@ MODULE_DESCRIPTION("Transport Layer Security Support");
MODULE_LICENSE("Dual BSD/GPL");
MODULE_ALIAS_TCP_ULP("tls");
+unsigned int max_send_record_bytes = 4096;
+module_param(max_send_record_bytes, uint, 0644);
+MODULE_PARM_DESC(max_send_record_bytes , " max send record size in bytes (default 4096)");
+EXPORT_SYMBOL_GPL(max_send_record_bytes);
+
enum {
TLSV4,
TLSV6,
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 914d4e1516a3..45486773ad07 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -48,6 +48,8 @@
#include "tls.h"
+extern unsigned int max_send_record_bytes;
+
struct tls_decrypt_arg { struct_group(inargs,
bool zc;
@@ -1059,7 +1061,7 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
orig_size = msg_pl->sg.size;
full_record = false;
try_to_copy = msg_data_left(msg);
- record_room = TLS_MAX_PAYLOAD_SIZE - msg_pl->sg.size;
+ record_room = max_send_record_bytes - msg_pl->sg.size;
if (try_to_copy >= record_room) {
try_to_copy = record_room;
full_record = true;
================================
Snapshot of panic messages:
...
[2025-04-16 22:55:32.992] [Wed Apr 16 22:55:34 2025] nvme nvme0: QID 0x0: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac6,dport=0x1144
[2025-04-16 22:55:33.445] [Wed Apr 16 22:55:34 2025] nvme nvme0: D3 entry latency set to 16 seconds
[2025-04-16 22:55:33.898] [Wed Apr 16 22:55:35 2025] nvme nvme0: queue_size 128 > ctrl sqsize 17, clamping down
[2025-04-16 22:55:33.929] [Wed Apr 16 22:55:35 2025] nvme nvme0: creating 3 I/O queues.
[2025-04-16 22:55:35.398] [Wed Apr 16 22:55:36 2025] nvme nvme0: mapped 3/0/0 default/read/poll queues.
[2025-04-16 22:55:35.445] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x1: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac8,dport=0x1144
[2025-04-16 22:55:35.508] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x2: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad4,dport=0x1144
[2025-04-16 22:55:35.554] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x3: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad8,dport=0x1144
[2025-04-16 22:55:35.632] [Wed Apr 16 22:55:37 2025] nvme nvme0: new ctrl: NQN "nqn.2015-09.com.wdc:nvme.1", addr 10.10.10.223:4420, hostnqn: nqn.2014-08.org.nvmexpress:host.0
[2025-04-16 22:55:36.195] [Wed Apr 16 22:55:37 2025] nvme nvme0: Failed to get ANA log: 16386
[2025-04-16 22:55:43.023]
[2025-04-16 22:55:43.492]
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] Modules linked in: nvme_tcp(OE) nvme_rdma(E) nvme_fabrics(E) nvme(E) nvme_core(E) nvme_keyring(E) nvme_auth(E) xt_tcpudp(E) nft_compat(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) cmac(E) nls_utf8(E) nfs(E) cifs(E) lockd(E) cifs_arc4(E) nls_ucs2_utils(E) grace(E) cifs_md4(E) nf_tables(E) netfs(E) cfg80211(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) rpcrdma(E) amd64_edac(E) rdma_ucm(E) edac_mce_amd(E) scsi_transport_iscsi(E) rdma_cm(E) ib_umad(E) ib_ipoib(E) iw_cm(E) kvm_amd(E) ib_cm(E) kvm(E) ftdi_sio(E) ast(E)
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] usbserial(E)
[2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] rapl(E) i2c_algo_bit(E) acpi_ipmi(E) ccp(E) ipmi_si(E) i2c_piix4(E) ipmi_devintf(E) k10temp(E) ptdma(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 44 PID: 1609 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] i2c_smbus(E) joydev(E) input_leds(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] Modules linked in:
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ipmi_msghandler(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_tcp(OE) nvme_rdma(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] mac_hid(E) bnxt_re(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_fabrics(E) nvme(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] sunrpc(E) binfmt_misc(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_core(E) nvme_keyring(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] sch_fq_codel(E) efi_pstore(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_auth(E) xt_tcpudp(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nfnetlink(E) dmi_sysfs(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nft_compat(E) rpcsec_gss_krb5(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ip_tables(E) x_tables(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] auth_rpcgss(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] autofs4(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nfsv4(E) cmac(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] btrfs(E) blake2b_generic(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nls_utf8(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] zstd_compress(E) raid10(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nfs(E) cifs(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] raid456(E) async_raid6_recov(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] lockd(E) cifs_arc4(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] async_memcpy(E) async_pq(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nls_ucs2_utils(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] async_xor(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] grace(E) cifs_md4(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] async_tx(E) xor(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nf_tables(E) netfs(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] raid6_pq(E) raid1(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] cfg80211(E) ipmi_ssif(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] raid0(E) mlx5_ib(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] amd_atl(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ib_uverbs(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] intel_rapl_msr(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] macsec(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] intel_rapl_common(E) rpcrdma(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ib_core(E) mlx5_core(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] amd64_edac(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] mlxfw(E) psample(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] rdma_ucm(E) edac_mce_amd(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] polyval_clmulni(E) polyval_generic(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] scsi_transport_iscsi(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ghash_clmulni_intel(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 0 PID: 832 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] sha256_ssse3(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] rdma_cm(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] Modules linked in:
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ib_umad(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] sha1_ssse3(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_tcp(OE)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ahci(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_rdma(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] ib_ipoib(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] bnxt_en(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme_fabrics(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] iw_cm(E)
[2025-04-16 22:56:24.773] [Wed Apr 16 22:56:26 2025] nvme(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] tls(E) pci_hyperv_intf(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] nvme_core(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] kvm_amd(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] ib_cm(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] nvme_keyring(E) nvme_auth(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] kvm(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] libahci(E) hid_generic(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] ftdi_sio(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] xt_tcpudp(E) nft_compat(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] ast(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] usbhid(E) hid(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] usbserial(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] rpcsec_gss_krb5(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] rapl(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] aesni_intel(E)
[2025-04-16 22:56:24.805] [Wed Apr 16 22:56:26 2025] auth_rpcgss(E) nfsv4(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] crypto_simd(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] cmac(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] i2c_algo_bit(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] acpi_ipmi(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] nls_utf8(E) nfs(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] cryptd(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] cifs(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] [last unloaded: nvme_tcp(E)]
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] ccp(E) ipmi_si(E)
[2025-04-16 22:56:24.820]
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] i2c_piix4(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] lockd(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] ipmi_devintf(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] cifs_arc4(E) nls_ucs2_utils(E)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] CPU: 22 UID: 0 PID: 134878 Comm: kworker/22:0H Tainted: G OE 6.14.0-qp-sdprt-adc-49-50-51-52-ktls+ #1 PREEMPT(voluntary)
[2025-04-16 22:56:24.820] [Wed Apr 16 22:56:26 2025] k10temp(E) ptdma(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] grace(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] i2c_smbus(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] cifs_md4(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] Hardware name: Supermicro AS -1114S-WTRT/H12SSW-NT, BIOS 1.0b 11/15/2019
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] joydev(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] nf_tables(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] input_leds(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] ipmi_msghandler(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] netfs(E) cfg80211(E)
[2025-04-16 22:56:24.835]
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] mac_hid(E) bnxt_re(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] ipmi_ssif(E)
[2025-04-16 22:56:24.835] [Wed Apr 16 22:56:26 2025] RIP: 0010:nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] amd_atl(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] sunrpc(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] Code: 00 41 8b 57 20 44 89 d1 4d 89 c8 48 c7 c6 88 d1 7a c1 e8 a5 b6 2f c9 e9 cc fb ff ff 0f 0b e9 28 fb ff ff 0f 0b e9 35 fb ff ff <0f> 0b e9 45 fb ff ff 49 8b 87 28 01 00 00 89 d1 89 f2 48 c7 c6 10
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] binfmt_misc(E) sch_fq_codel(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] intel_rapl_msr(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] RSP: 0018:ffffcdccf87afc80 EFLAGS: 00010293
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] intel_rapl_common(E)
[2025-04-16 22:56:24.851]
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] efi_pstore(E) nfnetlink(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] rpcrdma(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] RAX: dead000000000100 RBX: ffff8dbda9193680 RCX: 0000000000000000
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] amd64_edac(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8dbd986b14a8
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] dmi_sysfs(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] rdma_ucm(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] ip_tables(E) x_tables(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] edac_mce_amd(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] RBP: ffffcdccf87afd08 R08: 0000000000008000 R09: 0000000000000000
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] scsi_transport_iscsi(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] R10: 0000000000000000 R11: 000000000000001c R12: ffff8dbe027c8290
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] autofs4(E) btrfs(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] R13: 0000000000000005 R14: 000000000000001c R15: ffff8dbd986b1498
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] rdma_cm(E) ib_umad(E)
[2025-04-16 22:56:24.851] [Wed Apr 16 22:56:26 2025] FS: 0000000000000000(0000) GS:ffff8e3b3f1b4000(0000) knlGS:0000000000000000
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] blake2b_generic(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] ib_ipoib(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] zstd_compress(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] raid10(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] iw_cm(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] kvm_amd(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] CR2: 00005e9c7424be98 CR3: 00000002357ec000 CR4: 0000000000350ef0
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] raid456(E) async_raid6_recov(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] Call Trace:
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] ib_cm(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] async_memcpy(E)
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] <TASK>
[2025-04-16 22:56:24.867] [Wed Apr 16 22:56:26 2025] async_pq(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] kvm(E) ftdi_sio(E) ast(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] async_xor(E) async_tx(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] usbserial(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] xor(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] tls_sw_read_sock+0x12c/0x4a0 [tls]
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] rapl(E) i2c_algo_bit(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] raid6_pq(E) raid1(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] acpi_ipmi(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] ? __pfx_nvme_tcp_recv_skb+0x10/0x10 [nvme_tcp]
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] ccp(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] raid0(E) mlx5_ib(E) ib_uverbs(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] ipmi_si(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] nvme_tcp_try_recv+0x78/0xc0 [nvme_tcp]
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] i2c_piix4(E)
[2025-04-16 22:56:24.882] [Wed Apr 16 22:56:26 2025] macsec(E) ib_core(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] ipmi_devintf(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] nvme_tcp_io_work+0xc1/0x1e0 [nvme_tcp]
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] k10temp(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] mlx5_core(E) mlxfw(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] ptdma(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] process_one_work+0x191/0x3e0
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] i2c_smbus(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] psample(E) polyval_clmulni(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] joydev(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] worker_thread+0x2e3/0x420
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] polyval_generic(E) ghash_clmulni_intel(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] input_leds(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] sha256_ssse3(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] ? srso_return_thunk+0x5/0x5f
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] sha1_ssse3(E)
[2025-04-16 22:56:24.898] [Wed Apr 16 22:56:26 2025] ipmi_msghandler(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] ? __pfx_worker_thread+0x10/0x10
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] mac_hid(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] ahci(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] kthread+0x10d/0x230
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] bnxt_re(E) sunrpc(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] bnxt_en(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] ? __pfx_kthread+0x10/0x10
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] tls(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] binfmt_misc(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] pci_hyperv_intf(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] ret_from_fork+0x47/0x70
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] sch_fq_codel(E) efi_pstore(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] libahci(E)
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] ? __pfx_kthread+0x10/0x10
[2025-04-16 22:56:24.914] [Wed Apr 16 22:56:26 2025] hid_generic(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] nfnetlink(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] ret_from_fork_asm+0x1a/0x30
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] usbhid(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] dmi_sysfs(E) ip_tables(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] hid(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] </TASK>
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] aesni_intel(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] x_tables(E) autofs4(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] crypto_simd(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] btrfs(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] blake2b_generic(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] cryptd(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] ---[ end trace 0000000000000000 ]---
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] [last unloaded: nvme_tcp(E)]
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] zstd_compress(E)
[2025-04-16 22:56:24.929]
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] CPU: 44 UID: 0 PID: 1609 Comm: kworker/44:1H Tainted: G W OE 6.14.0-qp-sdprt-adc-49-50-51-52-ktls+ #1 PREEMPT(voluntary)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] async_pq(E) async_xor(E)
[2025-04-16 22:56:24.929] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
...
...
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-18 18:55 ` Kamaljit Singh
@ 2025-04-19 16:10 ` Sagi Grimberg
2025-04-21 19:11 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-19 16:10 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 4/18/25 21:55, Kamaljit Singh wrote:
> Sagi,
>
> From: Sagi Grimberg <sagi@grimberg.me>
> Date: Friday, April 18, 2025 at 03:51
>>> Sagi,
>>> I tried both of these patches but looks like #1 causes an infinite loop. dmesg was full of panics.
>>> I had tried just #1 and later with #1+#2. Both failed the same way.
>
>> That's no good :)
>
>> Can you share the panics? This version should be identical to what
>> Hannes introduced in:
> Sorry, I'm posting ~200 lines at the bottom, starting from the beginning of the panic until the next "---cut here---". I can send the whole dmesg as an attachment as well. Please let me know.
No need.
However, the log prints are WARN messages that do not exist in any of
the patches.
>
> The patches I applied are these:
> 1: nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
> 2: nvme-tcp: recv-side EAGAIN untested fix by Sagi
> 3: nvme-tcp: fix I/O stalls on congested sockets
The patch I suggested was a replacement for patch (3). Didn't you
get conflicts?
> 4: nvme-tcp: sanitize request list handling
> 5: nvme-tcp: avoid inline sending when handling R2T PDUs
> 6: tls: use independent record sz for data sends <--- this is my internal/temp fix for a TLS send issue we found
> You can probably ignore #6.
Ignoring.
> [4] nvme-tcp: sanitize request list handling
>
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 40dce5dc3ff5..4647d5d547f8 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -455,6 +455,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
> }
>
> list_del(&req->entry);
> + init_llist_node(&req->lentry);
> return req;
> }
>
> @@ -562,6 +563,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
> req->queue = queue;
> nvme_req(rq)->ctrl = &ctrl->ctrl;
> nvme_req(rq)->cmd = &pdu->cmd;
> + init_llist_node(&req->lentry);
> + INIT_LIST_HEAD(&req->entry);
>
> return 0;
> }
> @@ -773,8 +776,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>
> nvme_tcp_setup_h2c_data_pdu(req);
>
> + WARN_ON(queue->request == req);
> + WARN_ON(llist_on_list(&req->lentry));
> + WARN_ON(!list_empty(&req->entry));
These are not included in Hannes' patch afaict.
I'm assuming that these are the stack traces that are firing...
Can you please share the output of gdb:
l *nvme_tcp_recv_skb+0x115e
My assumption is that in your case, queue->request is still referenced
with the current request when the r2t for it arrives, which is very much
possible...
So this means that the patch "nvme-tcp: sanitize request list handling"
is wrong, queue->request may still be referenced when its r2t arrives.
I think this patch needs the following:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f2c829f1aff6..c2d4893ecb94 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -767,8 +767,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue
*queue,
return -EPROTO;
}
- if (queue->request == req ||
- llist_on_list(&req->lentry) ||
+ if (llist_on_list(&req->lentry) ||
!list_empty(&req->entry)) {
dev_err(queue->ctrl->ctrl.device,
"req %d unexpected r2t while processing request\n",
--
Or in your case - remove the above:
--
WARN_ON(queue->request == req);
--
> ================================
> Snapshot of panic messages:
>
> ...
> [2025-04-16 22:55:32.992] [Wed Apr 16 22:55:34 2025] nvme nvme0: QID 0x0: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac6,dport=0x1144
> [2025-04-16 22:55:33.445] [Wed Apr 16 22:55:34 2025] nvme nvme0: D3 entry latency set to 16 seconds
> [2025-04-16 22:55:33.898] [Wed Apr 16 22:55:35 2025] nvme nvme0: queue_size 128 > ctrl sqsize 17, clamping down
> [2025-04-16 22:55:33.929] [Wed Apr 16 22:55:35 2025] nvme nvme0: creating 3 I/O queues.
> [2025-04-16 22:55:35.398] [Wed Apr 16 22:55:36 2025] nvme nvme0: mapped 3/0/0 default/read/poll queues.
> [2025-04-16 22:55:35.445] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x1: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac8,dport=0x1144
> [2025-04-16 22:55:35.508] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x2: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad4,dport=0x1144
> [2025-04-16 22:55:35.554] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x3: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad8,dport=0x1144
> [2025-04-16 22:55:35.632] [Wed Apr 16 22:55:37 2025] nvme nvme0: new ctrl: NQN "nqn.2015-09.com.wdc:nvme.1", addr 10.10.10.223:4420, hostnqn: nqn.2014-08.org.nvmexpress:host.0
> [2025-04-16 22:55:36.195] [Wed Apr 16 22:55:37 2025] nvme nvme0: Failed to get ANA log: 16386
> [2025-04-16 22:55:43.023]
> [2025-04-16 22:55:43.492]
> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
My assumption is that this is a result of
WARN_ON(queue->request == req);
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-19 16:10 ` Sagi Grimberg
@ 2025-04-21 19:11 ` Kamaljit Singh
2025-04-22 11:43 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-04-21 19:11 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Sagi,
>However, the log prints are WARN messages that do not exist in any of
>the patches.
Just found the issue with my patch integration. I had picked up Hannes' patch in email
titled "[PATCH 2/3] nvme-tcp: sanitize request list handling" and dated Fri Match 7th.
This had the 3 WARN_ON msgs.
I just found another patch-set with almost the same title dated March 27th.
I'll integrate this one "[PATCH 2/5] nvme-tcp: sanitize request list handling".
This replaced those WARN_ON msgs w/ dev_err() after if-checks for the same
conditions.
>> The patches I applied are these:
>> 1: nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
>> 2: nvme-tcp: recv-side EAGAIN untested fix by Sagi
>> 3: nvme-tcp: fix I/O stalls on congested sockets
>
>The patch I suggested was a replacement for patch (3). Didn't you
>get conflicts?
No. But I reviewed it again and have undone this part of patch-3 in my setup.
- else if (unlikely(result < 0))
+ else if (unlikely(result < 0) && result != -EAGAIN)
>> 4: nvme-tcp: sanitize request list handling
>> 5: nvme-tcp: avoid inline sending when handling R2T PDUs
>
>> [4] nvme-tcp: sanitize request list handling
>>
>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>> part of any list, otherwise a malicious R2T PDU might inject a
>> loop in request list processing.
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 40dce5dc3ff5..4647d5d547f8 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -455,6 +455,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
>> }
>>
>> list_del(&req->entry);
>> + init_llist_node(&req->lentry);
>> return req;
>> }
>>
>> @@ -562,6 +563,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
>> req->queue = queue;
>> nvme_req(rq)->ctrl = &ctrl->ctrl;
>> nvme_req(rq)->cmd = &pdu->cmd;
>> + init_llist_node(&req->lentry);
>> + INIT_LIST_HEAD(&req->entry);
>>
>> return 0;
>> }
>> @@ -773,8 +776,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>
>> nvme_tcp_setup_h2c_data_pdu(req);
>>
>> + WARN_ON(queue->request == req);
>> + WARN_ON(llist_on_list(&req->lentry));
>> + WARN_ON(!list_empty(&req->entry));
>
>These are not included in Hannes patch afaict.
They were, as I noted above.
>I'm assuming that these are the stack traces that are firing...
>
>Can you please share the output of gdb:
>l *nvme_tcp_recv_skb+0x115e
It points to the 3rd WARN_ON msg. I've replaced those warnings with the patch you suggested below.
(gdb) l *(nvme_tcp_recv_skb+0x115e)
0x628e is in nvme_tcp_recv_skb (tcp.c:782).
777
778 nvme_tcp_setup_h2c_data_pdu(req);
779
780 WARN_ON(queue->request == req);
781 WARN_ON(llist_on_list(&req->lentry));
782 WARN_ON(!list_empty(&req->entry));
783 llist_add(&req->lentry, &queue->req_list);
784 if (list_empty(&queue->send_list))
785 queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>
>My assumption is that in your case, queue->request is still referenced
>with the current request when the r2t for it arrives, which is very much
>possible...
>
>So this means that the patch "nvme-tcp: sanitize request list handling"
>is wrong, queue->request may still be referenced when its r2t arrives.
>
>I think this patch needs the following:
>--
>diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>index f2c829f1aff6..c2d4893ecb94 100644
>--- a/drivers/nvme/host/tcp.c
>+++ b/drivers/nvme/host/tcp.c
>@@ -767,8 +767,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue
>*queue,
> return -EPROTO;
> }
>
>- if (queue->request == req ||
>- llist_on_list(&req->lentry) ||
>+ if (llist_on_list(&req->lentry) ||
> !list_empty(&req->entry)) {
> dev_err(queue->ctrl->ctrl.device,
> "req %d unexpected r2t while processing request\n",
>--
>
>Or in your case - remove the above:
>--
> WARN_ON(queue->request == req);
>--
Made that change. Tested and seeing this. -71 is the -EPROTO that patch-2 returns.
Trying to narrow it down further. If you have any suggestions, let me know.
[Mon Apr 21 11:18:58 2025] nvme nvme0: req 7 unexpected r2t while processing request
[Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
[Mon Apr 21 11:18:58 2025] nvme nvme0: starting error recovery
[Mon Apr 21 11:18:58 2025] nvme nvme0: req 11 unexpected r2t while processing request
[Mon Apr 21 11:18:58 2025] block nvme0n1: no usable path - requeuing I/O
[Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
>> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
>
>My assumption is that this is a result of
>WARN_ON(queue->request == req);
From the gdb decode it points to WARN_ON(!list_empty(&req->entry))
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
2025-04-13 22:56 ` Sagi Grimberg
@ 2025-04-22 8:09 ` Christoph Hellwig
1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2025-04-22 8:09 UTC (permalink / raw)
To: Hannes Reinecke; +Cc: Christoph Hellwig, Sagi Grimberg, Keith Busch, linux-nvme
On Thu, Apr 03, 2025 at 08:55:20AM +0200, Hannes Reinecke wrote:
> When handling an R2T PDU we short-circuit nvme_tcp_queue_request()
> as we should not attempt to send consecutive PDUs. So open-code
> nvme_tcp_queue_request() for R2T and drop the last argument.
I've added this patch to nvme-6.16. I'm waiting for the discussion
to conclude or a resend on the others.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-21 19:11 ` Kamaljit Singh
@ 2025-04-22 11:43 ` Sagi Grimberg
2025-04-23 17:26 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-22 11:43 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 21/04/2025 22:11, Kamaljit Singh wrote:
> Sagi,
>
>> However, the log prints are WARN messages that do not exist in any of
>> the patches.
> Just found the issue with my patch integration. I had picked up Hannes' patch in email
> titled "[PATCH 2/3] nvme-tcp: sanitize request list handling" and dated Fri Match 7th.
> This had the 3 WARN_ON msgs.
>
> I just found another patch-set with almost the same title dated March 27th.
> I'll integrate this one "[PATCH 2/5] nvme-tcp: sanitize request list handling".
> This replaced those WARN_ON msgs w/ dev_err() after if-checks for the same
> conditions.
>
>>> The patches I applied are these:
>>> 1: nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
>>> 2: nvme-tcp: recv-side EAGAIN untested fix by Sagi
>>> 3: nvme-tcp: fix I/O stalls on congested sockets
>> The patch I suggested was a replacement for patch (3). Didn't you
>> get conflicts?
> No. But I reviewed it again and have undone this part of patch-3 in my setup.
> - else if (unlikely(result < 0))
> + else if (unlikely(result < 0) && result != -EAGAIN)
>
>
>>> 4: nvme-tcp: sanitize request list handling
>>> 5: nvme-tcp: avoid inline sending when handling R2T PDUs
>>> [4] nvme-tcp: sanitize request list handling
>>>
>>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>>> part of any list, otherwise a malicious R2T PDU might inject a
>>> loop in request list processing.
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 40dce5dc3ff5..4647d5d547f8 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -455,6 +455,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
>>> }
>>>
>>> list_del(&req->entry);
>>> + init_llist_node(&req->lentry);
>>> return req;
>>> }
>>>
>>> @@ -562,6 +563,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
>>> req->queue = queue;
>>> nvme_req(rq)->ctrl = &ctrl->ctrl;
>>> nvme_req(rq)->cmd = &pdu->cmd;
>>> + init_llist_node(&req->lentry);
>>> + INIT_LIST_HEAD(&req->entry);
>>>
>>> return 0;
>>> }
>>> @@ -773,8 +776,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>>
>>> nvme_tcp_setup_h2c_data_pdu(req);
>>>
>>> + WARN_ON(queue->request == req);
>>> + WARN_ON(llist_on_list(&req->lentry));
>>> + WARN_ON(!list_empty(&req->entry));
>> These are not included in Hannes' patch afaict.
> They were, as I noted above.
>
>
>> I'm assuming that these are the stack traces that are firing...
>>
>> Can you please share the output of gdb:
>> l *nvme_tcp_recv_skb+0x115e
> It points to the 3rd WARN_ON msg. I've replaced those warnings with the patch you suggested below.
>
> (gdb) l *(nvme_tcp_recv_skb+0x115e)
> 0x628e is in nvme_tcp_recv_skb (tcp.c:782).
> 777
> 778 nvme_tcp_setup_h2c_data_pdu(req);
> 779
> 780 WARN_ON(queue->request == req);
> 781 WARN_ON(llist_on_list(&req->lentry));
> 782 WARN_ON(!list_empty(&req->entry));
> 783 llist_add(&req->lentry, &queue->req_list);
> 784 if (list_empty(&queue->send_list))
> 785 queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
That is strange - should not happen afaict. You did not change the code
and recompile before checking this, correct?
>> My assumption is that in your case, queue->request is still referenced
>> with the current request when the r2t for it arrives, which is very much
>> possible...
>>
>> So this means that the patch "nvme-tcp: sanitize request list handling"
>> is wrong, queue->request may still be referenced when its r2t arrives.
>>
>> I think this patch needs the following:
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index f2c829f1aff6..c2d4893ecb94 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -767,8 +767,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue
>> *queue,
>> return -EPROTO;
>> }
>>
>> - if (queue->request == req ||
>> - llist_on_list(&req->lentry) ||
>> + if (llist_on_list(&req->lentry) ||
>> !list_empty(&req->entry)) {
>> dev_err(queue->ctrl->ctrl.device,
>> "req %d unexpected r2t while processing request\n",
>> --
>>
>> Or in your case - remove the above:
>> --
>> WARN_ON(queue->request == req);
>> --
> Made that change. Tested and seeing this. -71 is the -EPROTO that patch-2 returns.
> Trying to narrow it down further. If you have any suggestions, let me know.
>
> [Mon Apr 21 11:18:58 2025] nvme nvme0: req 7 unexpected r2t while processing request
> [Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
> [Mon Apr 21 11:18:58 2025] nvme nvme0: starting error recovery
> [Mon Apr 21 11:18:58 2025] nvme nvme0: req 11 unexpected r2t while processing request
> [Mon Apr 21 11:18:58 2025] block nvme0n1: no usable path - requeuing I/O
> [Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
>
>
>>> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
>> My assumption is that this is a result of
>> WARN_ON(queue->request == req);
> From the gdb decode it points to WARN_ON(!list_empty(&req->entry))
Can you please apply this on top:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9453cf453d03..334154a58bdb 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -453,7 +453,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
return NULL;
}
- list_del(&req->entry);
+ list_del_init(&req->entry);
init_llist_node(&req->lentry);
return req;
}
--
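The reason list_del_init() matters here is easy to miss: plain list_del() only
unlinks the entry (and poisons its pointers), so a later list_empty(&req->entry)
check still reports the entry as sitting on a list - exactly the condition the
unexpected-R2T check trips over. list_del_init() re-points the entry at itself,
so list_empty() becomes true again and the request can be re-queued cleanly.
The user-space sketch below is only a simplified model of those list_head
semantics (the names mirror the kernel helpers, but the bodies are minimal
stand-ins and the poisoning step is omitted); it illustrates the behaviour, not
the kernel implementation:
--
#include <stdbool.h>
#include <stdio.h>

/* Minimal stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static bool list_empty(const struct list_head *h)
{
	/* An entry that is on no list points at itself. */
	return h->next == h;
}

/* Plain list_del(): unlink only. The removed entry keeps pointing at its
 * old neighbours (the kernel additionally poisons the pointers), so
 * list_empty(&entry) stays false afterwards. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* list_del_init(): unlink and re-initialize, so list_empty(&entry) is
 * true again and the entry can safely be re-added later. */
static void list_del_init(struct list_head *entry)
{
	list_del(entry);
	INIT_LIST_HEAD(entry);
}

int main(void)
{
	struct list_head send_list, entry;

	INIT_LIST_HEAD(&send_list);
	INIT_LIST_HEAD(&entry);

	list_add_tail(&entry, &send_list);
	list_del(&entry);
	printf("after list_del:      list_empty=%d\n", list_empty(&entry)); /* 0 */

	list_add_tail(&entry, &send_list);
	list_del_init(&entry);
	printf("after list_del_init: list_empty=%d\n", list_empty(&entry)); /* 1 */
	return 0;
}
--
If that is what is happening, the !list_empty(&req->entry) condition Kamaljit
decoded above is exactly what a plain list_del() would leave behind.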
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-22 11:43 ` Sagi Grimberg
@ 2025-04-23 17:26 ` Kamaljit Singh
0 siblings, 0 replies; 48+ messages in thread
From: Kamaljit Singh @ 2025-04-23 17:26 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Sagi, Hannes,
>>> Can you please share the output of gdb:
>>> l *nvme_tcp_recv_skb+0x115e
>> It points to the 3rd WARN_ON msg. I've replaced those warnings with the patch you suggested below.
>>
>> (gdb) l *(nvme_tcp_recv_skb+0x115e)
>> 0x628e is in nvme_tcp_recv_skb (tcp.c:782).
>> 777
>> 778 nvme_tcp_setup_h2c_data_pdu(req);
>> 779
>> 780 WARN_ON(queue->request == req);
>> 781 WARN_ON(llist_on_list(&req->lentry));
>> 782 WARN_ON(!list_empty(&req->entry));
>> 783 llist_add(&req->lentry, &queue->req_list);
>> 784 if (list_empty(&queue->send_list))
>> 785 queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>
>That is strange - should not happen afaict. You did not change the code
>and recompile before checking this, correct?
No, no other code changes were done before checking above.
>>> I think this patch needs the following:
>>> --
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index f2c829f1aff6..c2d4893ecb94 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -767,8 +767,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue
>>> *queue,
>>> return -EPROTO;
>>> }
>>>
>>> - if (queue->request == req ||
>>> - llist_on_list(&req->lentry) ||
>>> + if (llist_on_list(&req->lentry) ||
>>> !list_empty(&req->entry)) {
>>> dev_err(queue->ctrl->ctrl.device,
>>> "req %d unexpected r2t while processing request\n",
>>> --
>>>
>>> Or in your case - remove the above:
>>> --
>>> WARN_ON(queue->request == req);
>>> --
>> Made that change. Tested and seeing this. -71 is -EPROTO that patch-2 returns .
>> Trying to narrow it down further. If you have any suggestion, let me know.
>>
>> [Mon Apr 21 11:18:58 2025] nvme nvme0: req 7 unexpected r2t while processing request
>> [Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
>> [Mon Apr 21 11:18:58 2025] nvme nvme0: starting error recovery
>> [Mon Apr 21 11:18:58 2025] nvme nvme0: req 11 unexpected r2t while processing request
>> [Mon Apr 21 11:18:58 2025] block nvme0n1: no usable path - requeuing I/O
>> [Mon Apr 21 11:18:58 2025] nvme nvme0: receive failed: -71
The last patch had failed in under a minute. I had removed all WARN_ONs by this time.
>>>> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
>>> My assumption is that this is a result of
>>> WARN_ON(queue->request == req);
>> From the gdb decode it points to WARN_ON(!list_empty(&req->entry))
>
>Can you please apply this on top:
>--
>diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>index 9453cf453d03..334154a58bdb 100644
>--- a/drivers/nvme/host/tcp.c
>+++ b/drivers/nvme/host/tcp.c
>@@ -453,7 +453,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
> return NULL;
> }
>
>- list_del(&req->entry);
>+ list_del_init(&req->entry);
> init_llist_node(&req->lentry);
> return req;
> }
>--
list_del_init did the trick. Test ran successfully for >18 hours, while the
original failure took <6 hours. Can one of you please merge this in?
Which branch will this be targeted for?
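For reference, the llist side of the same fix relies on the analogous
convention: init_llist_node() marks a node as off-list by pointing it at
itself, and llist_on_list() tests exactly that, which is why the patch adds
init_llist_node(&req->lentry) in nvme_tcp_fetch_request() - after a request
has travelled through req_list its lentry would otherwise still look queued
to the unexpected-R2T check. The sketch below is a simplified, single-threaded
user-space model of that convention (the real helpers in include/linux/llist.h
are lock-free); consume_first() and the struct layouts are hypothetical
stand-ins for illustration only:
--
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified model of the llist "off-list" marker: a self-pointer means
 * "not on any list". */
struct llist_node {
	struct llist_node *next;
};

struct llist_head {
	struct llist_node *first;
};

static void init_llist_node(struct llist_node *n)
{
	n->next = n;
}

static bool llist_on_list(const struct llist_node *n)
{
	return n->next != n;
}

static void llist_add(struct llist_node *n, struct llist_head *h)
{
	n->next = h->first;	/* no atomics in this model */
	h->first = n;
}

/* Take the first node off the list; without the re-init the node keeps a
 * stale next pointer and still looks as if it were queued. */
static struct llist_node *consume_first(struct llist_head *h, bool reinit)
{
	struct llist_node *n = h->first;

	if (!n)
		return NULL;
	h->first = NULL;
	if (reinit)
		init_llist_node(n);	/* what the fix adds */
	return n;
}

int main(void)
{
	struct llist_head req_list = { NULL };
	struct llist_node lentry;

	init_llist_node(&lentry);

	llist_add(&lentry, &req_list);
	consume_first(&req_list, false);
	printf("without re-init: llist_on_list=%d\n", llist_on_list(&lentry)); /* 1 */

	llist_add(&lentry, &req_list);
	consume_first(&req_list, true);
	printf("with re-init:    llist_on_list=%d\n", llist_on_list(&lentry)); /* 0 */
	return 0;
}
--
Together with list_del_init() on the entry side, both "is this request already
queued?" checks stay meaningful when an R2T arrives.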
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-16 22:09 ` Sagi Grimberg
2025-04-17 23:03 ` Kamaljit Singh
@ 2025-04-24 11:26 ` Hannes Reinecke
2025-04-25 21:55 ` Sagi Grimberg
1 sibling, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-04-24 11:26 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 4/17/25 00:09, Sagi Grimberg wrote:
>
>
> On 16/04/2025 10:56, Hannes Reinecke wrote:
>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>
>>>
>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>> When the socket is busy processing nvme_tcp_try_recv() might
>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>> the sending side is blocked, too.
>>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>>> I/O stalls.
>>>>>
>>>>> Acked-by: Chris Leech <cleech@redhat.com>
>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>>> ---
>>>>> drivers/nvme/host/tcp.c | 5 ++++-
>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>>>> work_struct *w)
>>>>> result = nvme_tcp_try_recv(queue);
>>>>> if (result > 0)
>>>>> pending = true;
>>>>> - else if (unlikely(result < 0))
>>>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>>>> return;
>>>>> + if (nvme_tcp_queue_has_pending(queue))
>>>>> + pending = true;
>>>>> +
>>>>> if (!pending || !queue->rd_enabled)
>>>>> return;
>>>>
>>>> The various 'try_send' function will return -EAGAIN for a partial send.
>>>> But it doesn't indicate a blocked Tx, rather we should retry directly.
>>>> Hence this check.
>>>>
>>>> Unless you tell me differently and even a partial send will cause
>>>> ->write_space() to be invoked, then we wouldn't _need_ it.
>>>
>>> Umm, that is my understanding. If you tried to send X and were able to
>>> send Y where Y < X, you shouldn't have to keep trying in a busy loop,
>>> the stack should tell you when you can send again.
>>>
>> Okay, I could try that.
>>
>>>> It would
>>>> still be an optimisation as we're saving the round-trip via socket
>>>> callbacks.
>>>
>>> But you are doing a busy loop on a socket that cannot accept new
>>> data, there are other sockets that the kthread can be working on.
>>>
>> But we might be _sending_ data, right?
>
> I'd say that odds are you're not in the next attempt...
>
>>
>>>>
>>>> We could aim for a different error here, to differentiate between a
>>>> 'real' EAGAIN and a partial send.
>>>> Whatever you prefer.
>>>
>>> I still don't understand why a partial send warrants a busy loop call
>>> to sock_sendmsg...
>>>
>> This is, it's not just sendmsg. It's the combination of send and recv.
>
> I agree with you that the recv fix is correct.
>
>> In my tests I have seen sendmsg return with a partial/incomplete status,
>> consequently recvmsg has nothing to receive, and io_work stops.
>> And that is what the patch fixes.
>
> The question is if the recv fix is sufficient?
>
>>
>> I haven't checked the ->write_space() callback (which should've been
>> triggered), but my feeling is that the ->write_space() callback
>> hit when we were still busy processing, so the queue_work() went
>> nowhere.
>
> That makes sense.
>
>>
>> Maybe we can fix it with setting a flag when ->write_space()
>> triggers ('hey, more data to process'), to be evaluated during
>> the io_work() loop. But that would be pretty close to the
>> check 'nvme_tcp_queue_has_pending', so I'm not sure if we
>> gain anything.
>
> How about checking sk_stream_is_writeable() instead?
> I think we also need to flag the queue for the write_space during IO
> work...
>
>>
>> And I really haven't seen any detrimental performance effects
>> with this patch; in the contrary, performance was on par,
>> and if anything standard deviation went down.
>
> Still you agree that busy looping a socket that may not have space is
> less efficient?
> When there are other queues waiting to execute io_work?
>
> I agree there is a problem here, but I am trying to figure out what we
> want to do here.
>
> How about these two (untested) patches:
> [1 based on your recv-side fix]:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 72d260201d8c..4eb9a2dec07e 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue
> *queue)
> queue->nr_cqe = 0;
> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
> release_sock(sk);
> - return consumed;
> + return consumed == -EAGAIN ? 0 : consumed;
> }
>
> static void nvme_tcp_io_work(struct work_struct *w)
> --
>
> [2 based on your partial write fix]:
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4eb9a2dec07e..daf59e75cf15 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
> NVME_TCP_Q_LIVE = 1,
> NVME_TCP_Q_POLLING = 2,
> NVME_TCP_Q_IO_CPU_SET = 3,
> + NVME_TCP_Q_WAKE_SENDER = 4,
> };
>
> enum nvme_tcp_recv_state {
> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
> queue = sk->sk_user_data;
> if (likely(queue && sk_stream_is_writeable(sk))) {
> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> + set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue-
> >io_work);
> }
> read_unlock_bh(&sk->sk_callback_lock);
> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
> container_of(w, struct nvme_tcp_queue, io_work);
> unsigned long deadline = jiffies + msecs_to_jiffies(1);
>
> + clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
> do {
> bool pending = false;
> int result;
> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
> else if (unlikely(result < 0))
> return;
>
> - if (!pending || !queue->rd_enabled)
> + /* did we get some space after spending time in recv ? */
> + if (nvme_tcp_queue_has_pending(queue) &&
> + sk_stream_is_writeable(queue->sock->sk))
> + pending = true;
> +
> + if (!queue->rd_enabled)
> + return;
> +
> + if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER,
> &queue->flags))
> return;
>
> } while (!time_after(jiffies, deadline)); /* quota is exhausted */
> --
>
While sifting through the patches, trying to come up with
a next version:
Why do you check for Q_WAKE_SENDER after checking for
sk_stream_is_writeable()?
Surely sk_stream_is_writeable() is a necessary condition for
nvme_tcp_try_send(), no?
Maybe it's a better idea to check sk_stream_is_writeable() just
before nvme_tcp_try_send()...
Let me see how things pan out.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-24 11:26 ` Hannes Reinecke
@ 2025-04-25 21:55 ` Sagi Grimberg
2025-04-25 22:09 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-25 21:55 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 24/04/2025 14:26, Hannes Reinecke wrote:
> On 4/17/25 00:09, Sagi Grimberg wrote:
>>
>>
>> On 16/04/2025 10:56, Hannes Reinecke wrote:
>>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>>> When the socket is busy processing nvme_tcp_try_recv() might
>>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>>> the sending side is blocked, too.
>>>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>>>> I/O stalls.
>>>>>>
>>>>>> Acked-by: Chris Leech <cleech@redhat.com>
>>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>>>> ---
>>>>>> drivers/nvme/host/tcp.c | 5 ++++-
>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>>>>> work_struct *w)
>>>>>> result = nvme_tcp_try_recv(queue);
>>>>>> if (result > 0)
>>>>>> pending = true;
>>>>>> - else if (unlikely(result < 0))
>>>>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>>>>> return;
>>>>>> + if (nvme_tcp_queue_has_pending(queue))
>>>>>> + pending = true;
>>>>>> +
>>>>>> if (!pending || !queue->rd_enabled)
>>>>>> return;
>>>>>
>>>>> The various 'try_send' function will return -EAGAIN for a partial
>>>>> send.
>>>>> But it doesn't indicate a blocked Tx, rather we should retry
>>>>> directly.
>>>>> Hence this check.
>>>>>
>>>>> Unless you tell me differently and even a partial send will cause
>>>>> ->write_space() to be invoked, then we wouldn't _need_ it.
>>>>
>>>> Umm, that is my understanding. If you tried to send X and were able to
>>>> send Y where Y < X, you shouldn't have to keep trying in a busy
>>>> loop, the stack should tell you when you can send again.
>>>>
>>> Okay, I could try that.
>>>
>>>>> It would
>>>>> still be an optimisation as we're saving the round-trip via socket
>>>>> callbacks.
>>>>
>>>> But you are doing a busy loop on a socket that cannot accept new
>>>> data, there are other sockets that the kthread can be working on.
>>>>
>>> But we might be _sending_ data, right?
>>
>> I'd say that odds are you're not in the next attempt...
>>
>>>
>>>>>
>>>>> We could aim for a different error here, to differentiate between a
>>>>> 'real' EAGAIN and a partial send.
>>>>> Whatever you prefer.
>>>>
>>>> I still don't understand why a partial send warrants a busy loop
>>>> call to sock_sendmsg...
>>>>
>>> This is, it's not just sendmsg. It's the combination of send and recv.
>>
>> I agree with you that the recv fix is correct.
>>
>>> In my tests I have seen sendmsg return with a partial/incomplete
>>> status,
>>> consequently recvmsg has nothing to receive, and io_work stops.
>>> And that is what the patch fixes.
>>
>> The question is if the recv fix is sufficient?
>>
>>>
>>> I haven't checked the ->write_space() callback (which should've been
>>> triggered), but my feeling is that the ->write_space() callback
>>> hit when we were still busy processing, so the queue_work() went
>>> nowhere.
>>
>> That makes sense.
>>
>>>
>>> Maybe we can fix it with setting a flag when ->write_space()
>>> triggers ('hey, more data to process'), to be evaluated during
>>> the io_work() loop. But that would be pretty close to the
>>> check 'nvme_tcp_queue_has_pending', so I'm not sure if we
>>> gain anything.
>>
>> How about checking sk_stream_is_writeable() instead?
>> I think we also need to flag the queue for the write_space during IO
>> work...
>>
>>>
>>> And I really haven't seen any detrimental performance effects
>>> with this patch; in the contrary, performance was on par,
>>> and if anything standard deviation went down.
>>
>> Still you agree that busy looping a socket that may not have space is
>> less efficient?
>> When there are other queues waiting to execute io_work?
>>
>> I agree there is a problem here, but I am trying to figure out what
>> we want to do here.
>>
>> How about these two (untested) patches:
>> [1 based on your recv-side fix]:
>> --diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 72d260201d8c..4eb9a2dec07e 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct
>> nvme_tcp_queue *queue)
>> queue->nr_cqe = 0;
>> consumed = sock->ops->read_sock(sk, &rd_desc,
>> nvme_tcp_recv_skb);
>> release_sock(sk);
>> - return consumed;
>> + return consumed == -EAGAIN ? 0 : consumed;
>> }
>>
>> static void nvme_tcp_io_work(struct work_struct *w)
>> --
>>
>> [2 based on your partial write fix]:
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 4eb9a2dec07e..daf59e75cf15 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>> NVME_TCP_Q_LIVE = 1,
>> NVME_TCP_Q_POLLING = 2,
>> NVME_TCP_Q_IO_CPU_SET = 3,
>> + NVME_TCP_Q_WAKE_SENDER = 4,
>> };
>>
>> enum nvme_tcp_recv_state {
>> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>> queue = sk->sk_user_data;
>> if (likely(queue && sk_stream_is_writeable(sk))) {
>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>> + set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue-
>> >io_work);
>> }
>> read_unlock_bh(&sk->sk_callback_lock);
>> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct
>> *w)
>> container_of(w, struct nvme_tcp_queue, io_work);
>> unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>
>> + clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>> do {
>> bool pending = false;
>> int result;
>> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct
>> work_struct *w)
>> else if (unlikely(result < 0))
>> return;
>>
>> - if (!pending || !queue->rd_enabled)
>> + /* did we get some space after spending time in recv
>> ? */
>> + if (nvme_tcp_queue_has_pending(queue) &&
>> + sk_stream_is_writeable(queue->sock->sk))
>> + pending = true;
>> +
>> + if (!queue->rd_enabled)
>> + return;
>> +
>> + if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER,
>> &queue->flags))
>> return;
>>
>> } while (!time_after(jiffies, deadline)); /* quota is
>> exhausted */
>> --
>>
> While trying to sift through the patches trying to come up with
> a next version:
> Why do you check for Q_WAKE_SENDER after checking for
> sk_stream_is_writable?
> Surely sk_stream_is_writable() is a necessary condition for
> nvme_tcp_try_send(), no?
Right after the sk_stream_is_writeable() check, new room could have opened
up in the socket send buffer, and state_change() could have fired and queued
io_work, all before we get to the pending check. This is the "lost event"
problem.
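For reference, sk_stream_is_writeable() boils down to a comparison of free
send-buffer space against half of what is already queued. The following is
paraphrased from include/net/sock.h, so the exact form may differ between
kernel versions:

/* Paraphrased from include/net/sock.h; verify against your tree. */
static inline int sk_stream_min_wspace(const struct sock *sk)
{
	return READ_ONCE(sk->sk_wmem_queued) >> 1;
}

static inline int sk_stream_wspace(const struct sock *sk)
{
	return READ_ONCE(sk->sk_sndbuf) - READ_ONCE(sk->sk_wmem_queued);
}

static inline bool sk_stream_is_writeable(const struct sock *sk)
{
	/* "writeable" means the free space in the send buffer is at least
	 * half of what is currently queued for transmission. */
	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
}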
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-25 21:55 ` Sagi Grimberg
@ 2025-04-25 22:09 ` Sagi Grimberg
2025-04-25 23:47 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-25 22:09 UTC (permalink / raw)
To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
Cc: Keith Busch, linux-nvme, Chris Leech
On 26/04/2025 0:55, Sagi Grimberg wrote:
>
>
> On 24/04/2025 14:26, Hannes Reinecke wrote:
>> On 4/17/25 00:09, Sagi Grimberg wrote:
>>>
>>>
>>> On 16/04/2025 10:56, Hannes Reinecke wrote:
>>>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>>>
>>>>>
>>>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>>>> When the socket is busy processing nvme_tcp_try_recv() might
>>>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>>>> the sending side is blocked, too.
>>>>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>>>>> I/O stalls.
>>>>>>>
>>>>>>> Acked-by: Chris Leech <cleech@redhat.com>
>>>>>>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>>>>>> ---
>>>>>>> drivers/nvme/host/tcp.c | 5 ++++-
>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>>>>>> work_struct *w)
>>>>>>> result = nvme_tcp_try_recv(queue);
>>>>>>> if (result > 0)
>>>>>>> pending = true;
>>>>>>> - else if (unlikely(result < 0))
>>>>>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>>>>>> return;
>>>>>>> + if (nvme_tcp_queue_has_pending(queue))
>>>>>>> + pending = true;
>>>>>>> +
>>>>>>> if (!pending || !queue->rd_enabled)
>>>>>>> return;
>>>>>>
>>>>>> The various 'try_send' function will return -EAGAIN for a partial
>>>>>> send.
>>>>>> But it doesn't indicate a blocked Tx, rather we should retry
>>>>>> directly.
>>>>>> Hence this check.
>>>>>>
>>>>>> Unless you tell me differently and even a partial send will cause
>>>>>> ->write_space() to be invoked, then we wouldn't _need_ it.
>>>>>
>>>>> Umm, that is my understanding. If you tried to send X and were
>>>>> able to
>>>>> send Y where Y < X, you shouldn't have to keep trying in a busy
>>>>> loop, the stack should tell you when you can send again.
>>>>>
>>>> Okay, I could try that.
>>>>
>>>>>> It would
>>>>>> still be an optimisation as we're saving the round-trip via socket
>>>>>> callbacks.
>>>>>
>>>>> But you are doing a busy loop on a socket that cannot accept new
>>>>> data, there are other sockets that the kthread can be working on.
>>>>>
>>>> But we might be _sending_ data, right?
>>>
>>> I'd say that odds are you're not in the next attempt...
>>>
>>>>
>>>>>>
>>>>>> We could aim for a different error here, to differentiate between a
>>>>>> 'real' EAGAIN and a partial send.
>>>>>> Whatever you prefer.
>>>>>
>>>>> I still don't understand why a partial send warrants a busy loop
>>>>> call to sock_sendmsg...
>>>>>
>>>> This is, it's not just sendmsg. It's the combination of send and recv.
>>>
>>> I agree with you that the recv fix is correct.
>>>
>>>> In my tests I have seen sendmsg return with a partial/incomplete
>>>> status,
>>>> consequently recvmsg has nothing to receive, and io_work stops.
>>>> And that is what the patch fixes.
>>>
>>> The question is if the recv fix is sufficient?
>>>
>>>>
>>>> I haven't checked the ->write_space() callback (which should've
>>>> been triggered), but my feeling is that the ->write_space() callback
>>>> hit when we were still busy processing, so the queue_work() went
>>>> nowhere.
>>>
>>> That makes sense.
>>>
>>>>
>>>> Maybe we can fix it with setting a flag when ->write_space()
>>>> triggers ('hey, more data to process'), to be evaluated during
>>>> the io_work() loop. But that would be pretty close to the
>>>> check 'nvme_tcp_queue_has_pending', so I'm not sure if we
>>>> gain anything.
>>>
>>> How about checking sk_stream_is_writeable() instead?
>>> I think we also need to flag the queue for the write_space during IO
>>> work...
>>>
>>>>
>>>> And I really haven't seen any detrimental performance effects
>>>> with this patch; in the contrary, performance was on par,
>>>> and if anything standard deviation went down.
>>>
>>> Still you agree that busy looping a socket that may not have space
>>> is less efficient?
>>> When there are other queues waiting to execute io_work?
>>>
>>> I agree there is a problem here, but I am trying to figure out what
>>> we want to do here.
>>>
>>> How about these two (untested) patches:
>>> [1 based on your recv-side fix]:
>>> --diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 72d260201d8c..4eb9a2dec07e 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct
>>> nvme_tcp_queue *queue)
>>> queue->nr_cqe = 0;
>>> consumed = sock->ops->read_sock(sk, &rd_desc,
>>> nvme_tcp_recv_skb);
>>> release_sock(sk);
>>> - return consumed;
>>> + return consumed == -EAGAIN ? 0 : consumed;
>>> }
>>>
>>> static void nvme_tcp_io_work(struct work_struct *w)
>>> --
>>>
>>> [2 based on your partial write fix]:
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 4eb9a2dec07e..daf59e75cf15 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>>> NVME_TCP_Q_LIVE = 1,
>>> NVME_TCP_Q_POLLING = 2,
>>> NVME_TCP_Q_IO_CPU_SET = 3,
>>> + NVME_TCP_Q_WAKE_SENDER = 4,
>>> };
>>>
>>> enum nvme_tcp_recv_state {
>>> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>> queue = sk->sk_user_data;
>>> if (likely(queue && sk_stream_is_writeable(sk))) {
>>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>> + set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue-
>>> >io_work);
>>> }
>>> read_unlock_bh(&sk->sk_callback_lock);
>>> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct
>>> work_struct *w)
>>> container_of(w, struct nvme_tcp_queue, io_work);
>>> unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>
>>> + clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>> do {
>>> bool pending = false;
>>> int result;
>>> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct
>>> work_struct *w)
>>> else if (unlikely(result < 0))
>>> return;
>>>
>>> - if (!pending || !queue->rd_enabled)
>>> + /* did we get some space after spending time in recv
>>> ? */
>>> + if (nvme_tcp_queue_has_pending(queue) &&
>>> + sk_stream_is_writeable(queue->sock->sk))
>>> + pending = true;
>>> +
>>> + if (!queue->rd_enabled)
>>> + return;
>>> +
>>> + if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER,
>>> &queue->flags))
>>> return;
>>>
>>> } while (!time_after(jiffies, deadline)); /* quota is
>>> exhausted */
>>> --
>>>
>> While trying to sift through the patches trying to come up with
>> a next version:
>> Why do you check for Q_WAKE_SENDER after checking for
>> sk_stream_is_writable?
>> Surely sk_stream_is_writable() is a necessary condition for
>> nvme_tcp_try_send(), no?
>
> Right after the check sk_stream_is_writeable(), there could have new
> room in the socket
> send buffer, state_change() fired, and queued io_work, before we get
> to the pending
> check. This is the "lost event" problem.
Actually, this is incorrect. A write_space() -> queue_work(io_work) issued
while io_work is currently running will simply re-queue it. Hence I'm not
sure that this flag is needed.
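To make the property being relied on here concrete, below is a minimal,
self-contained module sketch (illustration only, not driver code; all names
are made up). It shows that queuing a work item while its callback is still
executing makes the callback run again afterwards, so an event arriving
mid-execution is not lost:

/* Sketch: PENDING is cleared before the callback is invoked, so a
 * queue_work()/schedule_work() issued while the callback runs (here from
 * inside the callback itself; in the driver it would come from
 * ->write_space() on another CPU) succeeds, and the work runs again after
 * the current pass returns. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static struct work_struct demo_work;
static int demo_runs;

static void demo_fn(struct work_struct *work)
{
	demo_runs++;
	pr_info("demo_fn: run %d\n", demo_runs);
	if (demo_runs == 1) {
		/* Re-queue while still executing: guaranteed to trigger a
		 * second execution once this one returns. */
		schedule_work(&demo_work);
		msleep(10);
	}
}

static int __init demo_init(void)
{
	INIT_WORK(&demo_work, demo_fn);
	schedule_work(&demo_work);
	return 0;
}

static void __exit demo_exit(void)
{
	flush_work(&demo_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");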
Kamaljit, can you run with the below and report if it fixes your problem?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4d20fcc0a230..835e29014841 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1390,6 +1390,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
else if (unlikely(result < 0))
return;
+ /* did we get some space after spending time in recv? */
+ if (nvme_tcp_queue_has_pending(queue) &&
+ sk_stream_is_writeable(queue->sock->sk))
+ pending = true;
+
if (!pending || !queue->rd_enabled)
return;
--
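For readers following along, the helper used in the hunk above is roughly
the following in the tree this discussion targets (paraphrased; check
drivers/nvme/host/tcp.c for the authoritative version):

/* Roughly what nvme_tcp_queue_has_pending() checks: the queue has requests
 * waiting to be sent, either still on the lockless req_list or already
 * moved onto send_list. */
static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
{
	return !list_empty(&queue->send_list) ||
	       !llist_empty(&queue->req_list);
}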
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-25 22:09 ` Sagi Grimberg
@ 2025-04-25 23:47 ` Kamaljit Singh
2025-04-27 10:35 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-04-25 23:47 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Sagi,
>>
>> Right after the check sk_stream_is_writeable(), there could have new
>> room in the socket
>> send buffer, state_change() fired, and queued io_work, before we get
>> to the pending
>> check. This is the "lost event" problem.
>
>Actually, this is incorrect. write_space() -> queue_work(io_work) will
>be re-queued if io_work
>is currently running. Hence I'm not sure that this flag is needed.
>
>Kamaljit, can you run with the below and report if it fixes your problem?
With the last set of fixes that included the list_del_init() change you
suggested, all tests have been stable. I'm not seeing any more issues. Are you asking to
test this patch regardless? If so, does this need to be in addition to
previous patches?
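For context, the list_del_init() mentioned above is the variant that leaves
the entry re-initialised after removal, which is what makes later emptiness
checks on the request safe. Paraphrased from include/linux/list.h:

/* Unlike plain list_del(), the entry is re-initialised after removal, so
 * list_empty(&entry) is true afterwards and the entry can safely be tested
 * or re-added later. */
static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);
	INIT_LIST_HEAD(entry);
}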
>--
>diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>index 4d20fcc0a230..835e29014841 100644
>--- a/drivers/nvme/host/tcp.c
>+++ b/drivers/nvme/host/tcp.c
>@@ -1390,6 +1390,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
> else if (unlikely(result < 0))
> return;
>
>+ /* did we get some space after spending time in recv? */
>+ if (nvme_tcp_queue_has_pending(queue) &&
>+ sk_stream_is_writeable(queue->sock->sk))
>+ pending = true;
>+
> if (!pending || !queue->rd_enabled)
> return;
>--
Thanks,
Kamaljit
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-04-25 23:47 ` Kamaljit Singh
@ 2025-04-27 10:35 ` Sagi Grimberg
[not found] ` <BY5PR04MB6849D4BF87755B82A073BBDABC89A@BY5PR04MB6849.namprd04.prod.outlook.com>
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-04-27 10:35 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 26/04/2025 2:47, Kamaljit Singh wrote:
> Sagi,
>
>>> Right after the check sk_stream_is_writeable(), there could have new
>>> room in the socket
>>> send buffer, state_change() fired, and queued io_work, before we get
>>> to the pending
>>> check. This is the "lost event" problem.
>> Actually, this is incorrect. write_space() -> queue_work(io_work) will
>> be re-queued if io_work
>> is currently running. Hence I'm not sure that this flag is needed.
>>
>> Kamaljit, can you run with the below and report if it fixes your problem?
> With the last set of fixes that included list_del_init() you suggested, all
> tests have been stable. I'm not seeing any more issues. Are you asking to
> test this patch regardless? If so, does this need to be in addition to
> previous patches?
Instead of my patch that introduces WAKE_SENDER, apply just the change below
that looks at sk_stream_is_writeable() only. I no longer think that the
other changes in that patch solve anything.
>
>
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 4d20fcc0a230..835e29014841 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1390,6 +1390,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> else if (unlikely(result < 0))
>> return;
>>
>> + /* did we get some space after spending time in recv? */
>> + if (nvme_tcp_queue_has_pending(queue) &&
>> + sk_stream_is_writeable(queue->sock->sk))
>> + pending = true;
>> +
>> if (!pending || !queue->rd_enabled)
>> return;
>> --
> Thanks,
> Kamaljit
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
[not found] ` <BY5PR04MB6849D4BF87755B82A073BBDABC89A@BY5PR04MB6849.namprd04.prod.outlook.com>
@ 2025-05-06 18:05 ` Hannes Reinecke
2025-05-07 22:26 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-05-06 18:05 UTC (permalink / raw)
To: Kamaljit Singh, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Kamaljit,
On 5/6/25 19:40, Kamaljit Singh wrote:
> Hi Sagi,
>
> >On 26/04/2025 2:47, Kamaljit Singh wrote:
> >> Sagi,
> >>
> >>>> Right after the check sk_stream_is_writeable(), there could have new
> >>>> room in the socket
> >>>> send buffer, state_change() fired, and queued io_work, before we get
> >>>> to the pending
> >>>> check. This is the "lost event" problem.
> >>> Actually, this is incorrect. write_space() -> queue_work(io_work) will
> >>> be re-queued if io_work
> >>> is currently running. Hence I'm not sure that this flag is needed.
> >>>
> >>> Kamaljit, can you run with the below and report if it fixes your
> problem?
> >> With the last set of fixes that included list_del_init() you
> suggested, all
> >> tests have been stable. I'm not seeing any more issues. Are you
> asking to
> >> test this patch regardless? If so, does this need to be in addition to
> >> previous patches?
> >
> >Instead of my patch that introduces WAKE_SENDER - apply just the change
> >below
> >that looks at sk_stream_is_writeable() only. I no longer think that the
> >other changes in that patch solve anything.
> Sorry, I was away on PTO. Working now to get the below patch tested.
>
Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
I/O stall on congested sockets' _only_?
(on top of the latest nvme-6.16, of course).
I think I _should_ have included all the suggestions floating around here,
but we need confirmation.
Thanks a bunch.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-06 18:05 ` Hannes Reinecke
@ 2025-05-07 22:26 ` Kamaljit Singh
2025-05-08 21:23 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-05-07 22:26 UTC (permalink / raw)
To: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Hannes,
On 5/6/25 11:05, Hannes Reinecke wrote:
>Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>I/O stall on congested sockets' _only_ ?
>(on top of nvme-6.16 latest, of course).
>I think I _should_ have included all the suggestions floating here,
>but we need to have confirmation.
I've built the kernel against the latest of the nvme-6.16 branch along with
these patches. It's in test now. Will let you know when we get some
results back.
0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets - Apr 29, 01:18
Hannes Reinecke (2):
nvme-tcp: sanitize request list handling
nvme-tcp: fix I/O stalls on congested sockets
1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr 29, 01:18
Meanwhile, quick question: is there a known issue with cloning from this
repo? I'm seeing a super slow response, which eventually times out. I had to
try multiple times over several hours.
$ git clone git://git.infradead.org/nvme.git linux-nvme-6.16
Cloning into 'linux-nvme-6.16'...
remote: Enumerating objects: 10795230, done.
remote: Counting objects: 100% (333577/333577), done.
remote: Compressing objects: 100% (99866/99866), done.
fatal: read error: Operation timed out 76.01 KiB | 1024 bytes/s
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
Even a git pull --rebase into an older clone was failing to refresh
remotes & show the nvme-6.16 branch.
Thanks,
Kamaljit Singh
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-07 22:26 ` Kamaljit Singh
@ 2025-05-08 21:23 ` Kamaljit Singh
2025-05-09 6:52 ` Hannes Reinecke
2025-05-11 9:11 ` Sagi Grimberg
0 siblings, 2 replies; 48+ messages in thread
From: Kamaljit Singh @ 2025-05-08 21:23 UTC (permalink / raw)
To: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Hannes, Sagi,
On 5/7/25 15:26, Kamaljit Singh wrote:
>>Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>>I/O stall on congested sockets' _only_ ?
>>(on top of nvme-6.16 latest, of course).
>>I think I _should_ have included all the suggestions floating here,
>>but we need to have confirmation.
>
>I've built the kernel against the latest of nvme-6.16 branch along with
>these patches. Its with test now. Will let you know when we get some
>results back.
>
>0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets - Apr 29, 01:18
> Hannes Reinecke (2):
> nvme-tcp: sanitize request list handling
> nvme-tcp: fix I/O stalls on congested sockets
>
>1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
>2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr 29, 01:18
IO timeouts are still occurring with Writes. The only Read that timed
out was most likely due to the path error. It takes ~4.5 hours to fail.
However, this test does not fail if either ECN is off or if digests
are not enabled. These passing combinations were run for 16+ hours
without any issues. Both ECN and Header+Data Digests need to be turned
on for it to fail.
Do you have a failing test as well? If so, is it quicker to cause the
failure? Would you mind sharing any details?
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4 opcode 0x2 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
[2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
[2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
In the current build I had these patches on top of the "nvme-6.16" branch:
41b2c90a51bd nvme-tcp: sanitize request list handling
9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
Thanks,
Kamaljit Singh
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-08 21:23 ` Kamaljit Singh
@ 2025-05-09 6:52 ` Hannes Reinecke
2025-05-09 8:58 ` Hannes Reinecke
2025-05-11 9:12 ` Sagi Grimberg
2025-05-11 9:11 ` Sagi Grimberg
1 sibling, 2 replies; 48+ messages in thread
From: Hannes Reinecke @ 2025-05-09 6:52 UTC (permalink / raw)
To: Kamaljit Singh, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 5/8/25 23:23, Kamaljit Singh wrote:
> Hi Hannes, Sagi,
>
> On 5/7/25 15:26, Kamaljit Singh wrote:
>>> Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>>> I/O stall on congested sockets' _only_ ?
>>> (on top of nvme-6.16 latest, of course).
>>> I think I _should_ have included all the suggestions floating here,
>>> but we need to have confirmation.
>>
>> I've built the kernel against the latest of nvme-6.16 branch along with
>> these patches. Its with test now. Will let you know when we get some
>> results back.
>>
>> 0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets - Apr 29, 01:18
>> Hannes Reinecke (2):
>> nvme-tcp: sanitize request list handling
>> nvme-tcp: fix I/O stalls on congested sockets
>>
>> 1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
>> 2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr 29, 01:18
> IO timeouts are still occurring with Writes. The only Read that timed
> out was most likely due to the path error. It takes ~4.5 hours to fail.
>
> However, this test does not fail if either ECN is off or if digests
> are not enabled. These passing combinations were run for 16+ hours
> without any issues. Both ECN and Header+Data Digests need to be turned
> on for it to fail.
>
> Do you have a failing test as well? If so, is it quicker to cause the
> failure? Would you mind sharing any details?
>
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4 opcode 0x2 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>
> In the current build I had these patches on top of the "nvme-6.16" branch:
> 41b2c90a51bd nvme-tcp: sanitize request list handling
> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>
Extremely wild guess: Can you try with this patch on top?
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0e178115dc04..cdb8ea4eb467 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1277,10 +1277,7 @@ static int nvme_tcp_try_send_ddgst(struct
nvme_tcp_request *req)
.iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
};
- if (nvme_tcp_queue_more(queue))
- msg.msg_flags |= MSG_MORE;
- else
- msg.msg_flags |= MSG_EOR;
+ msg.msg_flags |= MSG_EOR;
ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
if (unlikely(ret <= 0))
It _could_ be that we're waiting in sendmsg() due to MSG_MORE, causing
these I/O timeouts as processing doesn't continue.
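For readers not steeped in the flags involved: MSG_MORE is only a coalescing
hint (roughly a per-call TCP_CORK) telling the stack that more data follows,
while MSG_EOR on a TCP socket marks the end of the record so the tail is not
merged with later sends. A minimal userspace sketch of the intended two-part
usage, illustration only with made-up fd/buffer names, not the driver path:

#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper; fd, hdr and data are placeholders. Error handling
 * omitted for brevity. */
static void send_two_part_pdu(int fd, const void *hdr, size_t hlen,
			      const void *data, size_t dlen)
{
	/* Hint that more data follows: the stack may hold this back and
	 * coalesce it with the next send. */
	(void)send(fd, hdr, hlen, MSG_MORE);

	/* Final piece: no MSG_MORE, so the coalesced segment is pushed out
	 * without waiting for further writes. */
	(void)send(fd, data, dlen, 0);
}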
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-09 6:52 ` Hannes Reinecke
@ 2025-05-09 8:58 ` Hannes Reinecke
2025-05-11 9:12 ` Sagi Grimberg
1 sibling, 0 replies; 48+ messages in thread
From: Hannes Reinecke @ 2025-05-09 8:58 UTC (permalink / raw)
To: Kamaljit Singh, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 5/9/25 08:52, Hannes Reinecke wrote:
> On 5/8/25 23:23, Kamaljit Singh wrote:
>> Hi Hannes, Sagi,
>>
>> On 5/7/25 15:26, Kamaljit Singh wrote:
>>>> Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>>>> I/O stall on congested sockets' _only_ ?
>>>> (on top of nvme-6.16 latest, of course).
>>>> I think I _should_ have included all the suggestions floating here,
>>>> but we need to have confirmation.
>>>
>>> I've built the kernel against the latest of nvme-6.16 branch along with
>>> these patches. Its with test now. Will let you know when we get some
>>> results back.
>>>
>>> 0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets - Apr
>>> 29, 01:18
>>> Hannes Reinecke (2):
>>> nvme-tcp: sanitize request list handling
>>> nvme-tcp: fix I/O stalls on congested sockets
>>>
>>> 1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
>>> 2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr
>>> 29, 01:18
>> IO timeouts are still occurring with Writes. The only Read that timed
>> out was most likely due to the path error. It takes ~4.5 hours to fail.
>>
>> However, this test does not fail if either ECN is off or if digests
>> are not enabled. These passing combinations were run for 16+ hours
>> without any issues. Both ECN and Header+Data Digests need to be turned
>> on for it to fail.
>>
>> Do you have a failing test as well? If so, is it quicker to cause the
>> failure? Would you mind sharing any details?
>>
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4
>> opcode 0x2 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing
>> I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>>
>> In the current build I had these patches on top of the "nvme-6.16"
>> branch:
>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>
> Extremely wild guess: Can you try with this patch on top?
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0e178115dc04..cdb8ea4eb467 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1277,10 +1277,7 @@ static int nvme_tcp_try_send_ddgst(struct
> nvme_tcp_request *req)
> .iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
> };
>
> - if (nvme_tcp_queue_more(queue))
> - msg.msg_flags |= MSG_MORE;
> - else
> - msg.msg_flags |= MSG_EOR;
> + msg.msg_flags |= MSG_EOR;
>
> ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
> if (unlikely(ret <= 0))
>
> It _could_ be that we're waiting in sendmsg() due to MSG_MORE, causing
> these I/O timeouts as processing doesn't continue.
>
After a bit more thought, I really do think this is a different issue.
The original issue I encountered was an I/O stall where the io_work
thread would never be restarted even though I/O was pending.
But we still have the problem that we will encounter I/O timeouts on
oversubscribed links, where we shove too much work onto the same link.
In that case we will also see I/O timeouts (as the PDUs take too long
to be sent), but this is not something my patchset addresses.
And seeing that your error depends on ECN (which _will_ cause a delay
on packet transmission), I tend to think it's actually a problem with
oversubscription, not with io_work not being restarted.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-08 21:23 ` Kamaljit Singh
2025-05-09 6:52 ` Hannes Reinecke
@ 2025-05-11 9:11 ` Sagi Grimberg
2025-05-13 19:24 ` Kamaljit Singh
1 sibling, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-05-11 9:11 UTC (permalink / raw)
To: Kamaljit Singh, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 09/05/2025 0:23, Kamaljit Singh wrote:
> Hi Hannes, Sagi,
>
> On 5/7/25 15:26, Kamaljit Singh wrote:
>>> Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>>> I/O stall on congested sockets' _only_ ?
>>> (on top of nvme-6.16 latest, of course).
>>> I think I _should_ have included all the suggestions floating here,
>>> but we need to have confirmation.
>> I've built the kernel against the latest of nvme-6.16 branch along with
>> these patches. Its with test now. Will let you know when we get some
>> results back.
>>
>> 0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets - Apr 29, 01:18
>> Hannes Reinecke (2):
>> nvme-tcp: sanitize request list handling
>> nvme-tcp: fix I/O stalls on congested sockets
>>
>> 1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
>> 2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr 29, 01:18
> IO timeouts are still occurring with Writes. The only Read that timed
> out was most likely due to the path error. It takes ~4.5 hours to fail.
>
> However, this test does not fail if either ECN is off or if digests
> are not enabled. These passing combinations were run for 16+ hours
> without any issues. Both ECN and Header+Data Digests need to be turned
> on for it to fail.
>
> Do you have a failing test as well? If so, is it quicker to cause the
> failure? Would you mind sharing any details?
>
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4 opcode 0x2 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>
> In the current build I had these patches on top of the "nvme-6.16" branch:
> 41b2c90a51bd nvme-tcp: sanitize request list handling
> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
Kamaljit, with the prior version of the patchset (the proposal with the
wake_sender flag)
did this not reproduce regardless of ECN?
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-09 6:52 ` Hannes Reinecke
2025-05-09 8:58 ` Hannes Reinecke
@ 2025-05-11 9:12 ` Sagi Grimberg
1 sibling, 0 replies; 48+ messages in thread
From: Sagi Grimberg @ 2025-05-11 9:12 UTC (permalink / raw)
To: Hannes Reinecke, Kamaljit Singh, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 09/05/2025 9:52, Hannes Reinecke wrote:
> On 5/8/25 23:23, Kamaljit Singh wrote:
>> Hi Hannes, Sagi,
>>
>> On 5/7/25 15:26, Kamaljit Singh wrote:
>>>> Can you please retest with the patchset '[PATCHv5 0/2] nvme-tcp: fixup
>>>> I/O stall on congested sockets' _only_ ?
>>>> (on top of nvme-6.16 latest, of course).
>>>> I think I _should_ have included all the suggestions floating here,
>>>> but we need to have confirmation.
>>>
>>> I've built the kernel against the latest of nvme-6.16 branch along with
>>> these patches. Its with test now. Will let you know when we get some
>>> results back.
>>>
>>> 0. [PATCHv5 0/2] nvme-tcp: fixup I/O stall on congested sockets -
>>> Apr 29, 01:18
>>> Hannes Reinecke (2):
>>> nvme-tcp: sanitize request list handling
>>> nvme-tcp: fix I/O stalls on congested sockets
>>>
>>> 1. [PATCH 1/2] nvme-tcp: sanitize request list handling - Apr 29, 02:31
>>> 2. [PATCH 2/2] nvme-tcp: fix I/O stalls on congested sockets - Apr
>>> 29, 01:18
>> IO timeouts are still occurring with Writes. The only Read that timed
>> out was most likely due to the path error. It takes ~4.5 hours to fail.
>>
>> However, this test does not fail if either ECN is off or if digests
>> are not enabled. These passing combinations were run for 16+ hours
>> without any issues. Both ECN and Header+Data Digests need to be turned
>> on for it to fail.
>>
>> Do you have a failing test as well? If so, is it quicker to cause the
>> failure? Would you mind sharing any details?
>>
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4
>> opcode 0x2 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4
>> opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>> requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>>
>> In the current build I had these patches on top of the "nvme-6.16"
>> branch:
>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>
> Extremely wild guess: Can you try with this patch on top?
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0e178115dc04..cdb8ea4eb467 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1277,10 +1277,7 @@ static int nvme_tcp_try_send_ddgst(struct
> nvme_tcp_request *req)
> .iov_len = NVME_TCP_DIGEST_LENGTH - req->offset
> };
>
> - if (nvme_tcp_queue_more(queue))
> - msg.msg_flags |= MSG_MORE;
> - else
> - msg.msg_flags |= MSG_EOR;
> + msg.msg_flags |= MSG_EOR;
>
> ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
> if (unlikely(ret <= 0))
>
> It _could_ be that we're waiting in sendmsg() due to MSG_MORE, causing
> these I/O timeouts as processing doesn't continue.
This change doesn't make sense to me...
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-11 9:11 ` Sagi Grimberg
@ 2025-05-13 19:24 ` Kamaljit Singh
2025-05-14 6:35 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-05-13 19:24 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Sagi, Hannes,
On 09/11/2025 02:11, Sagi Grimberg wrote:
>> IO timeouts are still occurring with Writes. The only Read that timed
>> out was most likely due to the path error. It takes ~4.5 hours to fail.
>>
>> However, this test does not fail if either ECN is off or if digests
>> are not enabled. These passing combinations were run for 16+ hours
>> without any issues. Both ECN and Header+Data Digests need to be turned
>> on for it to fail.
>>
>> Do you have a failing test as well? If so, is it quicker to cause the
>> failure? Would you mind sharing any details?
>>
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4 opcode 0x2 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>>
>> In the current build I had these patches on top of the "nvme-6.16" branch:
>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>
>Kamaljit, with the prior version of the patchset (the proposal with the
>wake_sender flag) did this not reproduce regardless of ECN?
With the last patchset, when ECN=off, we did not see any IO timeouts
even with a weekend-long test. This was true for both cases, i.e. with
Inband Auth and with SecureConcat.
With ECN=on and HD+DD=on, IO timeouts still persist for both Inband Auth
and SC. I'm currently debugging a possible target-side issue with ECN. I'll
let you know once I have some resolution.
I don't have any clear indication of the original kernel issue that would
let me differentiate it from the current target-side issue. So, if you want
to go ahead and merge those two patchsets, that may be fine for now.
Thanks,
Kamaljit
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-13 19:24 ` Kamaljit Singh
@ 2025-05-14 6:35 ` Hannes Reinecke
2025-05-17 10:01 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-05-14 6:35 UTC (permalink / raw)
To: Kamaljit Singh, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 5/13/25 21:24, Kamaljit Singh wrote:
> Hi Sagi, Hannes,
>
> On 09/11/2025 02:11, Sagi Grimberg wrote:
>>> IO timeouts are still occurring with Writes. The only Read that timed
>>> out was most likely due to the path error. It takes ~4.5 hours to fail.
>>>
>>> However, this test does not fail if either ECN is off or if digests
>>> are not enabled. These passing combinations were run for 16+ hours
>>> without any issues. Both ECN and Header+Data Digests need to be turned
>>> on for it to fail.
>>>
>>> Do you have a failing test as well? If so, is it quicker to cause the
>>> failure? Would you mind sharing any details?
>>>
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4 opcode 0x2 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path - requeuing I/O
>>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10 seconds...
>>>
>>> In the current build I had these patches on top of the "nvme-6.16" branch:
>>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>
>> Kamaljit, with the prior version of the patchset (the proposal with the
>> wake_sender flag) did this not reproduce regardless of ECN?
>
> With the last patchset, when ECN=off, we did not see any IO timeouts
> even with a weekend long test. This was true for both cases, i.e. with
> Inband Auth and with SecureConcat.
>
> With ECN=on & HD+DD=on IO timeout still persists for both Inband Auth &
> SC. I’m currently debugging a possible target side issue with ECN. I’ll
> let you know once I have some resolution.
>
> I don't have any clear indications of the original kernel issue to be able to
> differentiate against the current target-side issue. So, if you want to go
> ahead and merge those two patchsets that may be fine for now.
>
Thanks a lot for your confirmation.
We continue to have issues under high load or on oversubscribed fabrics.
But this patchset addresses the problem of I/O timeouts during
_connect_, which I would argue is a different story.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-14 6:35 ` Hannes Reinecke
@ 2025-05-17 10:01 ` Sagi Grimberg
2025-05-17 10:12 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-05-17 10:01 UTC (permalink / raw)
To: Hannes Reinecke, Kamaljit Singh, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 14/05/2025 9:35, Hannes Reinecke wrote:
> On 5/13/25 21:24, Kamaljit Singh wrote:
>> Hi Sagi, Hannes,
>>
>> On 09/11/2025 02:11, Sagi Grimberg wrote:
>>>> IO timeouts are still occurring with Writes. The only Read that timed
>>>> out was most likely due to the path error. It takes ~4.5 hours to
>>>> fail.
>>>>
>>>> However, this test does not fail if either ECN is off or if digests
>>>> are not enabled. These passing combinations were run for 16+ hours
>>>> without any issues. Both ECN and Header+Data Digests need to be turned
>>>> on for it to fail.
>>>>
>>>> Do you have a failing test as well? If so, is it quicker to cause the
>>>> failure? Would you mind sharing any details?
>>>>
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4
>>>> opcode 0x2 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type 4
>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>> requeuing I/O
>>>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10
>>>> seconds...
>>>>
>>>> In the current build I had these patches on top of the "nvme-6.16"
>>>> branch:
>>>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>>>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>>
>>> Kamaljit, with the prior version of the patchset (the proposal with the
>>> wake_sender flag) did this not reproduce regardless of ECN?
>> With the last patchset, when ECN=off, we did not see any IO timeouts
>> even with a weekend long test. This was true for both cases, i.e. with
>> Inband Auth and with SecureConcat.
>>
>> With ECN=on & HD+DD=on IO timeout still persists for both Inband Auth &
>> SC. I’m currently debugging a possible target side issue with ECN. I’ll
>> let you know once I have some resolution.
>>
>> I don't have any clear indications of the original kernel issue to be
>> able to
>> differentiate against the current target-side issue. So, if you want
>> to go
>> ahead and merge those two patchsets that may be fine for now.
>>
> Thanks a lot for your confirmation.
> We continue to have issues with high load or oversubscribed fabrics.
> But this patchset is addressing the problem of I/O timeouts during
> _connect_, which I would argue is a different story.
We still need to hunt these down. I'm still puzzled why adding the
WAKE_SENDER flag made this issue disappear; I'll have another look at
this patch.
For now, I think we can go with this patchset, and then incrementally
fix the rest.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-17 10:01 ` Sagi Grimberg
@ 2025-05-17 10:12 ` Sagi Grimberg
2025-05-27 21:49 ` Sagi Grimberg
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-05-17 10:12 UTC (permalink / raw)
To: Hannes Reinecke, Kamaljit Singh, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 17/05/2025 13:01, Sagi Grimberg wrote:
>
>
> On 14/05/2025 9:35, Hannes Reinecke wrote:
>> On 5/13/25 21:24, Kamaljit Singh wrote:
>>> Hi Sagi, Hannes,
>>>
>>> On 09/11/2025 02:11, Sagi Grimberg wrote:
>>>>> IO timeouts are still occurring with Writes. The only Read that timed
>>>>> out was most likely due to the path error. It takes ~4.5 hours to
>>>>> fail.
>>>>>
>>>>> However, this test does not fail if either ECN is off or if digests
>>>>> are not enabled. These passing combinations were run for 16+ hours
>>>>> without any issues. Both ECN and Header+Data Digests need to be
>>>>> turned
>>>>> on for it to fail.
>>>>>
>>>>> Do you have a failing test as well? If so, is it quicker to cause the
>>>>> failure? Would you mind sharing any details?
>>>>>
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4
>>>>> opcode 0x2 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10
>>>>> seconds...
>>>>>
>>>>> In the current build I had these patches on top of the "nvme-6.16"
>>>>> branch:
>>>>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>>>>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>>>
>>>> Kamaljit, with the prior version of the patchset (the proposal with
>>>> the
>>>> wake_sender flag) did this not reproduce regardless of ECN?
>>> With the last patchset, when ECN=off, we did not see any IO timeouts
>>> even with a weekend long test. This was true for both cases, i.e. with
>>> Inband Auth and with SecureConcat.
>>>
>>> With ECN=on & HD+DD=on IO timeout still persists for both Inband Auth &
>>> SC. I’m currently debugging a possible target side issue with ECN. I’ll
>>> let you know once I have some resolution.
>>>
>>> I don't have any clear indications of the original kernel issue to
>>> be able to
>>> differentiate against the current target-side issue. So, if you want
>>> to go
>>> ahead and merge those two patchsets that may be fine for now.
>>>
>> Thanks a lot for your confirmation.
>> We continue to have issues with high load or oversubscribed fabrics.
>> But this patchset is addressing the problem of I/O timeouts during
>> _connect_, which I would argue is a different story.
>
> We still need to hunt these down. I'm still puzzled why adding the
> WAKE_SENDER
> flag was able to make this issue disappear? I'll have another look at
> this patch.
>
> For now, I think we can go with this patchset, and then incrementally
> fix the remains.
Kamaljit, can you check the following patch on top of the patchset from
Hannes that
gets a reproduction?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 835e29014841..2f5f2fcfb078 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
- if (likely(queue && sk_stream_is_writeable(sk))) {
+ if (likely(queue)) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--
I think this may be preventing the scheduling of io_work. But now
that io_work also ceases based on sk_stream_is_writeable(), we should
probably still schedule it.
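To make the suspected race concrete, here is a minimal, self-contained
user-space sketch of the lost-wakeup pattern (all names such as fake_queue,
io_work and write_space_gated are made up for illustration; this is not the
nvme-tcp driver code): the worker stops when the socket is not writeable,
and a gated wakeup that re-checks writeability at the instant it fires can
leave pending requests stranded.
--
/*
 * Minimal user-space sketch of the lost-wakeup pattern described above.
 * Everything here (fake_queue, io_work, write_space_gated, ...) is a
 * made-up stand-in, not the nvme-tcp driver code; it only replays the
 * suspected ordering of events.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_queue {
        bool sock_writeable;    /* stand-in for sk_stream_is_writeable() */
        bool have_pending_req;  /* requests sitting on req_list/send_list */
        bool io_work_queued;    /* io_work scheduled on the workqueue */
};

/* io_work: send while the socket is writeable, then go idle. */
static void io_work(struct fake_queue *q)
{
        q->io_work_queued = false;
        while (q->have_pending_req && q->sock_writeable) {
                printf("io_work: sending a request\n");
                q->have_pending_req = false;
        }
        /* Socket full: stop and rely on write_space() to wake us again. */
}

/* Gated wakeup: only reschedule io_work if the socket looks writeable. */
static void write_space_gated(struct fake_queue *q)
{
        if (q->sock_writeable)
                q->io_work_queued = true;
}

/* Unconditional wakeup: always reschedule; io_work re-checks on its own. */
static void write_space_unconditional(struct fake_queue *q)
{
        q->io_work_queued = true;
}

int main(void)
{
        struct fake_queue q = { .sock_writeable = false,
                                .have_pending_req = true,
                                .io_work_queued = false };

        /* io_work ran, found the socket full, and went idle. */
        io_work(&q);

        /*
         * write_space fires, but the writeable check still reads false at
         * that instant: with the gated variant nothing is queued and the
         * pending request stalls until an I/O timeout kicks in.
         */
        write_space_gated(&q);
        printf("gated:         queued=%d pending=%d\n",
               q.io_work_queued, q.have_pending_req);

        /* With the unconditional variant io_work runs again and re-checks. */
        write_space_unconditional(&q);
        q.sock_writeable = true;        /* space really is available by now */
        if (q.io_work_queued)
                io_work(&q);
        printf("unconditional: queued=%d pending=%d\n",
               q.io_work_queued, q.have_pending_req);
        return 0;
}
--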
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-17 10:12 ` Sagi Grimberg
@ 2025-05-27 21:49 ` Sagi Grimberg
2025-05-28 1:43 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Sagi Grimberg @ 2025-05-27 21:49 UTC (permalink / raw)
To: Hannes Reinecke, Kamaljit Singh, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
>>
>> We still need to hunt these down. I'm still puzzled why adding the
>> WAKE_SENDER
>> flag was able to make this issue disappear? I'll have another look at
>> this patch.
>>
>> For now, I think we can go with this patchset, and then incrementally
>> fix the remains.
>
> Kamaljit, can you check the following patch on top of the patchset
> from Hannes that
> gets a reproduction?
>
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 835e29014841..2f5f2fcfb078 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>
> read_lock_bh(&sk->sk_callback_lock);
> queue = sk->sk_user_data;
> - if (likely(queue && sk_stream_is_writeable(sk))) {
> + if (likely(queue)) {
> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> queue_work_on(queue->io_cpu, nvme_tcp_wq,
> &queue->io_work);
> }
> --
>
> I think this may be preventing the scheduling of io_work. But now
> that io_work also ceases based on sk_stream_is_writeable(), we should
> probably still schedule it.
Kamaljit? Any info on this?
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-27 21:49 ` Sagi Grimberg
@ 2025-05-28 1:43 ` Kamaljit Singh
2025-05-28 6:33 ` Hannes Reinecke
0 siblings, 1 reply; 48+ messages in thread
From: Kamaljit Singh @ 2025-05-28 1:43 UTC (permalink / raw)
To: Sagi Grimberg, Hannes Reinecke, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Sagi,
>On 27/05/2025 2:49, Sagi Grimberg wrote:
>>>
>>> We still need to hunt these down. I'm still puzzled why adding the
>>> WAKE_SENDER
>>> flag was able to make this issue disappear? I'll have another look at
>>> this patch.
>>>
>>> For now, I think we can go with this patchset, and then incrementally
>>> fix the remains.
>>
>> Kamaljit, can you check the following patch on top of the patchset
>> from Hannes that
>> gets a reproduction?
>>
>> --
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 835e29014841..2f5f2fcfb078 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>
>> read_lock_bh(&sk->sk_callback_lock);
>> queue = sk->sk_user_data;
>> - if (likely(queue && sk_stream_is_writeable(sk))) {
>> + if (likely(queue)) {
>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>> queue_work_on(queue->io_cpu, nvme_tcp_wq,
>> &queue->io_work);
>> }
>> --
>>
>> I think this may be preventing the scheduling of io_work. But now
>> that io_work also ceases based on sk_stream_is_writeable(), we should
>> probably still schedule it.
>
>Kamaljit? Any info on this?
Sorry for the delayed reply. We've been hitting one issue after another,
hence the delays. Hoping that by tomorrow I should have clean results with
a clear a-b comparison of only the changes you suggested.
Thanks,
Kamaljit
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-28 1:43 ` Kamaljit Singh
@ 2025-05-28 6:33 ` Hannes Reinecke
2025-05-28 22:45 ` Kamaljit Singh
0 siblings, 1 reply; 48+ messages in thread
From: Hannes Reinecke @ 2025-05-28 6:33 UTC (permalink / raw)
To: Kamaljit Singh, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
On 5/28/25 03:43, Kamaljit Singh wrote:
> Hi Sagi,
>
>> On 27/05/2025 2:49, Sagi Grimberg wrote:
>>>>
>>>> We still need to hunt these down. I'm still puzzled why adding the
>>>> WAKE_SENDER
>>>> flag was able to make this issue disappear? I'll have another look at
>>>> this patch.
>>>>
>>>> For now, I think we can go with this patchset, and then incrementally
>>>> fix the remains.
>>>
>>> Kamaljit, can you check the following patch on top of the patchset
>>> from Hannes that
>>> gets a reproduction?
>>>
>>> --
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 835e29014841..2f5f2fcfb078 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>>
>>> read_lock_bh(&sk->sk_callback_lock);
>>> queue = sk->sk_user_data;
>>> - if (likely(queue && sk_stream_is_writeable(sk))) {
>>> + if (likely(queue)) {
>>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq,
>>> &queue->io_work);
>>> }
>>> --
>>>
>>> I think this may be preventing the scheduling of io_work. But now
>>> that io_work also ceases based on sk_stream_is_writeable(), we should
>>> probably still schedule it.
>>
>> Kamaljit? Any info on this?
>
> Sorry for the delayed reply. We've been hitting one issue after another,
> hence the delays. Hoping that by tomorrow I should have clean results with
> a clear a-b comparison of only the changes you suggested.
>
Just to make you aware, I've found another issue wrt
sk_stream_is_writeable(): we are checking it from the data_ready
callback, and do not schedule io_work if no write space is available.
The problem here is that the data_ready callback indicates that there
_should_ be space available, so by checking sk_stream_is_writeable()
we already assume that the actual write space might change and that
the callback might not be a reliable indicator.
But from that it follows that the opposite might also be true, namely
that write space _might_ be available by the time io_work is scheduled.
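As a rough illustration of that direction (made-up names, not the actual
driver or the reposted series): let the socket callbacks kick io_work
unconditionally, and let io_work be the only place that re-evaluates
whether the socket is writeable.
--
/*
 * Rough sketch of the direction described above, with made-up names;
 * this is not the nvme-tcp driver and not the reposted patch itself.
 * The socket callbacks just kick the worker, and only the worker
 * decides whether it can actually make progress right now.
 */
#include <stdbool.h>
#include <stdio.h>

struct sketch_queue {
        bool writeable;         /* stand-in for sk_stream_is_writeable() */
        int  pending;           /* requests waiting to be sent */
};

/* data_ready / write_space: no gating, always reschedule the worker. */
static bool callback_kick(void)
{
        return true;            /* "queue_work_on(...)" would go here */
}

/* io_work: the single place that re-evaluates the socket state. */
static void io_work(struct sketch_queue *q)
{
        while (q->pending && q->writeable) {
                q->pending--;
                printf("io_work: sent one request, %d left\n", q->pending);
        }
}

int main(void)
{
        struct sketch_queue q = { .writeable = false, .pending = 2 };

        if (callback_kick())    /* data_ready fires while not writeable */
                io_work(&q);    /* harmless: the worker just returns */

        q.writeable = true;     /* write space shows up before the next kick */
        if (callback_kick())
                io_work(&q);    /* now the pending requests drain */
        return 0;
}
--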
I'll repost the series.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
2025-05-28 6:33 ` Hannes Reinecke
@ 2025-05-28 22:45 ` Kamaljit Singh
0 siblings, 0 replies; 48+ messages in thread
From: Kamaljit Singh @ 2025-05-28 22:45 UTC (permalink / raw)
To: Hannes Reinecke, Sagi Grimberg, Hannes Reinecke, hch
Cc: Keith Busch, linux-nvme@lists.infradead.org, Chris Leech
Hi Hannes & Sagi,
On 5/27/25 23:33, Hannes Reinecke wrote:
>>> On 27/05/2025 2:49, Sagi Grimberg wrote:
>>>>>
>>>>> We still need to hunt these down. I'm still puzzled why adding the
>>>>> WAKE_SENDER
>>>>> flag was able to make this issue disappear? I'll have another look at
>>>>> this patch.
>>>>>
>>>>> For now, I think we can go with this patchset, and then incrementally
>>>>> fix the remains.
>>>>
>>>> Kamaljit, can you check the following patch on top of the patchset
>>>> from Hannes that
>>>> gets a reproduction?
>>>>
>>>> --
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 835e29014841..2f5f2fcfb078 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>>>
>>>> read_lock_bh(&sk->sk_callback_lock);
>>>> queue = sk->sk_user_data;
>>>> - if (likely(queue && sk_stream_is_writeable(sk))) {
>>>> + if (likely(queue)) {
>>>> clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>>> queue_work_on(queue->io_cpu, nvme_tcp_wq,
>>>> &queue->io_work);
>>>> }
>>>> --
>>>>
>>>> I think this may be preventing the scheduling of io_work. But now
>>>> that io_work also ceases based on sk_stream_is_writeable(), we should
>>>> probably still schedule it.
>>>
>>> Kamaljit? Any info on this?
>>
>> Sorry for the delayed reply. We've been hitting one issue after another,
>> hence the delays. Hoping that by tomorrow I should have clean results with
>> a clear a-b comparison of only the changes you suggested.
Sagi,
After several more runs, I still don't have any definitive results for the
driver change you suggested. The tests did fail with IO timeouts, but
the root causes were clearly unrelated to the kernel/driver.
>Just to make you aware, I've found another issue wrt
>sk_stream_is_writeable(); we are checking it from the data_ready
>callback, and do not schedule io_work if no write space is available.
>Problem here is that the data_ready callback indicates that there
>_should_ be space available, so by checking sk_stream_is_writeable()
>we already assume that the actual write space might change, and
>the callback might not be a reliable indicator.
>But from that follows that also the opposite might be true, namely
>that write space _might_ be available by the time io_work is scheduled.
>I'll repost the series.
Hannes, thank you for the V4 patchset. It looks like V4 includes the last change
that Sagi suggested, i.e. removing sk_stream_is_writeable(sk) from
nvme_tcp_write_space(). We will continue testing with your V4 patchset and
provide feedback.
Thanks,
Kamaljit
^ permalink raw reply [flat|nested] 48+ messages in thread
end of thread, other threads:[~2025-05-28 22:47 UTC | newest]
Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-03 6:55 [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
2025-04-03 6:55 ` [PATCH 1/3] nvme-tcp: open-code nvme_tcp_queue_request() for R2T Hannes Reinecke
2025-04-13 22:56 ` Sagi Grimberg
2025-04-22 8:09 ` Christoph Hellwig
2025-04-03 6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
2025-04-13 23:00 ` Sagi Grimberg
2025-04-14 12:35 ` Hannes Reinecke
2025-04-14 20:29 ` Sagi Grimberg
2025-04-03 6:55 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
2025-04-13 23:09 ` Sagi Grimberg
2025-04-14 6:21 ` Hannes Reinecke
2025-04-14 20:24 ` Sagi Grimberg
2025-04-15 7:07 ` Hannes Reinecke
2025-04-15 21:35 ` Sagi Grimberg
2025-04-16 7:56 ` Hannes Reinecke
2025-04-16 22:09 ` Sagi Grimberg
2025-04-17 23:03 ` Kamaljit Singh
2025-04-17 23:06 ` Kamaljit Singh
2025-04-18 10:51 ` Sagi Grimberg
2025-04-18 18:55 ` Kamaljit Singh
2025-04-19 16:10 ` Sagi Grimberg
2025-04-21 19:11 ` Kamaljit Singh
2025-04-22 11:43 ` Sagi Grimberg
2025-04-23 17:26 ` Kamaljit Singh
2025-04-24 11:26 ` Hannes Reinecke
2025-04-25 21:55 ` Sagi Grimberg
2025-04-25 22:09 ` Sagi Grimberg
2025-04-25 23:47 ` Kamaljit Singh
2025-04-27 10:35 ` Sagi Grimberg
[not found] ` <BY5PR04MB6849D4BF87755B82A073BBDABC89A@BY5PR04MB6849.namprd04.prod.outlook.com>
2025-05-06 18:05 ` Hannes Reinecke
2025-05-07 22:26 ` Kamaljit Singh
2025-05-08 21:23 ` Kamaljit Singh
2025-05-09 6:52 ` Hannes Reinecke
2025-05-09 8:58 ` Hannes Reinecke
2025-05-11 9:12 ` Sagi Grimberg
2025-05-11 9:11 ` Sagi Grimberg
2025-05-13 19:24 ` Kamaljit Singh
2025-05-14 6:35 ` Hannes Reinecke
2025-05-17 10:01 ` Sagi Grimberg
2025-05-17 10:12 ` Sagi Grimberg
2025-05-27 21:49 ` Sagi Grimberg
2025-05-28 1:43 ` Kamaljit Singh
2025-05-28 6:33 ` Hannes Reinecke
2025-05-28 22:45 ` Kamaljit Singh
-- strict thread matches above, loose matches on Subject: below --
2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall " Hannes Reinecke
2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
2025-03-11 18:53 ` Chris Leech
2025-03-12 8:16 ` Hannes Reinecke
2025-03-12 12:27 ` Maurizio Lombardi