[PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets
@ 2025-03-07 13:27 Hannes Reinecke
  2025-03-07 13:28 ` [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs Hannes Reinecke
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-07 13:27 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have been chasing keep-alive timeouts with TLS enabled in the last few
days (weeks, even :-( ). On larger setups (eg with 32 queues) the connection
never got established properly as I've been hitting keep-alive timeouts before
the last queue got connected.
Turns out that occasionally we simply do not send the keep-alive request; it's
been added to the request list but the io_work workqueue function is never
restarted as it bails out after nvme_tcp_try_recv() returns -EAGAIN.
During debugging I also found that we're quite lazy with the list
handling of requests, so I've added two preliminary patches to ensure
that all list elements are properly terminated.

As usual, comments and reviews are welcome.

Hannes Reinecke (3):
  nvme-tcp: avoid inline sending when handling R2T PDUs
  nvme-tcp: sanitize request list handling
  nvme-tcp: fix I/O stalls on congested sockets

 drivers/nvme/host/tcp.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs
  2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
@ 2025-03-07 13:28 ` Hannes Reinecke
  2025-03-11 18:40   ` Chris Leech
  2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-07 13:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, Hannes Reinecke

When handling an R2T PDU we should not attempt to send consecutive
PDUs as we are running from an softirq context, and sending PDUs
from the receive context will mess up latencies.
So just queue it and let the io_work workqueue function do the work.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 8c14018201db..034edf852878 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -391,7 +391,7 @@ static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
 }
 
 static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
-		bool sync, bool last)
+		bool last)
 {
 	struct nvme_tcp_queue *queue = req->queue;
 	bool empty;
@@ -405,7 +405,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	 * are on the same cpu, so we don't introduce contention.
 	 */
 	if (queue->io_cpu == raw_smp_processor_id() &&
-	    sync && empty && mutex_trylock(&queue->send_mutex)) {
+	    empty && mutex_trylock(&queue->send_mutex)) {
 		nvme_tcp_send_all(queue);
 		mutex_unlock(&queue->send_mutex);
 	}
@@ -758,7 +758,9 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	req->ttag = pdu->ttag;
 
 	nvme_tcp_setup_h2c_data_pdu(req);
-	nvme_tcp_queue_request(req, false, true);
+
+	llist_add(&req->lentry, &queue->req_list);
+	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 
 	return 0;
 }
@@ -2531,7 +2533,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
 
-	nvme_tcp_queue_request(&ctrl->async_req, true, true);
+	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
 
 static void nvme_tcp_complete_timed_out(struct request *rq)
@@ -2683,7 +2685,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	nvme_start_request(rq);
 
-	nvme_tcp_queue_request(req, true, bd->last);
+	nvme_tcp_queue_request(req, bd->last);
 
 	return BLK_STS_OK;
 }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs
  2025-03-07 13:28 ` [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs Hannes Reinecke
@ 2025-03-11 18:40   ` Chris Leech
  2025-03-12  8:14     ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Leech @ 2025-03-11 18:40 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On Fri, Mar 07, 2025 at 02:28:00PM +0100, Hannes Reinecke wrote:
> When handling an R2T PDU we should not attempt to send consecutive
> PDUs as we are running from an softirq context, and sending PDUs
> from the receive context will mess up latencies.
> So just queue it and let the io_work workqueue function do the work.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/tcp.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 

Am I missing something, or does this patch not actually change anything?
With sync=false nvme_tcp_queue_request will queue io_work, so what is
removing that flag and open coding in handle_rt2 accomplishing?

- Chris

> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 8c14018201db..034edf852878 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -391,7 +391,7 @@ static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
>  }
>  
>  static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
> -		bool sync, bool last)
> +		bool last)
>  {
>  	struct nvme_tcp_queue *queue = req->queue;
>  	bool empty;
> @@ -405,7 +405,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  	 * are on the same cpu, so we don't introduce contention.
>  	 */
>  	if (queue->io_cpu == raw_smp_processor_id() &&
> -	    sync && empty && mutex_trylock(&queue->send_mutex)) {
> +	    empty && mutex_trylock(&queue->send_mutex)) {
>  		nvme_tcp_send_all(queue);
>  		mutex_unlock(&queue->send_mutex);
>  	}
> @@ -758,7 +758,9 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>  	req->ttag = pdu->ttag;
>  
>  	nvme_tcp_setup_h2c_data_pdu(req);
> -	nvme_tcp_queue_request(req, false, true);
> +
> +	llist_add(&req->lentry, &queue->req_list);
> +	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  
>  	return 0;
>  }
> @@ -2531,7 +2533,7 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
>  	ctrl->async_req.curr_bio = NULL;
>  	ctrl->async_req.data_len = 0;
>  
> -	nvme_tcp_queue_request(&ctrl->async_req, true, true);
> +	nvme_tcp_queue_request(&ctrl->async_req, true);
>  }
>  
>  static void nvme_tcp_complete_timed_out(struct request *rq)
> @@ -2683,7 +2685,7 @@ static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
>  
>  	nvme_start_request(rq);
>  
> -	nvme_tcp_queue_request(req, true, bd->last);
> +	nvme_tcp_queue_request(req, bd->last);
>  
>  	return BLK_STS_OK;
>  }
> -- 
> 2.35.3
> 
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs
  2025-03-11 18:40   ` Chris Leech
@ 2025-03-12  8:14     ` Hannes Reinecke
  0 siblings, 0 replies; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-12  8:14 UTC (permalink / raw)
  To: Chris Leech, Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On 3/11/25 19:40, Chris Leech wrote:
> On Fri, Mar 07, 2025 at 02:28:00PM +0100, Hannes Reinecke wrote:
>> When handling an R2T PDU we should not attempt to send consecutive
>> PDUs as we are running from an softirq context, and sending PDUs
>> from the receive context will mess up latencies.
>> So just queue it and let the io_work workqueue function do the work.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 12 +++++++-----
>>   1 file changed, 7 insertions(+), 5 deletions(-)
>>
> 
> Am I missing something, or does this patch not actually change anything?
> With sync=false nvme_tcp_queue_request will queue io_work, so what is
> removing that flag and open coding in handle_rt2 accomplishing?
> 
Yeah; actually I wanted to reduce the complexity of the conditional in
nvme_tcp_queue_request(), and to emphasize that r2t handling should not
start sending PDUs.

Probably better to modify the patch description.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
  2025-03-07 13:28 ` [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs Hannes Reinecke
@ 2025-03-07 13:28 ` Hannes Reinecke
  2025-03-11 18:53   ` Chris Leech
  2025-03-07 13:28 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
  2025-03-10 17:02 ` [PATCH 0/3] nvme-tcp: fixup I/O stall " Keith Busch
  3 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-07 13:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, Hannes Reinecke

Validate the request in nvme_tcp_handle_r2t() to ensure it's not
part of any list, otherwise a malicious R2T PDU might inject a
loop in request list processing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 034edf852878..073c8c3538fd 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -441,6 +441,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
 	}
 
 	list_del(&req->entry);
+	init_llist_node(&req->lentry);
 	return req;
 }
 
@@ -548,6 +549,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	req->queue = queue;
 	nvme_req(rq)->ctrl = &ctrl->ctrl;
 	nvme_req(rq)->cmd = &pdu->cmd;
+	init_llist_node(&req->lentry);
+	INIT_LIST_HEAD(&req->entry);
 
 	return 0;
 }
@@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 
 	nvme_tcp_setup_h2c_data_pdu(req);
 
+	WARN_ON(queue->request == req);
+	WARN_ON(llist_on_list(&req->lentry));
+	WARN_ON(!list_empty(&req->entry));
 	llist_add(&req->lentry, &queue->req_list);
-	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	if (list_empty(&queue->send_list))
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 
 	return 0;
 }
@@ -2532,6 +2539,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.offset = 0;
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
+	init_llist_node(&ctrl->async_req.lentry);
+	INIT_LIST_HEAD(&ctrl->async_req.entry);
 
 	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-03-11 18:53   ` Chris Leech
  2025-03-12  8:16     ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Leech @ 2025-03-11 18:53 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On Fri, Mar 07, 2025 at 02:28:01PM +0100, Hannes Reinecke wrote:
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/tcp.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 

...

> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>  
>  	nvme_tcp_setup_h2c_data_pdu(req);
>  
> +	WARN_ON(queue->request == req);
> +	WARN_ON(llist_on_list(&req->lentry));
> +	WARN_ON(!list_empty(&req->entry));
>  	llist_add(&req->lentry, &queue->req_list);

Are we happy with a WARN here, or should this be handled as an error?
The idea of an duplicate R2Ts creating a loop in req_list is
frightening.

- Chris



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-03-11 18:53   ` Chris Leech
@ 2025-03-12  8:16     ` Hannes Reinecke
  2025-03-12 12:27       ` Maurizio Lombardi
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-12  8:16 UTC (permalink / raw)
  To: Chris Leech, Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On 3/11/25 19:53, Chris Leech wrote:
> On Fri, Mar 07, 2025 at 02:28:01PM +0100, Hannes Reinecke wrote:
>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>> part of any list, otherwise a malicious R2T PDU might inject a
>> loop in request list processing.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 11 ++++++++++-
>>   1 file changed, 10 insertions(+), 1 deletion(-)
>>
> 
> ...
> 
>> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>   
>>   	nvme_tcp_setup_h2c_data_pdu(req);
>>   
>> +	WARN_ON(queue->request == req);
>> +	WARN_ON(llist_on_list(&req->lentry));
>> +	WARN_ON(!list_empty(&req->entry));
>>   	llist_add(&req->lentry, &queue->req_list);
> 
> Are we happy with a WARN here, or should this be handled as an error?
> The idea of an duplicate R2Ts creating a loop in req_list is
> frightening.
> 
It is, but wanted to check if others see this as an issue, too.
We actually should bail out and reset the connection; BUG_ON()
is pretty harsh, and not really appropriate as this isn't an error
on our side.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-03-12  8:16     ` Hannes Reinecke
@ 2025-03-12 12:27       ` Maurizio Lombardi
  0 siblings, 0 replies; 15+ messages in thread
From: Maurizio Lombardi @ 2025-03-12 12:27 UTC (permalink / raw)
  To: Hannes Reinecke, Chris Leech, Hannes Reinecke
  Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On Wed Mar 12, 2025 at 9:16 AM CET, Hannes Reinecke wrote:
> On 3/11/25 19:53, Chris Leech wrote:
>>> @@ -759,8 +762,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>>>   
>>>   	nvme_tcp_setup_h2c_data_pdu(req);
>>>   
>>> +	WARN_ON(queue->request == req);
>>> +	WARN_ON(llist_on_list(&req->lentry));
>>> +	WARN_ON(!list_empty(&req->entry));
>>>   	llist_add(&req->lentry, &queue->req_list);
>> 
>> Are we happy with a WARN here, or should this be handled as an error?
>> The idea of an duplicate R2Ts creating a loop in req_list is
>> frightening.
>> 
> It is, but wanted to check if others see this as an issue, too.
> We actually should bail out and reset the connection; BUG_ON()
> is pretty harsh, and not really appropriate as this isn't an error
> on our side.

So a malicious or buggy target could trigger WARNs host-side? Did I understand it
correctly?

This doesn't sound ok to me, I think the host should print an error
message and reset the connection.

Maurizio


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
  2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
  2025-03-07 13:28 ` [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs Hannes Reinecke
  2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-03-07 13:28 ` Hannes Reinecke
  2025-03-11 18:55   ` Chris Leech
  2025-03-10 17:02 ` [PATCH 0/3] nvme-tcp: fixup I/O stall " Keith Busch
  3 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-03-07 13:28 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Christoph Hellwig, Keith Busch, linux-nvme, Hannes Reinecke

When the socket is busy processing nvme_tcp_try_recv() might
return -EAGAIN, but this doesn't automatically imply that
the sending side is blocked, too.
So check if there are pending requests once nvme_tcp_try_recv()
returns -EAGAIN and continue with the sending loop to avoid
I/O stalls.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 073c8c3538fd..d39d955bad99 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1357,9 +1357,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		result = nvme_tcp_try_recv(queue);
 		if (result > 0)
 			pending = true;
-		else if (unlikely(result < 0))
+		else if (unlikely(result < 0) && result != -EAGAIN)
 			return;
 
+		if (nvme_tcp_queue_has_pending(queue))
+			pending = true;
+
 		if (!pending || !queue->rd_enabled)
 			return;
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
  2025-03-07 13:28 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
@ 2025-03-11 18:55   ` Chris Leech
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Leech @ 2025-03-11 18:55 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, Christoph Hellwig, Keith Busch, linux-nvme

On Fri, Mar 07, 2025 at 02:28:02PM +0100, Hannes Reinecke wrote:
> When the socket is busy processing nvme_tcp_try_recv() might
> return -EAGAIN, but this doesn't automatically imply that
> the sending side is blocked, too.
> So check if there are pending requests once nvme_tcp_try_recv()
> returns -EAGAIN and continue with the sending loop to avoid
> I/O stalls.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---

This looks good to me.

Acked-by: Chris Leech <cleech@redhat.com>

>  drivers/nvme/host/tcp.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 073c8c3538fd..d39d955bad99 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1357,9 +1357,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>  		result = nvme_tcp_try_recv(queue);
>  		if (result > 0)
>  			pending = true;
> -		else if (unlikely(result < 0))
> +		else if (unlikely(result < 0) && result != -EAGAIN)
>  			return;
>  
> +		if (nvme_tcp_queue_has_pending(queue))
> +			pending = true;
> +
>  		if (!pending || !queue->rd_enabled)
>  			return;
>  
> -- 
> 2.35.3
> 
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets
  2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
                   ` (2 preceding siblings ...)
  2025-03-07 13:28 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
@ 2025-03-10 17:02 ` Keith Busch
  3 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2025-03-10 17:02 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Sagi Grimberg, Christoph Hellwig, linux-nvme

On Fri, Mar 07, 2025 at 02:27:59PM +0100, Hannes Reinecke wrote:
> I have been chasing keep-alive timeouts with TLS enabled in the last few
> days (weeks, even :-( ). On larger setups (eg with 32 queues) the connection
> never got established properly as I've been hitting keep-alive timeouts before
> the last queue got connected.
> Turns out that occasionally we simply do not send the keep-alive request; it's
> been added to the request list but the io_work workqueue function is never
> restarted as it bails out after nvme_tcp_try_recv() returns -EAGAIN.
> During debugging I also found that we're quite lazy with the list
> handling of requests, so I've added two preliminary patches to ensure
> that all list elements are properly terminated.

Thanks, this all looks pretty good to me.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCHv3 0/3] nvme-tcp: fixup I/O stall on congested sockets
@ 2025-04-03  6:55 Hannes Reinecke
  2025-04-03  6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-04-03  6:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Hi all,

I have been chasing keep-alive timeouts with TLS enabled in the last few
days (weeks, even :-( ). On larger setups (eg with 32 queues) the connection
never got established properly as I've been hitting keep-alive timeouts before
the last queue got connected.
Turns out that occasionally we simply do not send the keep-alive request; it's
been added to the request list but the io_work workqueue function is never
restarted as it bails out after nvme_tcp_try_recv() returns -EAGAIN.
During debugging I also found that we're quite lazy with the list
handling of requests, so I've added two preliminary patches to ensure
that all list elements are properly terminated.

As usual, comments and reviews are welcome.

Changes to v2:
- Removed AEN patches again

Changes to the original submission:
- Include reviews from Chris Leech
- Add patch to requeue namespace scan
- Add patch to re-read ANA log page

Hannes Reinecke (3):
  nvme-tcp: open-code nvme_tcp_queue_request() for R2T
  nvme-tcp: sanitize request list handling
  nvme-tcp: fix I/O stalls on congested sockets

 drivers/nvme/host/tcp.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

-- 
2.35.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-04-03  6:55 [PATCHv3 " Hannes Reinecke
@ 2025-04-03  6:55 ` Hannes Reinecke
  2025-04-13 23:00   ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-04-03  6:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Sagi Grimberg, Keith Busch, linux-nvme, Hannes Reinecke

Validate the request in nvme_tcp_handle_r2t() to ensure it's not
part of any list, otherwise a malicious R2T PDU might inject a
loop in request list processing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b2f2aef2e2f2..1a319cb86453 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -454,6 +454,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
 	}
 
 	list_del(&req->entry);
+	init_llist_node(&req->lentry);
 	return req;
 }
 
@@ -561,6 +562,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	req->queue = queue;
 	nvme_req(rq)->ctrl = &ctrl->ctrl;
 	nvme_req(rq)->cmd = &pdu->cmd;
+	init_llist_node(&req->lentry);
+	INIT_LIST_HEAD(&req->entry);
 
 	return 0;
 }
@@ -765,6 +768,15 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 		return -EPROTO;
 	}
 
+	if (queue->request == req ||
+	    llist_on_list(&req->lentry) ||
+	    !list_empty(&req->entry)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d unexpected r2t while processing request\n",
+			rq->tag);
+		return -EPROTO;
+	}
+
 	req->pdu_len = 0;
 	req->h2cdata_left = r2t_length;
 	req->h2cdata_offset = r2t_offset;
@@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	nvme_tcp_setup_h2c_data_pdu(req);
 
 	llist_add(&req->lentry, &queue->req_list);
-	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+	if (list_empty(&queue->send_list))
+		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 
 	return 0;
 }
@@ -2558,6 +2571,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.offset = 0;
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
+	init_llist_node(&ctrl->async_req.lentry);
+	INIT_LIST_HEAD(&ctrl->async_req.entry);
 
 	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-04-03  6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
@ 2025-04-13 23:00   ` Sagi Grimberg
  2025-04-14 12:35     ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Sagi Grimberg @ 2025-04-13 23:00 UTC (permalink / raw)
  To: Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme



On 03/04/2025 9:55, Hannes Reinecke wrote:
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.

Not clear what do you mean by "malicious R2T PDU".
Can you please clarify what you have seen/observed in the commit msg 
(i.e. what led you
to this patch)?

>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>   drivers/nvme/host/tcp.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index b2f2aef2e2f2..1a319cb86453 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -454,6 +454,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
>   	}
>   
>   	list_del(&req->entry);
> +	init_llist_node(&req->lentry);
>   	return req;
>   }
>   
> @@ -561,6 +562,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
>   	req->queue = queue;
>   	nvme_req(rq)->ctrl = &ctrl->ctrl;
>   	nvme_req(rq)->cmd = &pdu->cmd;
> +	init_llist_node(&req->lentry);
> +	INIT_LIST_HEAD(&req->entry);
>   
>   	return 0;
>   }
> @@ -765,6 +768,15 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>   		return -EPROTO;
>   	}
>   
> +	if (queue->request == req ||
> +	    llist_on_list(&req->lentry) ||
> +	    !list_empty(&req->entry)) {
> +		dev_err(queue->ctrl->ctrl.device,
> +			"req %d unexpected r2t while processing request\n",
> +			rq->tag);
> +		return -EPROTO;
> +	}
> +
>   	req->pdu_len = 0;
>   	req->h2cdata_left = r2t_length;
>   	req->h2cdata_offset = r2t_offset;
> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>   	nvme_tcp_setup_h2c_data_pdu(req);
>   
>   	llist_add(&req->lentry, &queue->req_list);
> -	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> +	if (list_empty(&queue->send_list))
> +		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

Is this change mandatory? looks out of place.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-04-13 23:00   ` Sagi Grimberg
@ 2025-04-14 12:35     ` Hannes Reinecke
  2025-04-14 20:29       ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2025-04-14 12:35 UTC (permalink / raw)
  To: Sagi Grimberg, Hannes Reinecke, Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On 4/14/25 01:00, Sagi Grimberg wrote:
> 
> 
> On 03/04/2025 9:55, Hannes Reinecke wrote:
>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>> part of any list, otherwise a malicious R2T PDU might inject a
>> loop in request list processing.
> 
> Not clear what do you mean by "malicious R2T PDU".
> Can you please clarify what you have seen/observed in the commit msg 
> (i.e. what led you to this patch)?
> 
This is coming from code inspection only.
In nvme_tcp_handle_r2t() we are looking up a request by the 'command_id'
value in the pdu, and then add it to 'queue->req_list' without further
checking.
So if a malicious R2T packet is received containing the command_id of
a command currently on 'queue->req_list' we'll end up with duplicate
entries on the list.

[ .. ]

>> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct 
>> nvme_tcp_queue *queue,
>>       nvme_tcp_setup_h2c_data_pdu(req);
>>       llist_add(&req->lentry, &queue->req_list);
>> -    queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>> +    if (list_empty(&queue->send_list))
>> +        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
> 
> Is this change mandatory? looks out of place.
> 
Arguably an optimisation. I can leave it out.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-tcp: sanitize request list handling
  2025-04-14 12:35     ` Hannes Reinecke
@ 2025-04-14 20:29       ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2025-04-14 20:29 UTC (permalink / raw)
  To: Hannes Reinecke, Hannes Reinecke, Christoph Hellwig
  Cc: Keith Busch, linux-nvme



On 14/04/2025 15:35, Hannes Reinecke wrote:
> On 4/14/25 01:00, Sagi Grimberg wrote:
>>
>>
>> On 03/04/2025 9:55, Hannes Reinecke wrote:
>>> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
>>> part of any list, otherwise a malicious R2T PDU might inject a
>>> loop in request list processing.
>>
>> Not clear what do you mean by "malicious R2T PDU".
>> Can you please clarify what you have seen/observed in the commit msg 
>> (i.e. what led you to this patch)?
>>
> This is coming from code inspection only.
> In nvme_tcp_handle_r2t() we are looking up a request by the 'command_id'
> value in the pdu, and then add it to 'queue->req_list' without further
> checking.
> So if a malicious R2T packet is received containing the command_id of
> a command currently on 'queue->req_list' we'll end up with duplicate
> entries on the list.

I think this falls into a bucket of a bunch checks that the host does 
not do. But ok.

>
> [ .. ]
>
>>> @@ -773,7 +785,8 @@ static int nvme_tcp_handle_r2t(struct 
>>> nvme_tcp_queue *queue,
>>>       nvme_tcp_setup_h2c_data_pdu(req);
>>>       llist_add(&req->lentry, &queue->req_list);
>>> -    queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> +    if (list_empty(&queue->send_list))
>>> +        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>
>> Is this change mandatory? looks out of place.
>>
> Arguably an optimisation. I can leave it out.

When it comes in this context it is not clear what it is designed to do. 
Don't mind having it,
but lets separate it from this patch.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-04-14 20:29 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-07 13:27 [PATCH 0/3] nvme-tcp: fixup I/O stall on congested sockets Hannes Reinecke
2025-03-07 13:28 ` [PATCH 1/3] nvme-tcp: avoid inline sending when handling R2T PDUs Hannes Reinecke
2025-03-11 18:40   ` Chris Leech
2025-03-12  8:14     ` Hannes Reinecke
2025-03-07 13:28 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
2025-03-11 18:53   ` Chris Leech
2025-03-12  8:16     ` Hannes Reinecke
2025-03-12 12:27       ` Maurizio Lombardi
2025-03-07 13:28 ` [PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets Hannes Reinecke
2025-03-11 18:55   ` Chris Leech
2025-03-10 17:02 ` [PATCH 0/3] nvme-tcp: fixup I/O stall " Keith Busch
  -- strict thread matches above, loose matches on Subject: below --
2025-04-03  6:55 [PATCHv3 " Hannes Reinecke
2025-04-03  6:55 ` [PATCH 2/3] nvme-tcp: sanitize request list handling Hannes Reinecke
2025-04-13 23:00   ` Sagi Grimberg
2025-04-14 12:35     ` Hannes Reinecke
2025-04-14 20:29       ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.