* [PATCHv4 0/3] nvme-tcp: fixup I/O stall on congested sockets
@ 2025-05-28 6:45 Hannes Reinecke
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
Hannes Reinecke
Hi all,
I have been chasing keep-alive timeouts with TLS enabled for the last few
days (weeks, even :-(). On larger setups (e.g. with 32 queues) the
connection never got established properly, as I kept hitting keep-alive
timeouts before the last queue was connected.
Turns out that occasionally we simply do not send the keep-alive request:
it has been added to the request list, but the io_work workqueue function
is never restarted, as it bails out after nvme_tcp_try_recv() returns
-EAGAIN.
During debugging I also found that we're quite lazy with the list
handling of requests, so I've added two preliminary patches to ensure
that all list elements are properly terminated.
As usual, comments and reviews are welcome.
Changes to v3:
- Drop merged patch to open-code nvme_tcp_queue_requests()
- Add patch to check for sk_stream_is_writeable()
Changes to v2:
- Removed AEN patches again
Changes to the original submission:
- Include reviews from Chris Leech
- Add patch to requeue namespace scan
- Add patch to re-read ANA log page
Hannes Reinecke (3):
nvme-tcp: sanitize request list handling
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: do not queue io_work on a full socket
drivers/nvme/host/tcp.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)
--
2.35.3
* [PATCH 1/3] nvme-tcp: sanitize request list handling
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke

Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of
any list, otherwise a malicious R2T PDU might inject a loop in request
list processing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 853bc67d045c..69b89a3c09a1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -452,7 +452,8 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
 		return NULL;
 	}
 
-	list_del(&req->entry);
+	list_del_init(&req->entry);
+	init_llist_node(&req->lentry);
 	return req;
 }
 
@@ -560,6 +561,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	req->queue = queue;
 	nvme_req(rq)->ctrl = &ctrl->ctrl;
 	nvme_req(rq)->cmd = &pdu->cmd;
+	init_llist_node(&req->lentry);
+	INIT_LIST_HEAD(&req->entry);
 
 	return 0;
 }
@@ -764,6 +767,14 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 		return -EPROTO;
 	}
 
+	if (llist_on_list(&req->lentry) ||
+	    !list_empty(&req->entry)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d unexpected r2t while processing request\n",
+			rq->tag);
+		return -EPROTO;
+	}
+
 	req->pdu_len = 0;
 	req->h2cdata_left = r2t_length;
 	req->h2cdata_offset = r2t_offset;
@@ -2638,6 +2649,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.offset = 0;
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
+	init_llist_node(&ctrl->async_req.lentry);
+	INIT_LIST_HEAD(&ctrl->async_req.entry);
 
 	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
-- 
2.35.3
* Re: [PATCH 1/3] nvme-tcp: sanitize request list handling
From: Sagi Grimberg @ 2025-05-29 12:34 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

(can you please add these when resending future versions?)
* [PATCH 2/3] nvme-tcp: fix I/O stalls on congested sockets
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke, Chris Leech

When the socket is busy processing, nvme_tcp_try_recv() might return
-EAGAIN, but this doesn't automatically imply that the sending side is
blocked, too. So check if there are pending requests once
nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop
to avoid I/O stalls.

Acked-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69b89a3c09a1..0e178115dc04 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1361,7 +1361,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
 	queue->nr_cqe = 0;
 	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
 	release_sock(sk);
-	return consumed;
+	return consumed == -EAGAIN ? 0 : consumed;
 }
 
 static void nvme_tcp_io_work(struct work_struct *w)
@@ -1389,6 +1389,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		else if (unlikely(result < 0))
 			return;
 
+		/* did we get some space after spending time in recv? */
+		if (nvme_tcp_queue_has_pending(queue) &&
+		    sk_stream_is_writeable(queue->sock->sk))
+			pending = true;
+
 		if (!pending || !queue->rd_enabled)
 			return;
-- 
2.35.3
* Re: [PATCH 2/3] nvme-tcp: fix I/O stalls on congested sockets
From: Sagi Grimberg @ 2025-05-29 12:34 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme, Chris Leech

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke

There really is no point in scheduling io_work from ->queue_rq() if the
socket is full; we'll be notified by the ->write_space() callback once
space on the socket becomes available.
Consequently we also need to remove the check for
'sk_stream_is_writeable()' in the ->write_space() callback, as we need
to schedule io_work to receive packets even if the sending side is
blocked.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0e178115dc04..e4dd1620dc28 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	empty = llist_add(&req->lentry, &queue->req_list) &&
 		list_empty(&queue->send_list) && !queue->request;
 
+	if (!sk_stream_is_writeable(queue->sock->sk))
+		empty = false;
+
 	/*
 	 * if we're the first on the send_list and we can try to send
 	 * directly, otherwise queue io_work. Also, only do that if we
@@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		mutex_unlock(&queue->send_mutex);
 	}
 
-	if (last && nvme_tcp_queue_has_pending(queue))
+	if (last && nvme_tcp_queue_has_pending(queue) &&
+	    sk_stream_is_writeable(queue->sock->sk))
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 }
 
@@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
 
 	read_lock_bh(&sk->sk_callback_lock);
 	queue = sk->sk_user_data;
-	if (likely(queue && sk_stream_is_writeable(sk))) {
+	if (likely(queue)) {
 		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
-- 
2.35.3
* Re: [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Sagi Grimberg @ 2025-05-29 12:38 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

On 28/05/2025 9:45, Hannes Reinecke wrote:
> There really is no point in scheduling io_work from ->queue_rq()
> if the socket is full; we'll be notified by the ->write_space()
> callback once space on the socket becomes available.
> Consequently we also need to remove the check for 'sk_stream_is_writeable()'
> in the ->write_space() callback as we need to schedule io_work
> to receive packets even if the sending side is blocked.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/tcp.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0e178115dc04..e4dd1620dc28 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  	empty = llist_add(&req->lentry, &queue->req_list) &&
>  		list_empty(&queue->send_list) && !queue->request;
>
> +	if (!sk_stream_is_writeable(queue->sock->sk))
> +		empty = false;

This is wrong. The fact that the socket is not writeable does not mean
that the queue is not empty (meaning we have requests that we have yet
to write to the socket).

> +
>  	/*
>  	 * if we're the first on the send_list and we can try to send
>  	 * directly, otherwise queue io_work. Also, only do that if we
> @@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  		mutex_unlock(&queue->send_mutex);
>  	}
>
> -	if (last && nvme_tcp_queue_has_pending(queue))
> +	if (last && nvme_tcp_queue_has_pending(queue) &&
> +	    sk_stream_is_writeable(queue->sock->sk))

I think it's fine, because either this or the callback will effectively
not queue the work, as the work can only be queued once.

>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  }
>
> @@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>
>  	read_lock_bh(&sk->sk_callback_lock);
>  	queue = sk->sk_user_data;
> -	if (likely(queue && sk_stream_is_writeable(sk))) {
> +	if (likely(queue)) {
>  		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  	}

I'd keep only this change. Unless you are convinced that the other
stream checks are correct?
* Re: [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Hannes Reinecke @ 2025-05-30 6:42 UTC
To: Sagi Grimberg, Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

On 5/29/25 14:38, Sagi Grimberg wrote:
>
> On 28/05/2025 9:45, Hannes Reinecke wrote:
>> There really is no point in scheduling io_work from ->queue_rq()
>> if the socket is full; we'll be notified by the ->write_space()
>> callback once space on the socket becomes available.
>> Consequently we also need to remove the check for
>> 'sk_stream_is_writeable()'
>> in the ->write_space() callback as we need to schedule io_work
>> to receive packets even if the sending side is blocked.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>  drivers/nvme/host/tcp.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0e178115dc04..e4dd1620dc28 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>  	empty = llist_add(&req->lentry, &queue->req_list) &&
>>  		list_empty(&queue->send_list) && !queue->request;
>>
>> +	if (!sk_stream_is_writeable(queue->sock->sk))
>> +		empty = false;
>
> This is wrong. The fact that the socket is not writeable does not mean
> that the queue is not empty (meaning we have requests that we have yet
> to write to the socket).
>
Admittedly I abuse 'empty' here; setting 'empty' to 'false' just avoids
the workqueue being run. If 'sk_stream_is_writeable()' is false,
starting the workqueue will just cause sendmsg() to return EAGAIN, and
we'll be notified via the 'write_space' callback anyway.

>> +
>>  	/*
>>  	 * if we're the first on the send_list and we can try to send
>>  	 * directly, otherwise queue io_work. Also, only do that if we
>> @@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>  		mutex_unlock(&queue->send_mutex);
>>  	}
>>
>> -	if (last && nvme_tcp_queue_has_pending(queue))
>> +	if (last && nvme_tcp_queue_has_pending(queue) &&
>> +	    sk_stream_is_writeable(queue->sock->sk))
>
> I think it's fine, because either this or the callback will effectively
> not queue the work, as the work can only be queued once.
>
>>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>  }
>>
>> @@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>
>>  	read_lock_bh(&sk->sk_callback_lock);
>>  	queue = sk->sk_user_data;
>> -	if (likely(queue && sk_stream_is_writeable(sk))) {
>> +	if (likely(queue)) {
>>  		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>  	}
>
> I'd keep only this change. Unless you are convinced that the other
> stream checks are correct?

The gist of this patch was to move the checks for
'sk_stream_is_writeable' into io_work, to avoid it being queued when the
socket is blocked, as in these cases io_work will be restarted via the
callbacks. The prime reason here was to avoid recvmsg() or sendmsg()
returning EAGAIN, as this increases latency during command processing.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv4 0/3] nvme-tcp: fixup I/O stall on congested sockets
From: Christoph Hellwig @ 2025-06-04 8:04 UTC
To: Hannes Reinecke
Cc: Sagi Grimberg, Keith Busch, Kamaljit Singh, linux-nvme

I've added patches 1 and 2 to nvme-6.16.