* [PATCHv4 0/3] nvme-tcp: fixup I/O stall on congested sockets
@ 2025-05-28 6:45 Hannes Reinecke
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
Hannes Reinecke
Hi all,
I have been chasing keep-alive timeouts with TLS enabled for the last few
days (weeks, even :-(). On larger setups (e.g. with 32 queues) the
connection never got established properly, as I kept hitting keep-alive
timeouts before the last queue was connected.
Turns out that occasionally we simply do not send the keep-alive request:
it has been added to the request list, but the io_work workqueue function
is never restarted, as it bails out after nvme_tcp_try_recv() returns
-EAGAIN.
During debugging I also found that we're quite lazy with the list
handling of requests, so I've added two preliminary patches to ensure
that all list elements are properly terminated.
As usual, comments and reviews are welcome.
Changes to v3:
- Drop merged patch to open-code nvme_tcp_queue_requests()
- Add patch to check for sk_stream_is_writeable()
Changes to v2:
- Removed AEN patches again
Changes to the original submission:
- Include reviews from Chris Leech
- Add patch to requeue namespace scan
- Add patch to re-read ANA log page
Hannes Reinecke (3):
nvme-tcp: sanitize request list handling
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: do not queue io_work on a full socket
drivers/nvme/host/tcp.c | 30 ++++++++++++++++++++++++++----
1 file changed, 26 insertions(+), 4 deletions(-)
--
2.35.3
* [PATCH 1/3] nvme-tcp: sanitize request list handling
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke

Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of
any list, otherwise a malicious R2T PDU might inject a loop in request
list processing.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 853bc67d045c..69b89a3c09a1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -452,7 +452,8 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
 		return NULL;
 	}
 
-	list_del(&req->entry);
+	list_del_init(&req->entry);
+	init_llist_node(&req->lentry);
 	return req;
 }
 
@@ -560,6 +561,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	req->queue = queue;
 	nvme_req(rq)->ctrl = &ctrl->ctrl;
 	nvme_req(rq)->cmd = &pdu->cmd;
+	init_llist_node(&req->lentry);
+	INIT_LIST_HEAD(&req->entry);
 
 	return 0;
 }
@@ -764,6 +767,14 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 		return -EPROTO;
 	}
 
+	if (llist_on_list(&req->lentry) ||
+	    !list_empty(&req->entry)) {
+		dev_err(queue->ctrl->ctrl.device,
+			"req %d unexpected r2t while processing request\n",
+			rq->tag);
+		return -EPROTO;
+	}
+
 	req->pdu_len = 0;
 	req->h2cdata_left = r2t_length;
 	req->h2cdata_offset = r2t_offset;
@@ -2638,6 +2649,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	ctrl->async_req.offset = 0;
 	ctrl->async_req.curr_bio = NULL;
 	ctrl->async_req.data_len = 0;
+	init_llist_node(&ctrl->async_req.lentry);
+	INIT_LIST_HEAD(&ctrl->async_req.entry);
 
 	nvme_tcp_queue_request(&ctrl->async_req, true);
 }
-- 
2.35.3
* Re: [PATCH 1/3] nvme-tcp: sanitize request list handling
From: Sagi Grimberg @ 2025-05-29 12:34 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

(can you please add these when resending future versions?)
* [PATCH 2/3] nvme-tcp: fix I/O stalls on congested sockets
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke, Chris Leech

When the socket is busy processing, nvme_tcp_try_recv() might return
-EAGAIN, but this doesn't automatically imply that the sending side is
blocked, too. So check if there are pending requests once
nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop
to avoid I/O stalls.

Acked-by: Chris Leech <cleech@redhat.com>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69b89a3c09a1..0e178115dc04 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1361,7 +1361,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
 	queue->nr_cqe = 0;
 	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
 	release_sock(sk);
-	return consumed;
+	return consumed == -EAGAIN ? 0 : consumed;
 }
 
 static void nvme_tcp_io_work(struct work_struct *w)
@@ -1389,6 +1389,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		else if (unlikely(result < 0))
 			return;
 
+		/* did we get some space after spending time in recv? */
+		if (nvme_tcp_queue_has_pending(queue) &&
+		    sk_stream_is_writeable(queue->sock->sk))
+			pending = true;
+
 		if (!pending || !queue->rd_enabled)
 			return;
-- 
2.35.3
* Re: [PATCH 2/3] nvme-tcp: fix I/O stalls on congested sockets
From: Sagi Grimberg @ 2025-05-29 12:34 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme, Chris Leech

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Hannes Reinecke @ 2025-05-28 6:45 UTC
To: Sagi Grimberg
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme,
    Hannes Reinecke

There really is no point in scheduling io_work from ->queue_rq() if the
socket is full; we'll be notified by the ->write_space() callback once
space on the socket becomes available.
Consequently we also need to remove the check for
'sk_stream_is_writeable()' in the ->write_space() callback, as we need
to schedule io_work to receive packets even if the sending side is
blocked.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/nvme/host/tcp.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0e178115dc04..e4dd1620dc28 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	empty = llist_add(&req->lentry, &queue->req_list) &&
 		list_empty(&queue->send_list) && !queue->request;
 
+	if (!sk_stream_is_writeable(queue->sock->sk))
+		empty = false;
+
 	/*
 	 * if we're the first on the send_list and we can try to send
 	 * directly, otherwise queue io_work. Also, only do that if we
@@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 		mutex_unlock(&queue->send_mutex);
 	}
 
-	if (last && nvme_tcp_queue_has_pending(queue))
+	if (last && nvme_tcp_queue_has_pending(queue) &&
+	    sk_stream_is_writeable(queue->sock->sk))
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 }
 
@@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
 
 	read_lock_bh(&sk->sk_callback_lock);
 	queue = sk->sk_user_data;
-	if (likely(queue && sk_stream_is_writeable(sk))) {
+	if (likely(queue)) {
 		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
 	}
-- 
2.35.3
* Re: [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Sagi Grimberg @ 2025-05-29 12:38 UTC
To: Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

On 28/05/2025 9:45, Hannes Reinecke wrote:
> There really is no point in scheduling io_work from ->queue_rq()
> if the socket is full; we'll be notified by the ->write_space()
> callback once space on the socket becomes available.
> Consequently we also need to remove the check for 'sk_stream_is_writeable()'
> in the ->write_space() callback as we need to schedule io_work
> to receive packets even if the sending side is blocked.
>
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> ---
>  drivers/nvme/host/tcp.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 0e178115dc04..e4dd1620dc28 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  	empty = llist_add(&req->lentry, &queue->req_list) &&
>  		list_empty(&queue->send_list) && !queue->request;
>
> +	if (!sk_stream_is_writeable(queue->sock->sk))
> +		empty = false;

This is wrong. The fact that the socket is not writeable does not mean
that the queue is not empty (meaning we have requests that we have yet
to write to the socket).

> +
>  	/*
>  	 * if we're the first on the send_list and we can try to send
>  	 * directly, otherwise queue io_work. Also, only do that if we
> @@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>  		mutex_unlock(&queue->send_mutex);
>  	}
>
> -	if (last && nvme_tcp_queue_has_pending(queue))
> +	if (last && nvme_tcp_queue_has_pending(queue) &&
> +	    sk_stream_is_writeable(queue->sock->sk))

I think it's fine, because either this or the callback will effectively
not queue the work, as the work can only be queued once.

>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  }
>
> @@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>
>  	read_lock_bh(&sk->sk_callback_lock);
>  	queue = sk->sk_user_data;
> -	if (likely(queue && sk_stream_is_writeable(sk))) {
> +	if (likely(queue)) {
>  		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>  	}

I'd keep only this change. Unless you are convinced that the other
stream checks are correct?
* Re: [PATCH 3/3] nvme-tcp: do not queue io_work on a full socket
From: Hannes Reinecke @ 2025-05-30 6:42 UTC
To: Sagi Grimberg, Hannes Reinecke
Cc: Christoph Hellwig, Keith Busch, Kamaljit Singh, linux-nvme

On 5/29/25 14:38, Sagi Grimberg wrote:
>
> On 28/05/2025 9:45, Hannes Reinecke wrote:
>> There really is no point in scheduling io_work from ->queue_rq()
>> if the socket is full; we'll be notified by the ->write_space()
>> callback once space on the socket becomes available.
>> Consequently we also need to remove the check for
>> 'sk_stream_is_writeable()'
>> in the ->write_space() callback as we need to schedule io_work
>> to receive packets even if the sending side is blocked.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>> ---
>>  drivers/nvme/host/tcp.c | 8 ++++++--
>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0e178115dc04..e4dd1620dc28 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>  	empty = llist_add(&req->lentry, &queue->req_list) &&
>>  		list_empty(&queue->send_list) && !queue->request;
>>
>> +	if (!sk_stream_is_writeable(queue->sock->sk))
>> +		empty = false;
>
> This is wrong. The fact that the socket is not writeable does not mean
> that the queue is not empty (meaning we have requests that we have yet
> to write to the socket).
>
Admittedly I abuse 'empty' here; setting 'empty' to 'false' just avoids
the workqueue being run. If 'sk_stream_is_writeable()' is false,
starting the workqueue will just cause sendmsg() to return EAGAIN, and
we'll be notified via the 'write_space' callback anyway.

>> +
>>  	/*
>>  	 * if we're the first on the send_list and we can try to send
>>  	 * directly, otherwise queue io_work. Also, only do that if we
>> @@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>  		mutex_unlock(&queue->send_mutex);
>>  	}
>>
>> -	if (last && nvme_tcp_queue_has_pending(queue))
>> +	if (last && nvme_tcp_queue_has_pending(queue) &&
>> +	    sk_stream_is_writeable(queue->sock->sk))
>
> I think it's fine, because either this or the callback will effectively
> not queue the work, as the work can only be queued once.
>
>>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>  }
>>
>> @@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>
>>  	read_lock_bh(&sk->sk_callback_lock);
>>  	queue = sk->sk_user_data;
>> -	if (likely(queue && sk_stream_is_writeable(sk))) {
>> +	if (likely(queue)) {
>>  		clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>  		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>  	}
>
> I'd keep only this change. Unless you are convinced that the other
> stream checks are correct?

The gist of this patch was to move the checks for
'sk_stream_is_writeable' into io_work, to avoid it being queued when the
socket is blocked, as in these cases io_work will be restarted via the
callbacks. The prime reason here was to avoid recvmsg() or sendmsg()
returning EAGAIN, as this increases latency during command processing.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCHv4 0/3] nvme-tcp: fixup I/O stall on congested sockets
From: Christoph Hellwig @ 2025-06-04 8:04 UTC
To: Hannes Reinecke
Cc: Sagi Grimberg, Keith Busch, Kamaljit Singh, linux-nvme

I've added patches 1 and 2 to nvme-6.16.