public inbox for linux-nvme@lists.infradead.org
* [PATCH] nvme-tcp: Fix regression that causes sporadic requests to time out
@ 2022-09-05 15:07 Sagi Grimberg
  2022-09-05 16:19 ` Jonathan Nicklin
  2022-09-06  4:43 ` Christoph Hellwig
  0 siblings, 2 replies; 3+ messages in thread
From: Sagi Grimberg @ 2022-09-05 15:07 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig
  Cc: Keith Busch, Chaitanya Kulkarni, Jonathan Nicklin

When we queue requests, we strive to batch as much as possible, and we
also signal the network stack that more data is about to be sent over a
socket with MSG_SENDPAGE_NOTLAST. The decision to set this flag looks at
the pending requests queued, as well as queue->more_requests, which is
derived from the block layer last-in-batch indication.

We set more_requests=true when we flush the request directly from the
.queue_rq submission context (in nvme_tcp_send_all). However, this
wrongly assumes that no other requests may be queued during the
execution of nvme_tcp_send_all.

Due to this, a race condition may happen where:
1. request X is queued as !last-in-batch
2. request X submission context calls nvme_tcp_send_all directly
3. nvme_tcp_send_all is preempted and schedules to a different cpu
4. request Y is queued as last-in-batch
5. nvme_tcp_send_all context sends requests X+Y, but signals
MSG_SENDPAGE_NOTLAST for both because queue->more_requests=true.
==> neither request is pushed down to the wire, as the network
stack is waiting for more data; both requests time out.

To fix this, eliminate queue->more_requests and rely only on the
queue's req_list and send_list being non-empty.

Fixes: 122e5b9f3d37 ("nvme-tcp: optimize network stack with setting msg flags according to batch size")
Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/tcp.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 54b4e8a7fe42..d5871fd6f769 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -121,7 +121,6 @@ struct nvme_tcp_queue {
 	struct mutex		send_mutex;
 	struct llist_head	req_list;
 	struct list_head	send_list;
-	bool			more_requests;
 
 	/* recv state */
 	void			*pdu;
@@ -320,7 +319,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
 {
 	return !list_empty(&queue->send_list) ||
-		!llist_empty(&queue->req_list) || queue->more_requests;
+		!llist_empty(&queue->req_list);
 }
 
 static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
@@ -339,9 +338,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	 */
 	if (queue->io_cpu == raw_smp_processor_id() &&
 	    sync && empty && mutex_trylock(&queue->send_mutex)) {
-		queue->more_requests = !last;
 		nvme_tcp_send_all(queue);
-		queue->more_requests = false;
 		mutex_unlock(&queue->send_mutex);
 	}
 
-- 
2.34.1




* Re: [PATCH] nvme-tcp: Fix regression that causes sporadic requests to time out
  2022-09-05 15:07 [PATCH] nvme-tcp: Fix regression that causes sporadic requests to time out Sagi Grimberg
@ 2022-09-05 16:19 ` Jonathan Nicklin
  2022-09-06  4:43 ` Christoph Hellwig
  1 sibling, 0 replies; 3+ messages in thread
From: Jonathan Nicklin @ 2022-09-05 16:19 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni

Thanks, Sagi. I have tested the changes to eliminate queue->more_requests and
confirm that it addresses the bug (1000 runs of the repro scenario executed
without issue).

Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>

On Mon, Sep 5, 2022 at 11:07 AM Sagi Grimberg <sagi@grimberg.me> wrote:
> [...]



* Re: [PATCH] nvme-tcp: Fix regression that causes sporadic requests to time out
  2022-09-05 15:07 [PATCH] nvme-tcp: Fix regression that causes sporadic requests to time out Sagi Grimberg
  2022-09-05 16:19 ` Jonathan Nicklin
@ 2022-09-06  4:43 ` Christoph Hellwig
  1 sibling, 0 replies; 3+ messages in thread
From: Christoph Hellwig @ 2022-09-06  4:43 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, Christoph Hellwig, Keith Busch, Chaitanya Kulkarni,
	Jonathan Nicklin

Thanks,

applied to nvme-6.0.

