From: Bart.VanAssche@sandisk.com (Bart Van Assche)
Subject: [PATCH v2] nvme-rdma: support devices with queue size < 32
Date: Mon, 10 Apr 2017 15:32:18 +0000
Message-ID: <1491838338.4199.5.camel@sandisk.com>
In-Reply-To: <1519881025.363156294.1491837154312.JavaMail.zimbra@kalray.eu>

On Mon, 2017-04-10 at 17:12 +0200, Marta Rybczynska wrote:
> In the case of small NVMe-oF queue size (<32) we may enter
> a deadlock caused by the fact that the IB completions aren't sent
> waiting for 32 and the send queue will fill up.
> 
> The error is seen as (using mlx5):
> [ 2048.693355] mlx5_0:mlx5_ib_post_send:3765:(pid 7273):
> [ 2048.693360] nvme nvme1: nvme_rdma_post_send failed with error code -12
> 
> This patch changes the way the signaling is done so
> that it depends on the queue depth now. The magic define has
> been removed completely.
> 
> Signed-off-by: Marta Rybczynska <marta.rybczynska at kalray.eu>
> Signed-off-by: Samuel Jones <sjones at kalray.eu>
> ---
> Changes from v1:
> * signal by queue size/2, remove hardcoded 32
> * support queue depth of 1
> 
>  drivers/nvme/host/rdma.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 47a479f..4de1b92 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1029,6 +1029,18 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
>                 nvme_rdma_wr_error(cq, wc, "SEND");
>  }
>  
> +static inline nvme_rdma_queue_sig_limit(struct nvme_rdma_queue *queue)
> +{
> +       int sig_limit;
> +
> +       /* We signal completion every queue depth/2 and also
> +        * handle the case of possible device with queue_depth=1,
> +        * where we would need to signal every message.
> +        */
> +       sig_limit = max(queue->queue_size / 2, 1);
> +       return (++queue->sig_count % sig_limit) == 0;
> +}
> +
>  static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>                 struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
>                 struct ib_send_wr *first, bool flush)
> @@ -1056,9 +1068,6 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>          * Would have been way to obvious to handle this in hardware or
>          * at least the RDMA stack..
>          *
> -        * This messy and racy code sniplet is copy and pasted from the iSER
> -        * initiator, and the magic '32' comes from there as well.
> -        *
>          * Always signal the flushes. The magic request used for the flush
>          * sequencer is not allocated in our driver's tagset and it's
>          * triggered to be freed by blk_cleanup_queue(). So we need to
> @@ -1066,7 +1075,7 @@ static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
>          * embedded in request's payload, is not freed when __ib_process_cq()
>          * calls wr_cqe->done().
>          */
> -       if ((++queue->sig_count % 32) == 0 || flush)
> +       if (nvme_rdma_queue_sig_limit(queue) || flush)
>                 wr.send_flags |= IB_SEND_SIGNALED;
>  
>         if (first)

Hello Marta,

The approach of this patch is suboptimal from a performance point of view.
Tracking the number of WRs that have been submitted since the last
signaled WR in a member variable would make it possible to get rid of
the (relatively slow) division operation.

Bart.
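
The suggestion above can be illustrated with a minimal userspace sketch.
It is not the driver code and not necessarily what was later merged:
struct sketch_queue, sketch_queue_init() and sketch_should_signal() are
hypothetical names, and the limit of max(queue_size / 2, 1) is simply
carried over from the quoted patch. The idea is to compute the limit once
at setup time and to count the WRs posted since the last signaled one, so
that the per-post path only increments and compares:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant members of struct nvme_rdma_queue. */
struct sketch_queue {
	int queue_size;	/* send queue depth */
	int sig_limit;	/* computed once at init: max(queue_size / 2, 1) */
	int unsignaled;	/* WRs posted since the last signaled WR */
};

/* Assumption: called once when the queue is set up, outside the hot path. */
static void sketch_queue_init(struct sketch_queue *q, int queue_size)
{
	int half = queue_size / 2;

	assert(queue_size >= 1);
	q->queue_size = queue_size;
	q->sig_limit = half > 1 ? half : 1;
	q->unsignaled = 0;
}

/*
 * Returns true when the WR about to be posted should be signaled.
 * Only an increment and a comparison are needed per post; there is
 * no modulo or division here.
 */
static bool sketch_should_signal(struct sketch_queue *q)
{
	if (++q->unsignaled < q->sig_limit)
		return false;

	q->unsignaled = 0;
	return true;
}
```

In this sketch a caller would set IB_SEND_SIGNALED when
sketch_should_signal() returns true or when the request is a flush. A
queue depth of 1 gives a limit of 1, so every WR is signaled.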

