Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

From: Sagi Grimberg <sagi@grimberg.me>
To: Steve Wise <swise@opengridcomputing.com>,
	Max Gurtovoy <maxg@mellanox.com>,
	Jason Gunthorpe <jgg@mellanox.com>
Cc: 'Leon Romanovsky' <leon@kernel.org>,
	'Doug Ledford' <dledford@redhat.com>,
	'RDMA mailing list' <linux-rdma@vger.kernel.org>,
	'Saeed Mahameed' <saeedm@mellanox.com>,
	'linux-netdev' <netdev@vger.kernel.org>
Subject: Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
Date: Thu, 16 Aug 2018 11:26:21 -0700
Message-ID: <4a13541c-db48-beca-4ee7-932528b22986@grimberg.me>
In-Reply-To: <47178d4d-f730-6e59-5c19-58331cc3864a@opengridcomputing.com>


> Let me know if you want me to try this or any particular fix.

Steve, can you test this one?
--
[PATCH rfc] block: fix rdma queue mapping

nvme-rdma attempts to map queues based on irq vector affinity.
However, for some devices, completion vector irq affinity is
configurable by the user, which can break the existing assumption
that irq vectors are optimally arranged over the host cpu cores.

So we map queues in two stages:
First, map each queue according to its completion vector IRQ
affinity, taking the first unmapped cpu in the vector affinity mask.
If the current irq affinity is arranged such that a vector is not
assigned to any distinct cpu, we map it to a cpu that is on the same
node. If numa affinity cannot be satisfied, we map it to any unmapped
cpu we can find. Then, map the remaining cpus in the possible cpumap
naively.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
Steve, can you test out this patch?
  block/blk-mq-cpumap.c  | 39 +++++++++++++-----------
  block/blk-mq-rdma.c    | 80 +++++++++++++++++++++++++++++++++++++++++++-------
  include/linux/blk-mq.h |  1 +
  3 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..34811db8cba9 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,30 +30,35 @@ static int get_first_sibling(unsigned int cpu)
         return cpu;
  }

-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
  {
         unsigned int *map = set->mq_map;
         unsigned int nr_queues = set->nr_hw_queues;
-       unsigned int cpu, first_sibling;
+       unsigned int first_sibling;

-       for_each_possible_cpu(cpu) {
-               /*
-                * First do sequential mapping between CPUs and queues.
-                * In case we still have CPUs to map, and we have some number of
-                * threads per cores then map sibling threads to the same queue for
-                * performace optimizations.
-                */
-               if (cpu < nr_queues) {
+       /*
+        * First do sequential mapping between CPUs and queues.
+        * In case we still have CPUs to map, and we have some number of
+        * threads per cores then map sibling threads to the same queue for
+        * performance optimizations.
+        */
+       if (cpu < nr_queues) {
+               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+       } else {
+               first_sibling = get_first_sibling(cpu);
+               if (first_sibling == cpu)
                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-               } else {
-                       first_sibling = get_first_sibling(cpu);
-                       if (first_sibling == cpu)
-                               map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                       else
-                               map[cpu] = map[first_sibling];
-               }
+               else
+                       map[cpu] = map[first_sibling];
         }
+}
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+       unsigned int cpu;

+       for_each_possible_cpu(cpu)
+               blk_mq_map_queue_cpu(set, cpu);
         return 0;
  }
  EXPORT_SYMBOL_GPL(blk_mq_map_queues);
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..d04cbb1925f5 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -14,6 +14,61 @@
  #include <linux/blk-mq-rdma.h>
  #include <rdma/ib_verbs.h>

+static int blk_mq_rdma_map_queue(struct blk_mq_tag_set *set,
+               struct ib_device *dev, int first_vec, unsigned int queue)
+{
+       const struct cpumask *mask;
+       unsigned int cpu;
+       bool mapped = false;
+
+       mask = ib_get_vector_affinity(dev, first_vec + queue);
+       if (!mask)
+               return -ENOTSUPP;
+
+       /* map with an unmapped cpu according to affinity mask */
+       for_each_cpu(cpu, mask) {
+               if (set->mq_map[cpu] == UINT_MAX) {
+                       set->mq_map[cpu] = queue;
+                       mapped = true;
+                       break;
+               }
+       }
+
+       if (!mapped) {
+               int n;
+
+               /* map with an unmapped cpu in the same numa node */
+               for_each_node(n) {
+                       const struct cpumask *node_cpumask = cpumask_of_node(n);
+
+                       if (mapped || !cpumask_intersects(mask, node_cpumask))
+                               continue;
+
+                       for_each_cpu(cpu, node_cpumask) {
+                               if (set->mq_map[cpu] == UINT_MAX) {
+                                       set->mq_map[cpu] = queue;
+                                       mapped = true;
+                                       break;
+                               }
+                       }
+               }
+       }
+
+       if (!mapped) {
+               /* map with any unmapped cpu we can find */
+               for_each_possible_cpu(cpu) {
+                       if (set->mq_map[cpu] == UINT_MAX) {
+                               set->mq_map[cpu] = queue;
+                               mapped = true;
+                               break;
+                       }
+               }
+       }
+
+       WARN_ON_ONCE(!mapped);
+       return 0;
+}
+
  /**
   * blk_mq_rdma_map_queues - provide a default queue mapping for rdma device
   * @set:       tagset to provide the mapping for
@@ -21,31 +76,36 @@
   * @first_vec: first interrupt vectors to use for queues (usually 0)
   *
   * This function assumes the rdma device @dev has at least as many available
- * interrupt vetors as @set has queues.  It will then query it's affinity mask
- * and built queue mapping that maps a queue to the CPUs that have irq affinity
- * for the corresponding vector.
+ * interrupt vectors as @set has queues.  It will then query vector affinity mask
+ * and attempt to build irq affinity aware queue mappings. If optimal affinity
+ * aware mapping cannot be achieved for a given queue, we look for any unmapped
+ * cpu to map it. Lastly, we map naively all other unmapped cpus in the mq_map.
   *
   * In case either the driver passed a @dev with less vectors than
   * @set->nr_hw_queues, or @dev does not provide an affinity mask for a
   * vector, we fallback to the naive mapping.
   */
  int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
-               struct ib_device *dev, int first_vec)
+                struct ib_device *dev, int first_vec)
  {
-       const struct cpumask *mask;
         unsigned int queue, cpu;

+       /* reset cpu mapping */
+       for_each_possible_cpu(cpu)
+               set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
-               mask = ib_get_vector_affinity(dev, first_vec + queue);
-               if (!mask)
+               if (blk_mq_rdma_map_queue(set, dev, first_vec, queue))
                         goto fallback;
+       }

-               for_each_cpu(cpu, mask)
-                       set->mq_map[cpu] = queue;
+       /* map any remaining unmapped cpus */
+       for_each_possible_cpu(cpu) {
+               if (set->mq_map[cpu] == UINT_MAX)
+                       blk_mq_map_queue_cpu(set, cpu);
         }

         return 0;
-
  fallback:
         return blk_mq_map_queues(set);
  }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d710e92874cc..6eb09c4de34f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -285,6 +285,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
                                      unsigned long timeout);

  int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
  void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);

  void blk_mq_quiesce_queue_nowait(struct request_queue *q);
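
For readers following the thread, the mapping order described in the
commit message can be sketched in user space roughly as below. This is
not the kernel code: cpuset_t, claim_first_unmapped(), map_queues() and
the 8-CPU machine size are hypothetical stand-ins, plain bitmasks
replace struct cpumask, and the final "cpu % nr_queues" step is only a
simplification standing in for blk_mq_map_queue_cpu().

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NR_CPUS  8         /* hypothetical machine size (assumption) */
#define UNMAPPED UINT_MAX  /* same sentinel value the patch stores in mq_map */

/* Hypothetical cpumask stand-in: bit c set means CPU c is in the set. */
typedef unsigned int cpuset_t;

/* Give the first still-unmapped CPU in 'candidates' to 'queue'. */
static bool claim_first_unmapped(unsigned int *map, cpuset_t candidates,
                                 unsigned int queue)
{
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if ((candidates & (1u << cpu)) && map[cpu] == UNMAPPED) {
            map[cpu] = queue;
            return true;
        }
    }
    return false;
}

/*
 * affinity[q]  : CPUs the completion vector of queue q may fire on
 * node_cpus[n] : CPUs that belong to NUMA node n
 */
static void map_queues(unsigned int *map, const cpuset_t *affinity,
                       unsigned int nr_queues, const cpuset_t *node_cpus,
                       unsigned int nr_nodes)
{
    const cpuset_t all_cpus = (1u << NR_CPUS) - 1;

    /* Reset the map so that "unmapped" can be detected. */
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        map[cpu] = UNMAPPED;

    /* Stage 1: every queue gets exactly one CPU, best candidate first. */
    for (unsigned int q = 0; q < nr_queues; q++) {
        /* a) an unmapped CPU inside the vector's affinity mask */
        bool mapped = claim_first_unmapped(map, affinity[q], q);

        /* b) an unmapped CPU on a node the affinity mask touches */
        for (unsigned int n = 0; !mapped && n < nr_nodes; n++) {
            if (affinity[q] & node_cpus[n])
                mapped = claim_first_unmapped(map, node_cpus[n], q);
        }

        /* c) any unmapped CPU at all */
        if (!mapped)
            claim_first_unmapped(map, all_cpus, q);
    }

    /* Stage 2: spread the leftover CPUs naively (simplified here). */
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (map[cpu] == UNMAPPED)
            map[cpu] = cpu % nr_queues;
    }
}
```

As an illustration, with two nodes node_cpus = { 0x0f, 0xf0 } and a
user-configured affinity = { 0x01, 0x01, 0x01, 0x10 } for four queues,
queue 0 takes CPU 0, queues 1 and 2 fall back to CPUs 1 and 2 on the
same node, and queue 3 takes CPU 4; the remaining CPUs are filled in by
stage 2.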

Thread overview: 35+ messages
2018-07-16  8:30 [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask Leon Romanovsky
2018-07-16 10:23 ` Sagi Grimberg
2018-07-16 10:30   ` Leon Romanovsky
2018-07-16 14:54     ` Max Gurtovoy
2018-07-16 14:59       ` Sagi Grimberg
2018-07-16 16:46         ` Max Gurtovoy
2018-07-16 17:08           ` Steve Wise
2018-07-17  8:46             ` Max Gurtovoy
2018-07-17  8:58               ` Leon Romanovsky
2018-07-17 10:05                 ` Max Gurtovoy
2018-07-17 13:03               ` Steve Wise
2018-07-18 11:38                 ` Sagi Grimberg
2018-07-18 14:14                   ` Max Gurtovoy
2018-07-18 14:25                     ` Steve Wise
2018-07-18 19:29                     ` Steve Wise
2018-07-19 14:50                       ` Max Gurtovoy
2018-07-19 18:45                         ` Steve Wise
2018-07-20  1:25                           ` Max Gurtovoy
2018-07-23 16:49                             ` Jason Gunthorpe
2018-07-23 16:53                               ` Max Gurtovoy
2018-07-30 15:47                                 ` Steve Wise
2018-07-31 10:00                                   ` Max Gurtovoy
2018-08-01  5:12                                 ` Sagi Grimberg
2018-08-01 14:27                                   ` Max Gurtovoy
2018-08-06 19:20                                     ` Steve Wise
2018-08-15  6:37                                       ` Leon Romanovsky
2018-08-16 18:26                                       ` Sagi Grimberg [this message]
2018-08-16 18:32                                         ` Steve Wise
2018-08-17 16:17                                           ` Steve Wise
2018-08-17 20:03                                             ` Sagi Grimberg
2018-08-17 20:17                                               ` Jason Gunthorpe
2018-08-17 20:26                                                 ` Sagi Grimberg
2018-08-17 21:28                                               ` Steve Wise
2018-07-24 15:24                             ` Steve Wise
2018-07-24 20:52                               ` Steve Wise
