From: "Steve Wise" <swise@opengridcomputing.com>
To: "'Max Gurtovoy'" <maxg@mellanox.com>,
"'Sagi Grimberg'" <sagi@grimberg.me>,
"'Leon Romanovsky'" <leon@kernel.org>
Cc: "'Doug Ledford'" <dledford@redhat.com>,
"'Jason Gunthorpe'" <jgg@mellanox.com>,
"'RDMA mailing list'" <linux-rdma@vger.kernel.org>,
"'Saeed Mahameed'" <saeedm@mellanox.com>,
"'linux-netdev'" <netdev@vger.kernel.org>
Subject: RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
Date: Wed, 18 Jul 2018 14:29:14 -0500 [thread overview]
Message-ID: <02dc01d41ecd$9cc8a0b0$d659e210$@opengridcomputing.com> (raw)
In-Reply-To: <9a4d8d50-19b0-fcaa-d4a3-6cfa2318a973@mellanox.com>
[-- Attachment #1: Type: text/plain, Size: 6432 bytes --]
>
> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
> >
> >>> IMO we must fulfil the user's wish to connect to N queues and not
> >>> reduce it because of affinity overlaps. So in order to push Leon's
> >>> patch we must also fix blk_mq_rdma_map_queues to do a best-effort
> >>> mapping according to the affinity and map the rest in a naive way
> >>> (that way we will *always* map all the queues).
> >>
> >> That is what I would expect also. For example, on my node, where
> >> there are 16 cpus and 2 numa nodes, I observe much better nvmf IOPS
> >> performance by setting up my 16 driver completion event queues such
> >> that each is bound to a node-local cpu. So I end up with each
> >> node-local cpu having 2 queues bound to it. Without adding support
> >> in iw_cxgb4 for ib_get_vector_affinity(), this works fine. I assumed
> >> adding ib_get_vector_affinity() would allow this to all "just work"
> >> by default, but I'm running into this connection failure issue.
> >>
> >> I don't understand exactly what the blk_mq layer is trying to do, but
> >> I assume it has ingress event queues and processing that it is trying
> >> to align with the driver's ingress cq event handling, so everybody
> >> stays on the same cpu (or at least the same node). But something else
> >> is going on. Is there documentation on how this works somewhere?
> >
> > Does this (untested) patch help?
>
> I'm not sure (I'll test it tomorrow), because the issue is the unmapped
> queues and not the cpus.
> For example, if the affinity of q=6 and q=12 returns the same cpumask,
> then q=6 will not be mapped and will fail to connect.
>
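To make that overlap problem concrete, here is the heart of the current affinity pass in blk_mq_rdma_map_queues() (paraphrased and trimmed, so treat it as a sketch rather than the exact upstream code). When two vectors report the same cpumask, the later queue simply overwrites the earlier queue's entries, and the earlier queue is left with no cpu at all:

        for (queue = 0; queue < set->nr_hw_queues; queue++) {
                mask = ib_get_vector_affinity(dev, first_vec + queue);
                if (!mask)
                        goto fallback;

                /* Last writer wins: if a later vector reports the same
                 * cpumask, these entries get overwritten, so the earlier
                 * queue ends up with no cpu and its connect fails.
                 */
                for_each_cpu(cpu, mask)
                        set->mq_map[cpu] = queue;
        }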
Attached is a patch that applies cleanly for me. It has problems if the vectors have affinity to more than one cpu:
[ 2031.988881] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
[ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
[ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15 vector 15
[ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15 vector 15
[ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15 vector 15
[ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15 vector 15
[ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15 vector 15
[ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
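For reference, here is a quick user-space sketch of the patched mapping logic (a simplified stand-alone model, not the kernel code) that shows why mq_map comes out this way when every vector reports mask 0xff00, i.e. cpus 8-15:

#include <stdio.h>

#define NR_CPUS   16
#define NR_QUEUES 16
#define UNMAPPED  (~0u)

int main(void)
{
        unsigned int mq_map[NR_CPUS];
        unsigned int cpu, queue;
        /* Every comp_vector reports cpumask 0xff00 (cpus 8-15). */
        unsigned int affinity = 0xff00;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                mq_map[cpu] = UNMAPPED;

        /* Affinity pass: the last queue touching a cpu wins, so cpus
         * 8-15 all end up pointing at queue 15. */
        for (queue = 0; queue < NR_QUEUES; queue++)
                for (cpu = 0; cpu < NR_CPUS; cpu++)
                        if (affinity & (1u << cpu))
                                mq_map[cpu] = queue;

        /* Fallback pass: cpus untouched by any affinity mask (0-7)
         * get the naive sequential mapping, i.e. queues 0-7. */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (mq_map[cpu] == UNMAPPED)
                        mq_map[cpu] = cpu % NR_QUEUES;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                printf("mq_map[%u] = queue %u\n", cpu, mq_map[cpu]);

        return 0;
}

Running it gives exactly the mq_map layout in the log: cpus 0-7 fall back to queues 0-7 and cpus 8-15 all land on queue 15, leaving queues 8-14 with no cpu, which is why one of them (nvme queue 9) fails to connect.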
But if I set all my vector affinities to single cpus, using only cpus in the same numa node, it now works:
[ 2311.884397] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 2311.890103] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 2311.895659] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 2311.901211] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 2311.906758] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[ 2311.912390] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[ 2311.918014] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[ 2311.923627] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[ 2311.929213] iw_cxgb4: comp_vector 8, irq 211 mask 0x100
[ 2311.934694] iw_cxgb4: comp_vector 9, irq 212 mask 0x200
[ 2311.940163] iw_cxgb4: comp_vector 10, irq 213 mask 0x400
[ 2311.945716] iw_cxgb4: comp_vector 11, irq 214 mask 0x800
[ 2311.951272] iw_cxgb4: comp_vector 12, irq 215 mask 0x1000
[ 2311.956914] iw_cxgb4: comp_vector 13, irq 216 mask 0x2000
[ 2311.962558] iw_cxgb4: comp_vector 14, irq 217 mask 0x4000
[ 2311.968201] iw_cxgb4: comp_vector 15, irq 218 mask 0x8000
[ 2311.973845] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2311.980367] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2311.986885] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2311.993402] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2311.999919] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2312.006436] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2312.012956] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2312.019473] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2312.025991] blk_mq_rdma_map_queues: set->mq_map[8] queue 8 vector 8
[ 2312.032511] blk_mq_rdma_map_queues: set->mq_map[9] queue 9 vector 9
[ 2312.039030] blk_mq_rdma_map_queues: set->mq_map[10] queue 10 vector 10
[ 2312.045801] blk_mq_rdma_map_queues: set->mq_map[11] queue 11 vector 11
[ 2312.052572] blk_mq_rdma_map_queues: set->mq_map[12] queue 12 vector 12
[ 2312.059341] blk_mq_rdma_map_queues: set->mq_map[13] queue 13 vector 13
[ 2312.066111] blk_mq_rdma_map_queues: set->mq_map[14] queue 14 vector 14
[ 2312.072879] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2312.081926] nvme nvme0: new ctrl: NQN "nvme-nullb0", addr 172.16.2.1:4420
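As an aside, the only driver-side requirement for any of this is a get_vector_affinity op that reports the cpumask of the IRQ behind each completion vector. A purely hypothetical sketch follows; the comp_vector_irq field is invented for illustration and this is not the actual iw_cxgb4 change:

/*
 * Hypothetical sketch only: back ib_get_vector_affinity() with the
 * affinity mask of the IRQ servicing the given completion vector.
 * The comp_vector -> IRQ lookup is driver specific; the field name
 * below is made up for illustration.
 */
static const struct cpumask *c4iw_get_vector_affinity(struct ib_device *ibdev,
                                                      int comp_vector)
{
        struct c4iw_dev *dev = to_c4iw_dev(ibdev);              /* assumed helper */
        unsigned int irq = dev->comp_vector_irq[comp_vector];   /* assumed field  */

        return irq_get_affinity_mask(irq);
}

/* ...wired up at device registration time (again illustrative):      */
/*      dev->ibdev.get_vector_affinity = c4iw_get_vector_affinity;    */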
[-- Attachment #2: sagi2.patch --]
[-- Type: application/octet-stream, Size: 3258 bytes --]
commit 5128aa16a366c78aaa7a96f3e5760f993e9edb3e
Author: Steve Wise <swise@opengridcomputing.com>
Date:   Wed Jul 18 06:53:32 2018 -0700

    sagi's patch2

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..02b888ff3c10 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
         return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
         unsigned int *map = set->mq_map;
         unsigned int nr_queues = set->nr_hw_queues;
-        unsigned int cpu, first_sibling;
+        unsigned int first_sibling;
 
-        for_each_possible_cpu(cpu) {
-                /*
-                 * First do sequential mapping between CPUs and queues.
-                 * In case we still have CPUs to map, and we have some number of
-                 * threads per cores then map sibling threads to the same queue for
-                 * performace optimizations.
-                 */
-                if (cpu < nr_queues) {
+        /*
+         * First do sequential mapping between CPUs and queues.
+         * In case we still have CPUs to map, and we have some number of
+         * threads per cores then map sibling threads to the same queue for
+         * performace optimizations.
+         */
+        if (cpu < nr_queues) {
+                map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+        } else {
+                first_sibling = get_first_sibling(cpu);
+                if (first_sibling == cpu)
                         map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                } else {
-                        first_sibling = get_first_sibling(cpu);
-                        if (first_sibling == cpu)
-                                map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-                        else
-                                map[cpu] = map[first_sibling];
-                }
+                else
+                        map[cpu] = map[first_sibling];
         }
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+        unsigned int cpu;
+
+        for_each_possible_cpu(cpu)
+                blk_mq_map_queue_to_cpu(set, cpu);
 
         return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..27210105a882 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
         const struct cpumask *mask;
         unsigned int queue, cpu;
 
+        /* reset all to  */
+        for_each_possible_cpu(cpu)
+                set->mq_map[cpu] = UINT_MAX;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
                 mask = ib_get_vector_affinity(dev, first_vec + queue);
                 if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
                         set->mq_map[cpu] = queue;
         }
 
+        for_each_possible_cpu(cpu) {
+                if (set->mq_map[cpu] == UINT_MAX)
+                        blk_mq_map_queue_to_cpu(set, cpu);
+        }
+
         return 0;
 
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..19e83d93a1d4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
                                      unsigned long timeout);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);