RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Steve Wise" <swise@opengridcomputing.com>
To: 'Max Gurtovoy' <maxg@mellanox.com>,
	'Sagi Grimberg' <sagi@grimberg.me>,
	'Leon Romanovsky' <leon@kernel.org>
Cc: 'Doug Ledford' <dledford@redhat.com>,
	'Jason Gunthorpe' <jgg@mellanox.com>,
	'RDMA mailing list' <linux-rdma@vger.kernel.org>,
	'Saeed Mahameed' <saeedm@mellanox.com>,
	'linux-netdev' <netdev@vger.kernel.org>
Subject: RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
Date: Wed, 18 Jul 2018 14:29:14 -0500	[thread overview]
Message-ID: <02dc01d41ecd$9cc8a0b0$d659e210$@opengridcomputing.com> (raw)
In-Reply-To: <9a4d8d50-19b0-fcaa-d4a3-6cfa2318a973@mellanox.com>

[-- Attachment #1: Type: text/plain, Size: 6432 bytes --]


> 
> On 7/18/2018 2:38 PM, Sagi Grimberg wrote:
> >
> >>> IMO we must fulfil the user wish to connect to N queues and not reduce
> >>> it because of affinity overlaps. So in order to push Leon's patch we
> >>> must also fix the blk_mq_rdma_map_queues to do a best effort
> mapping
> >>> according the affinity and map the rest in naive way (in that way we
> >>> will *always* map all the queues).
> >>
> >> That is what I would expect also.   For example, in my node, where
> >> there are
> >> 16 cpus, and 2 numa nodes, I observe much better nvmf IOPS
> performance by
> >> setting up my 16 driver completion event queues such that each is
> >> bound to a
> >> node-local cpu.  So I end up with each nodel-local cpu having 2 queues
> >> bound
> >> to it.   W/O adding support in iw_cxgb4 for ib_get_vector_affinity(),
> >> this
> >> works fine.   I assumed adding ib_get_vector_affinity() would allow
> >> this to
> >> all "just work" by default, but I'm running into this connection failure
> >> issue.
> >>
> >> I don't understand exactly what the blk_mq layer is trying to do, but I
> >> assume it has ingress event queues and processing that it trying to align
> >> with the drivers ingress cq event handling, so everybody stays on the
> >> same
> >> cpu (or at least node).   But something else is going on.  Is there
> >> documentation on how this works somewhere?
> >
> > Does this (untested) patch help?
> 
> I'm not sure (I'll test it tomorrow) because the issue is the unmapped
> queues and not the cpus.
> for example, if the affinity of q=6 and q=12 returned the same cpumask
> than q=6 will not be mapped and will fail to connect.
>

Attached is a patch that applies cleanly for me.  It has problems if vectors have affinity to more than 1 cpu:

[ 2031.988881] iw_cxgb4: comp_vector 0, irq 203 mask 0xff00
[ 2031.994706] iw_cxgb4: comp_vector 1, irq 204 mask 0xff00
[ 2032.000348] iw_cxgb4: comp_vector 2, irq 205 mask 0xff00
[ 2032.005992] iw_cxgb4: comp_vector 3, irq 206 mask 0xff00
[ 2032.011629] iw_cxgb4: comp_vector 4, irq 207 mask 0xff00
[ 2032.017271] iw_cxgb4: comp_vector 5, irq 208 mask 0xff00
[ 2032.022901] iw_cxgb4: comp_vector 6, irq 209 mask 0xff00
[ 2032.028514] iw_cxgb4: comp_vector 7, irq 210 mask 0xff00
[ 2032.034110] iw_cxgb4: comp_vector 8, irq 211 mask 0xff00
[ 2032.039677] iw_cxgb4: comp_vector 9, irq 212 mask 0xff00
[ 2032.045244] iw_cxgb4: comp_vector 10, irq 213 mask 0xff00
[ 2032.050889] iw_cxgb4: comp_vector 11, irq 214 mask 0xff00
[ 2032.056531] iw_cxgb4: comp_vector 12, irq 215 mask 0xff00
[ 2032.062174] iw_cxgb4: comp_vector 13, irq 216 mask 0xff00
[ 2032.067817] iw_cxgb4: comp_vector 14, irq 217 mask 0xff00
[ 2032.073457] iw_cxgb4: comp_vector 15, irq 218 mask 0xff00
[ 2032.079102] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2032.085621] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2032.092139] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2032.098658] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2032.105177] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2032.111689] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2032.118208] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2032.124728] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2032.131246] blk_mq_rdma_map_queues: set->mq_map[8] queue 15 vector 15
[ 2032.137938] blk_mq_rdma_map_queues: set->mq_map[9] queue 15 vector 15
[ 2032.144629] blk_mq_rdma_map_queues: set->mq_map[10] queue 15 vector 15
[ 2032.151401] blk_mq_rdma_map_queues: set->mq_map[11] queue 15 vector 15
[ 2032.158172] blk_mq_rdma_map_queues: set->mq_map[12] queue 15 vector 15
[ 2032.164940] blk_mq_rdma_map_queues: set->mq_map[13] queue 15 vector 15
[ 2032.171709] blk_mq_rdma_map_queues: set->mq_map[14] queue 15 vector 15
[ 2032.178477] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2032.187409] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
[ 2032.194376] nvme nvme0: failed to connect queue: 9 ret=-18
 
But if I set all my vector affinities single cpus but only those in the same numa node, it now works:

[ 2311.884397] iw_cxgb4: comp_vector 0, irq 203 mask 0x100
[ 2311.890103] iw_cxgb4: comp_vector 1, irq 204 mask 0x200
[ 2311.895659] iw_cxgb4: comp_vector 2, irq 205 mask 0x400
[ 2311.901211] iw_cxgb4: comp_vector 3, irq 206 mask 0x800
[ 2311.906758] iw_cxgb4: comp_vector 4, irq 207 mask 0x1000
[ 2311.912390] iw_cxgb4: comp_vector 5, irq 208 mask 0x2000
[ 2311.918014] iw_cxgb4: comp_vector 6, irq 209 mask 0x4000
[ 2311.923627] iw_cxgb4: comp_vector 7, irq 210 mask 0x8000
[ 2311.929213] iw_cxgb4: comp_vector 8, irq 211 mask 0x100
[ 2311.934694] iw_cxgb4: comp_vector 9, irq 212 mask 0x200
[ 2311.940163] iw_cxgb4: comp_vector 10, irq 213 mask 0x400
[ 2311.945716] iw_cxgb4: comp_vector 11, irq 214 mask 0x800
[ 2311.951272] iw_cxgb4: comp_vector 12, irq 215 mask 0x1000
[ 2311.956914] iw_cxgb4: comp_vector 13, irq 216 mask 0x2000
[ 2311.962558] iw_cxgb4: comp_vector 14, irq 217 mask 0x4000
[ 2311.968201] iw_cxgb4: comp_vector 15, irq 218 mask 0x8000
[ 2311.973845] blk_mq_rdma_map_queues: set->mq_map[0] queue 0 vector 0
[ 2311.980367] blk_mq_rdma_map_queues: set->mq_map[1] queue 1 vector 1
[ 2311.986885] blk_mq_rdma_map_queues: set->mq_map[2] queue 2 vector 2
[ 2311.993402] blk_mq_rdma_map_queues: set->mq_map[3] queue 3 vector 3
[ 2311.999919] blk_mq_rdma_map_queues: set->mq_map[4] queue 4 vector 4
[ 2312.006436] blk_mq_rdma_map_queues: set->mq_map[5] queue 5 vector 5
[ 2312.012956] blk_mq_rdma_map_queues: set->mq_map[6] queue 6 vector 6
[ 2312.019473] blk_mq_rdma_map_queues: set->mq_map[7] queue 7 vector 7
[ 2312.025991] blk_mq_rdma_map_queues: set->mq_map[8] queue 8 vector 8
[ 2312.032511] blk_mq_rdma_map_queues: set->mq_map[9] queue 9 vector 9
[ 2312.039030] blk_mq_rdma_map_queues: set->mq_map[10] queue 10 vector 10
[ 2312.045801] blk_mq_rdma_map_queues: set->mq_map[11] queue 11 vector 11
[ 2312.052572] blk_mq_rdma_map_queues: set->mq_map[12] queue 12 vector 12
[ 2312.059341] blk_mq_rdma_map_queues: set->mq_map[13] queue 13 vector 13
[ 2312.066111] blk_mq_rdma_map_queues: set->mq_map[14] queue 14 vector 14
[ 2312.072879] blk_mq_rdma_map_queues: set->mq_map[15] queue 15 vector 15
[ 2312.081926] nvme nvme0: new ctrl: NQN "nvme-nullb0", addr 172.16.2.1:4420


[-- Attachment #2: sagi2.patch --]
[-- Type: application/octet-stream, Size: 3258 bytes --]

commit 5128aa16a366c78aaa7a96f3e5760f993e9edb3e
Author: Steve Wise <swise@opengridcomputing.com>
Date:   Wed Jul 18 06:53:32 2018 -0700

    sagi's patch2

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 3eb169f15842..02b888ff3c10 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -30,29 +30,36 @@ static int get_first_sibling(unsigned int cpu)
 	return cpu;
 }
 
-int blk_mq_map_queues(struct blk_mq_tag_set *set)
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu)
 {
 	unsigned int *map = set->mq_map;
 	unsigned int nr_queues = set->nr_hw_queues;
-	unsigned int cpu, first_sibling;
+	unsigned int first_sibling;
 
-	for_each_possible_cpu(cpu) {
-		/*
-		 * First do sequential mapping between CPUs and queues.
-		 * In case we still have CPUs to map, and we have some number of
-		 * threads per cores then map sibling threads to the same queue for
-		 * performace optimizations.
-		 */
-		if (cpu < nr_queues) {
+	/*
+	 * First do sequential mapping between CPUs and queues.
+	 * In case we still have CPUs to map, and we have some number of
+	 * threads per cores then map sibling threads to the same queue for
+	 * performace optimizations.
+	 */
+	if (cpu < nr_queues) {
+		map[cpu] = cpu_to_queue_index(nr_queues, cpu);
+	} else {
+		first_sibling = get_first_sibling(cpu);
+		if (first_sibling == cpu)
 			map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-		} else {
-			first_sibling = get_first_sibling(cpu);
-			if (first_sibling == cpu)
-				map[cpu] = cpu_to_queue_index(nr_queues, cpu);
-			else
-				map[cpu] = map[first_sibling];
-		}
+		else
+			map[cpu] = map[first_sibling];
 	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_map_queue_to_cpu);
+
+int blk_mq_map_queues(struct blk_mq_tag_set *set)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		blk_mq_map_queue_to_cpu(set, cpu);
 
 	return 0;
 }
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..27210105a882 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,10 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
+	/* reset all to  */
+	for_each_possible_cpu(cpu)
+		set->mq_map[cpu] = UINT_MAX;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
@@ -44,6 +48,11 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 			set->mq_map[cpu] = queue;
 	}
 
+	for_each_possible_cpu(cpu) {
+		if (set->mq_map[cpu] == UINT_MAX)
+			blk_mq_map_queue_to_cpu(set, cpu);
+	}
+
 	return 0;
 
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e3147eb74222..19e83d93a1d4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -283,6 +283,7 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
+void blk_mq_map_queue_to_cpu(struct blk_mq_tag_set *set, unsigned int cpu);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);

next prev parent reply	other threads:[~2018-07-18 19:29 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-16  8:30 [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask Leon Romanovsky
2018-07-16 10:23 ` Sagi Grimberg
2018-07-16 10:30   ` Leon Romanovsky
2018-07-16 14:54     ` Max Gurtovoy
2018-07-16 14:59       ` Sagi Grimberg
2018-07-16 16:46         ` Max Gurtovoy
2018-07-16 17:08           ` Steve Wise
2018-07-17  8:46             ` Max Gurtovoy
2018-07-17  8:58               ` Leon Romanovsky
2018-07-17 10:05                 ` Max Gurtovoy
2018-07-17 13:03               ` Steve Wise
2018-07-18 11:38                 ` Sagi Grimberg
2018-07-18 14:14                   ` Max Gurtovoy
2018-07-18 14:25                     ` Steve Wise
2018-07-18 19:29                     ` Steve Wise [this message]
2018-07-19 14:50                       ` Max Gurtovoy
2018-07-19 18:45                         ` Steve Wise
2018-07-20  1:25                           ` Max Gurtovoy
2018-07-23 16:49                             ` Jason Gunthorpe
2018-07-23 16:53                               ` Max Gurtovoy
2018-07-30 15:47                                 ` Steve Wise
2018-07-31 10:00                                   ` Max Gurtovoy
2018-08-01  5:12                                 ` Sagi Grimberg
2018-08-01 14:27                                   ` Max Gurtovoy
2018-08-06 19:20                                     ` Steve Wise
2018-08-15  6:37                                       ` Leon Romanovsky
2018-08-16 18:26                                       ` Sagi Grimberg
2018-08-16 18:32                                         ` Steve Wise
2018-08-17 16:17                                           ` Steve Wise
2018-08-17 20:03                                             ` Sagi Grimberg
2018-08-17 20:17                                               ` Jason Gunthorpe
2018-08-17 20:26                                                 ` Sagi Grimberg
2018-08-17 21:28                                               ` Steve Wise
2018-07-24 15:24                             ` Steve Wise
2018-07-24 20:52                               ` Steve Wise

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:3eb169f1584 dfblob:02b888ff3c1 dfblob:996167f1de1
dfblob:27210105a88 dfblob:e3147eb7422 dfblob:19e83d93a1d )
 OR (
bs:"RE: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='02dc01d41ecd$9cc8a0b0$d659e210$@opengridcomputing.com' \
    --to=swise@opengridcomputing.com \
    --cc=dledford@redhat.com \
    --cc=jgg@mellanox.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=maxg@mellanox.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@mellanox.com \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.