* [PATCH 0/3] nvme-pci: honor isolcpus configuration
@ 2024-06-21 13:53 Daniel Wagner
  2024-06-21 13:53 ` [PATCH 1/3] sched/isolation: Add io_queue housekeeping option Daniel Wagner
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Daniel Wagner @ 2024-06-21 13:53 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig
  Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme, Daniel Wagner

The nvme-pci driver ignores the isolcpus configuration. There were several
attempts to fix this in the past [1][2]. This is another attempt, this time
trying to address the earlier feedback and to solve it in the core code.

The first patch introduces a new option for isolcpus, 'io_queue', but I'm not
really sure whether this is needed or whether we could just use the managed_irq
option instead. I guess it depends on whether there is a use case which depends
on having queues on the isolated CPUs.
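
For illustration, with the io_queue sub-parameter from patch 1 the
configuration would look something like this (the CPU list is only an
example, chosen to match the patched output below):

    isolcpus=io_queue,2-3,6-7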

The second patch introduces a new block layer helper which returns the number
of possible queues. I suspect it would also make sense to make this helper a
bit smarter and have it consider the number of queues the hardware supports as
well.
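
A rough sketch of such a smarter helper, with a hypothetical max_hw_queues
argument for the hardware limit (not part of this series, just an illustration
of the idea):

    unsigned int blk_mq_num_possible_queues(unsigned int max_hw_queues)
    {
            const struct cpumask *io_queue_mask;
            unsigned int nr_cpus;

            /* only count housekeeping CPUs when io_queue isolation is set */
            io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
            if (!cpumask_empty(io_queue_mask))
                    nr_cpus = cpumask_weight(io_queue_mask);
            else
                    nr_cpus = num_possible_cpus();

            /* hypothetical: additionally cap by the hardware queue count */
            return max_hw_queues ? min(max_hw_queues, nr_cpus) : nr_cpus;
    }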

The last patch updates the group_cpus_evenly function so that it uses only the
housekeeping CPUs when they are defined.

Note, this series does not address the affinity setting of the admin queue
(queue 0). I'd like to address that once we have agreed on how to solve the
rest. Currently, the admin queue affinity can be controlled by the irqaffinity
command line option, so there is at least a workaround for it.
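
For example, something like this should keep the default IRQ affinity, and
with it the admin queue interrupt, on the housekeeping CPUs (example values
only, not verified here):

    irqaffinity=0-1,4-5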

Baseline:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 1536 MB
node 0 free: 1227 MB
node 1 cpus: 4 5 6 7
node 1 size: 1729 MB
node 1 free: 1422 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

options nvme write_queues=4 poll_queues=4

55: 0 41 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
63: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4-5
64: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 6-7
65: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0-1
66: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 2-3
67: 0 0 0 0 24 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
68: 0 0 0 0 0 1 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
69: 0 0 0 0 0 0 41 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 6
70: 0 0 0 0 0 0 0 3 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 7
71: 1 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 9-edge nvme0q9 affinity: 0
72: 0 18 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 10-edge nvme0q10 affinity: 1
73: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 11-edge nvme0q11 affinity: 2
74: 0 0 0 3 0 0 0 0 PCI-MSIX-0000:00:05.0 12-edge nvme0q12 affinity: 3

queue mapping for /dev/nvme0n1
        hctx0: default 4 5
        hctx1: default 6 7
        hctx2: default 0 1
        hctx3: default 2 3
        hctx4: read 4
        hctx5: read 5
        hctx6: read 6
        hctx7: read 7
        hctx8: read 0
        hctx9: read 1
        hctx10: read 2
        hctx11: read 3
        hctx12: poll 4 5
        hctx13: poll 6 7
        hctx14: poll 0 1
        hctx15: poll 2 3

PCI name is 00:05.0: nvme0n1
        irq 55, cpu list 0-3, effective list 1
        irq 63, cpu list 4-5, effective list 5
        irq 64, cpu list 6-7, effective list 7
        irq 65, cpu list 0-1, effective list 1
        irq 66, cpu list 2-3, effective list 3
        irq 67, cpu list 4, effective list 4
        irq 68, cpu list 5, effective list 5
        irq 69, cpu list 6, effective list 6
        irq 70, cpu list 7, effective list 7
        irq 71, cpu list 0, effective list 0
        irq 72, cpu list 1, effective list 1
        irq 73, cpu list 2, effective list 2
        irq 74, cpu list 3, effective list 3

Patched:

48: 0 0 33 0 0 0 0 0 PCI-MSIX-0000:00:05.0 0-edge nvme0q0 affinity: 0-3
58: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 1-edge nvme0q1 affinity: 4
59: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 2-edge nvme0q2 affinity: 5
60: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 3-edge nvme0q3 affinity: 0
61: 0 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 4-edge nvme0q4 affinity: 1
62: 0 0 0 0 45 0 0 0 PCI-MSIX-0000:00:05.0 5-edge nvme0q5 affinity: 4
63: 0 0 0 0 0 12 0 0 PCI-MSIX-0000:00:05.0 6-edge nvme0q6 affinity: 5
64: 2 0 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 7-edge nvme0q7 affinity: 0
65: 0 35 0 0 0 0 0 0 PCI-MSIX-0000:00:05.0 8-edge nvme0q8 affinity: 1

queue mapping for /dev/nvme0n1
        hctx0: default 2 3 4 6 7
        hctx1: default 5
        hctx2: default 0
        hctx3: default 1
        hctx4: read 4
        hctx5: read 5
        hctx6: read 0
        hctx7: read 1
        hctx8: poll 4
        hctx9: poll 5
        hctx10: poll 0
        hctx11: poll 1

PCI name is 00:05.0: nvme0n1
        irq 48, cpu list 0-3, effective list 2
        irq 58, cpu list 4, effective list 4
        irq 59, cpu list 5, effective list 5
        irq 60, cpu list 0, effective list 0
        irq 61, cpu list 1, effective list 1
        irq 62, cpu list 4, effective list 4
        irq 63, cpu list 5, effective list 5
        irq 64, cpu list 0, effective list 0
        irq 65, cpu list 1, effective list 1

[1] https://lore.kernel.org/lkml/20220423054331.GA17823@lst.de/T/#m9939195a465accbf83187caf346167c4242e798d
[2] https://lore.kernel.org/linux-nvme/87fruci5nj.ffs@tglx/

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
Daniel Wagner (3):
      sched/isolation: Add io_queue housekeeping option
      nvme-pci: limit queue count to housekeeping cpus
      lib/group_cpus.c: honor housekeeping config when grouping CPUs

 block/blk-mq-cpumap.c           | 13 +++++++++
 drivers/nvme/host/pci.c         |  4 +--
 include/linux/blk-mq.h          |  1 +
 include/linux/sched/isolation.h |  1 +
 kernel/sched/isolation.c        |  7 +++++
 lib/group_cpus.c                | 64 +++++++++++++++++++++++++++++++++++++++--
 6 files changed, 86 insertions(+), 4 deletions(-)
---
base-commit: 6ba59ff4227927d3a8530fc2973b80e94b54d58f
change-id: 20240620-isolcpus-io-queues-1a88eb47ff8b

Best regards,
-- 
Daniel Wagner <dwagner@suse.de>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-21 13:53 [PATCH 0/3] nvme-pci: honor isolcpus configuration Daniel Wagner
@ 2024-06-21 13:53 ` Daniel Wagner
  2024-06-22  5:11   ` Christoph Hellwig
  2024-06-21 13:53 ` [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus Daniel Wagner
  2024-06-21 13:53 ` [PATCH 3/3] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
  2 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-21 13:53 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig
  Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme, Daniel Wagner

Drivers such as nvme-pci spread their IO queues over all CPUs. This is not
necessarily an invalid setup when using isolcpus, but there are also users of
isolcpus who do not want any IO workload on the isolated CPUs.

Add a new housekeeping type so that the infrastructure code and drivers are
able to distinguish between the two setups.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 include/linux/sched/isolation.h | 1 +
 kernel/sched/isolation.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 2b461129d1fa..fe751d704e99 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -16,6 +16,7 @@ enum hk_type {
 	HK_TYPE_WQ,
 	HK_TYPE_MANAGED_IRQ,
 	HK_TYPE_KTHREAD,
+	HK_TYPE_IO_QUEUE,
 	HK_TYPE_MAX
 };
 
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5891e715f00d..91d7a434330c 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -18,6 +18,7 @@ enum hk_flags {
 	HK_FLAG_WQ		= BIT(HK_TYPE_WQ),
 	HK_FLAG_MANAGED_IRQ	= BIT(HK_TYPE_MANAGED_IRQ),
 	HK_FLAG_KTHREAD		= BIT(HK_TYPE_KTHREAD),
+	HK_FLAG_IO_QUEUE	= BIT(HK_TYPE_IO_QUEUE),
 };
 
 DEFINE_STATIC_KEY_FALSE(housekeeping_overridden);
@@ -228,6 +229,12 @@ static int __init housekeeping_isolcpus_setup(char *str)
 			continue;
 		}
 
+		if (!strncmp(str, "io_queue,", 9)) {
+			str += 9;
+			flags |= HK_FLAG_IO_QUEUE;
+			continue;
+		}
+
 		/*
 		 * Skip unknown sub-parameter and validate that it is not
 		 * containing an invalid character.

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus
  2024-06-21 13:53 [PATCH 0/3] nvme-pci: honor isolcpus configuration Daniel Wagner
  2024-06-21 13:53 ` [PATCH 1/3] sched/isolation: Add io_queue housekeeping option Daniel Wagner
@ 2024-06-21 13:53 ` Daniel Wagner
  2024-06-22  5:14   ` Christoph Hellwig
  2024-06-21 13:53 ` [PATCH 3/3] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
  2 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-21 13:53 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig
  Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme, Daniel Wagner

The nvme-pci driver ignores the isolcpus configuration and allocates IO queues
for all possible CPUs. This can add noise to an isolated CPU whenever IO is
issued on it, which is not always what the user wants. Thus, only ask for as
many queues as there are housekeeping CPUs.

Note, the placement of the queues will be addressed in the next patch.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq-cpumap.c   | 13 +++++++++++++
 drivers/nvme/host/pci.c |  4 ++--
 include/linux/blk-mq.h  |  1 +
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9638b25fd521..43c039900ef6 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -11,10 +11,23 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 #include "blk.h"
 #include "blk-mq.h"
 
+unsigned int blk_mq_num_possible_queues(void)
+{
+	const struct cpumask *io_queue_mask;
+
+	io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	if (!cpumask_empty(io_queue_mask))
+		return cpumask_weight(io_queue_mask);
+
+	return num_possible_cpus();
+}
+EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
+
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 {
 	const struct cpumask *masks;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 102a9fb0c65f..66999fa13b2c 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -81,7 +81,7 @@ static int io_queue_count_set(const char *val, const struct kernel_param *kp)
 	int ret;
 
 	ret = kstrtouint(val, 10, &n);
-	if (ret != 0 || n > num_possible_cpus())
+	if (ret != 0 || n > blk_mq_num_possible_queues())
 		return -EINVAL;
 	return param_set_uint(val, kp);
 }
@@ -2263,7 +2263,7 @@ static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
 	 */
 	if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
 		return 1;
-	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
+	return blk_mq_num_possible_queues() + dev->nr_write_queues + dev->nr_poll_queues;
 }
 
 static int nvme_setup_io_queues(struct nvme_dev *dev)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 89ba6b16fe8b..2105cc78ca67 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -900,6 +900,7 @@ void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 
+unsigned int blk_mq_num_possible_queues(void);
 void blk_mq_map_queues(struct blk_mq_queue_map *qmap);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] lib/group_cpus.c: honor housekeeping config when grouping CPUs
  2024-06-21 13:53 [PATCH 0/3] nvme-pci: honor isolcpus configuration Daniel Wagner
  2024-06-21 13:53 ` [PATCH 1/3] sched/isolation: Add io_queue housekeeping option Daniel Wagner
  2024-06-21 13:53 ` [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus Daniel Wagner
@ 2024-06-21 13:53 ` Daniel Wagner
  2024-06-22  5:13   ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-21 13:53 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig
  Cc: Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme, Daniel Wagner

group_cpus_evenly distributes all present CPUs into groups. This ignores the
isolcpus configuration and assigns isolated CPUs to groups as well.

Make group_cpus_evenly aware of the isolcpus configuration and use the
housekeeping CPU mask as the base for distributing the available CPUs into
groups.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 lib/group_cpus.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..f1517a44abc9 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -8,6 +8,7 @@
 #include <linux/cpu.h>
 #include <linux/sort.h>
 #include <linux/group_cpus.h>
+#include <linux/sched/isolation.h>
 
 #ifdef CONFIG_SMP
 
@@ -330,7 +331,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 }
 
 /**
- * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * group_possible_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
@@ -344,7 +345,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+static struct cpumask *group_possible_cpus_evenly(unsigned int numgrps)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
@@ -423,6 +424,65 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	}
 	return masks;
 }
+
+static struct cpumask *group_mask_cpus_evenly(unsigned int numgrps,
+					      const struct cpumask *cpu_mask)
+{
+	cpumask_var_t *node_to_cpumask;
+	cpumask_var_t nmsk;
+	int ret = -ENOMEM;
+	struct cpumask *masks = NULL;
+
+	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
+		return NULL;
+
+	node_to_cpumask = alloc_node_to_cpumask();
+	if (!node_to_cpumask)
+		goto fail_nmsk;
+
+	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
+	if (!masks)
+		goto fail_node_to_cpumask;
+
+	build_node_to_cpumask(node_to_cpumask);
+
+	ret = __group_cpus_evenly(0, numgrps, node_to_cpumask, cpu_mask, nmsk,
+				  masks);
+
+fail_node_to_cpumask:
+	free_node_to_cpumask(node_to_cpumask);
+
+fail_nmsk:
+	free_cpumask_var(nmsk);
+	if (ret < 0) {
+		kfree(masks);
+		return NULL;
+	}
+	return masks;
+}
+
+/**
+ * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
+ * @numgrps: number of groups
+ *
+ * Return: cpumask array if successful, NULL otherwise. And each element
+ * includes CPUs assigned to this group
+ *
+ * group_possible_cpus_evenly() is used for distributing the CPUs over
+ * all possible CPUs in the absence of the isolcpus command line argument.
+ * If the isolcpus argument is used with the io_queue option, only
+ * the housekeeping CPUs are considered.
+ */
+struct cpumask *group_cpus_evenly(unsigned int numgrps)
+{
+	const struct cpumask *io_queue_mask;
+
+	io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
+	if (!cpumask_empty(io_queue_mask))
+		return group_mask_cpus_evenly(numgrps, io_queue_mask);
+	else
+		return group_possible_cpus_evenly(numgrps);
+}
 #else /* CONFIG_SMP */
 struct cpumask *group_cpus_evenly(unsigned int numgrps)
 {

-- 
2.45.2


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-21 13:53 ` [PATCH 1/3] sched/isolation: Add io_queue housekeeping option Daniel Wagner
@ 2024-06-22  5:11   ` Christoph Hellwig
  2024-06-24  7:13     ` Daniel Wagner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-06-22  5:11 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, Ming Lei,
	linux-kernel, linux-block, linux-nvme

> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 2b461129d1fa..fe751d704e99 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -16,6 +16,7 @@ enum hk_type {
>  	HK_TYPE_WQ,
>  	HK_TYPE_MANAGED_IRQ,
>  	HK_TYPE_KTHREAD,
> +	HK_TYPE_IO_QUEUE,
>  	HK_TYPE_MAX
>  };

It might be a good time to write comments explaining these types?
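
Something along these lines perhaps (the wording is just a first stab, not
authoritative):

    enum hk_type {
            /* ... earlier entries unchanged ... */
            HK_TYPE_WQ,             /* unbound workqueue placement */
            HK_TYPE_MANAGED_IRQ,    /* managed interrupt affinity */
            HK_TYPE_KTHREAD,        /* unbound kthread placement */
            HK_TYPE_IO_QUEUE,       /* block layer IO queue placement */
            HK_TYPE_MAX
    };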


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] lib/group_cpus.c: honor housekeeping config when grouping CPUs
  2024-06-21 13:53 ` [PATCH 3/3] lib/group_cpus.c: honor housekeeping config when grouping CPUs Daniel Wagner
@ 2024-06-22  5:13   ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2024-06-22  5:13 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, Ming Lei,
	linux-kernel, linux-block, linux-nvme

Shouldn't this go before the current patch 2?

> +	io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +	if (!cpumask_empty(io_queue_mask))
> +		return group_mask_cpus_evenly(numgrps, io_queue_mask);
> +	else
> +		return group_possible_cpus_evenly(numgrps);
> +}

No need for an else after the return above.
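
I.e. something like:

        io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
        if (!cpumask_empty(io_queue_mask))
                return group_mask_cpus_evenly(numgrps, io_queue_mask);

        return group_possible_cpus_evenly(numgrps);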


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus
  2024-06-21 13:53 ` [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus Daniel Wagner
@ 2024-06-22  5:14   ` Christoph Hellwig
  2024-06-23  7:03     ` Sagi Grimberg
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-06-22  5:14 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Christoph Hellwig, Frederic Weisbecker, Mel Gorman,
	Hannes Reinecke, Sridhar Balaraman, brookxu.cn, Ming Lei,
	linux-kernel, linux-block, linux-nvme

> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
> index 9638b25fd521..43c039900ef6 100644
> --- a/block/blk-mq-cpumap.c
> +++ b/block/blk-mq-cpumap.c
> @@ -11,10 +11,23 @@
>  #include <linux/smp.h>
>  #include <linux/cpu.h>
>  #include <linux/group_cpus.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "blk.h"
>  #include "blk-mq.h"
>  
> +unsigned int blk_mq_num_possible_queues(void)
> +{
> +	const struct cpumask *io_queue_mask;
> +
> +	io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
> +	if (!cpumask_empty(io_queue_mask))
> +		return cpumask_weight(io_queue_mask);
> +
> +	return num_possible_cpus();
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);

This should be split into a separate patch.  And it could really use
a kerneldoc comment.
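
Something like this perhaps (just a sketch):

    /**
     * blk_mq_num_possible_queues - max number of queues worth allocating
     *
     * Return: the number of housekeeping CPUs when an io_queue housekeeping
     * mask is configured, otherwise the number of possible CPUs.
     */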

> -	return num_possible_cpus() + dev->nr_write_queues + dev->nr_poll_queues;
> +	return blk_mq_num_possible_queues() + dev->nr_write_queues + dev->nr_poll_queues;

Please avoid the overly long line here.
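
E.g.:

        return blk_mq_num_possible_queues() + dev->nr_write_queues +
               dev->nr_poll_queues;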

Otherwise this looks good to me.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme-pci: limit queue count to housekeeping cpus
  2024-06-22  5:14   ` Christoph Hellwig
@ 2024-06-23  7:03     ` Sagi Grimberg
  0 siblings, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2024-06-23  7:03 UTC (permalink / raw)
  To: Christoph Hellwig, Daniel Wagner
  Cc: Jens Axboe, Keith Busch, Thomas Gleixner, Frederic Weisbecker,
	Mel Gorman, Hannes Reinecke, Sridhar Balaraman, brookxu.cn,
	Ming Lei, linux-kernel, linux-block, linux-nvme



On 22/06/2024 8:14, Christoph Hellwig wrote:
>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>> index 9638b25fd521..43c039900ef6 100644
>> --- a/block/blk-mq-cpumap.c
>> +++ b/block/blk-mq-cpumap.c
>> @@ -11,10 +11,23 @@
>>   #include <linux/smp.h>
>>   #include <linux/cpu.h>
>>   #include <linux/group_cpus.h>
>> +#include <linux/sched/isolation.h>
>>   
>>   #include "blk.h"
>>   #include "blk-mq.h"
>>   
>> +unsigned int blk_mq_num_possible_queues(void)
>> +{
>> +	const struct cpumask *io_queue_mask;
>> +
>> +	io_queue_mask = housekeeping_cpumask(HK_TYPE_IO_QUEUE);
>> +	if (!cpumask_empty(io_queue_mask))
>> +		return cpumask_weight(io_queue_mask);
>> +
>> +	return num_possible_cpus();
>> +}
>> +EXPORT_SYMBOL_GPL(blk_mq_num_possible_queues);
> This should be split into a separate patch.  And it could really use
> a kerneldoc comment.

Agree.

Other than that, looks good.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-22  5:11   ` Christoph Hellwig
@ 2024-06-24  7:13     ` Daniel Wagner
  2024-06-24  8:47       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-24  7:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme

On Sat, Jun 22, 2024 at 07:11:57AM GMT, Christoph Hellwig wrote:
> > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > index 2b461129d1fa..fe751d704e99 100644
> > --- a/include/linux/sched/isolation.h
> > +++ b/include/linux/sched/isolation.h
> > @@ -16,6 +16,7 @@ enum hk_type {
> >  	HK_TYPE_WQ,
> >  	HK_TYPE_MANAGED_IRQ,
> >  	HK_TYPE_KTHREAD,
> > +	HK_TYPE_IO_QUEUE,
> >  	HK_TYPE_MAX
> >  };
> 
> It might be a good time to write comments explaining these types?

Sure, will do.

Do you think we should introduce a new type or just use the existing
managed_irq for this?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-24  7:13     ` Daniel Wagner
@ 2024-06-24  8:47       ` Christoph Hellwig
  2024-06-24  9:00         ` Daniel Wagner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2024-06-24  8:47 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Christoph Hellwig, Jens Axboe, Keith Busch, Sagi Grimberg,
	Thomas Gleixner, Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme

On Mon, Jun 24, 2024 at 09:13:06AM +0200, Daniel Wagner wrote:
> On Sat, Jun 22, 2024 at 07:11:57AM GMT, Christoph Hellwig wrote:
> > > diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> > > index 2b461129d1fa..fe751d704e99 100644
> > > --- a/include/linux/sched/isolation.h
> > > +++ b/include/linux/sched/isolation.h
> > > @@ -16,6 +16,7 @@ enum hk_type {
> > >  	HK_TYPE_WQ,
> > >  	HK_TYPE_MANAGED_IRQ,
> > >  	HK_TYPE_KTHREAD,
> > > +	HK_TYPE_IO_QUEUE,
> > >  	HK_TYPE_MAX
> > >  };
> > 
> > It might be a good time to write comments explaining these types?
> 
> Sure, will do.
> 
> Do you think we should introduce a new type or just use the existing
> managed_irq for this?

No idea really.  What was the reason for adding a new one?  The best
person to comment on this is probably Thomas.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-24  8:47       ` Christoph Hellwig
@ 2024-06-24  9:00         ` Daniel Wagner
  2024-06-25  6:37           ` Hannes Reinecke
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-24  9:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Frederic Weisbecker, Mel Gorman, Hannes Reinecke,
	Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme

On Mon, Jun 24, 2024 at 10:47:05AM GMT, Christoph Hellwig wrote:
> > Do you think we should introduce a new type or just use the existing
> > managed_irq for this?
> 
> No idea really.  What was the reason for adding a new one?

I've added the new type so that the current behavior of spreading the
queues over to the isolated CPUs is still possible. I don't know if this
is a valid use case or not. I just didn't want to kill this feature
without having discussed it before.

But if we agree this doesn't really make sense with isolcpus, then I
think we should use the managed_irq option, as nvme-pci is using the
managed IRQ API.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-24  9:00         ` Daniel Wagner
@ 2024-06-25  6:37           ` Hannes Reinecke
  2024-06-25  7:07             ` Thomas Gleixner
  0 siblings, 1 reply; 15+ messages in thread
From: Hannes Reinecke @ 2024-06-25  6:37 UTC (permalink / raw)
  To: Daniel Wagner, Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Thomas Gleixner,
	Frederic Weisbecker, Mel Gorman, Sridhar Balaraman, brookxu.cn,
	Ming Lei, linux-kernel, linux-block, linux-nvme

On 6/24/24 11:00, Daniel Wagner wrote:
> On Mon, Jun 24, 2024 at 10:47:05AM GMT, Christoph Hellwig wrote:
>>> Do you think we should introduce a new type or just use the existing
>>> managed_irq for this?
>>
>> No idea really.  What was the reason for adding a new one?
> 
> I've added the new type so that the current behavior of spreading the
> queues over to the isolated CPUs is still possible. I don't know if this
> a valid use case or not. I just didn't wanted to kill this feature it
> without having discussed it before.
> 
> But if we agree this doesn't really makes sense with isolcpus, then I
> think we should use the managed_irq one as nvme-pci is using the managed
> IRQ API.
> 
I'm in favour of expanding/modifying the managed irq case.
For managed irqs the driver will be running on the housekeeping CPUs
only, and has no way of even installing irq handlers for the isolated CPUs.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-25  6:37           ` Hannes Reinecke
@ 2024-06-25  7:07             ` Thomas Gleixner
  2024-06-25  8:57               ` Daniel Wagner
  0 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2024-06-25  7:07 UTC (permalink / raw)
  To: Hannes Reinecke, Daniel Wagner, Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, Frederic Weisbecker,
	Mel Gorman, Sridhar Balaraman, brookxu.cn, Ming Lei, linux-kernel,
	linux-block, linux-nvme

On Tue, Jun 25 2024 at 08:37, Hannes Reinecke wrote:
> On 6/24/24 11:00, Daniel Wagner wrote:
>> On Mon, Jun 24, 2024 at 10:47:05AM GMT, Christoph Hellwig wrote:
>>>> Do you think we should introduce a new type or just use the existing
>>>> managed_irq for this?
>>>
>>> No idea really.  What was the reason for adding a new one?
>> 
>> I've added the new type so that the current behavior of spreading the
>> queues over to the isolated CPUs is still possible. I don't know if this
>> is a valid use case or not. I just didn't want to kill this feature
>> without having discussed it before.
>> 
>> But if we agree this doesn't really make sense with isolcpus, then I
>> think we should use the managed_irq option, as nvme-pci is using the
>> managed IRQ API.
>> 
> I'm in favour of expanding/modifying the managed irq case.
> For managed irqs the driver will be running on the housekeeping CPUs
> only, and has no way of even installing irq handlers for the isolated CPUs.

Yes, that's preferred, but please double check with the people who
introduced that in the first place.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-25  7:07             ` Thomas Gleixner
@ 2024-06-25  8:57               ` Daniel Wagner
  2024-06-30 13:47                 ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Wagner @ 2024-06-25  8:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Frederic Weisbecker, Mel Gorman, Sridhar Balaraman,
	brookxu.cn, linux-kernel, linux-block, linux-nvme,
	Thomas Gleixner

On Tue, Jun 25, 2024 at 09:07:30AM GMT, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 08:37, Hannes Reinecke wrote:
> > On 6/24/24 11:00, Daniel Wagner wrote:
> >> On Mon, Jun 24, 2024 at 10:47:05AM GMT, Christoph Hellwig wrote:
> >>>> Do you think we should introduce a new type or just use the existing
> >>>> managed_irq for this?
> >>>
> >>> No idea really.  What was the reason for adding a new one?
> >> 
> >> I've added the new type so that the current behavior of spreading the
> >> queues over to the isolated CPUs is still possible. I don't know if this
> >> is a valid use case or not. I just didn't want to kill this feature
> >> without having discussed it before.
> >> 
> >> But if we agree this doesn't really make sense with isolcpus, then I
> >> think we should use the managed_irq option, as nvme-pci is using the
> >> managed IRQ API.
> >> 
> > I'm in favour of expanding/modifying the managed irq case.
> > For managed irqs the driver will be running on the housekeeping CPUs
> > only, and has no way of even installing irq handlers for the isolated CPUs.
> 
> Yes, that's preferred, but please double check with the people who
> introduced that in the first place.

The relevant code was added by Ming:

11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed
interrupts")

    [...] it can happen that a managed interrupt whose affinity
    mask contains both isolated and housekeeping CPUs is routed to an isolated
    CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
    on the isolated CPU.

    Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
    logic in the interrupt affinity selection code.

    The subparameter indicates to the interrupt affinity selection logic that
    it should try to avoid the above scenario.
    [...]

From the commit message I read that the original intent is that managed_irq
should avoid spreading queues on isolated CPUs.

Ming, do you agree to use the managed_irq mask to limit the queue
spreading on isolated CPUs? It would make the io_queue option obsolete.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] sched/isolation: Add io_queue housekeeping option
  2024-06-25  8:57               ` Daniel Wagner
@ 2024-06-30 13:47                 ` Ming Lei
  0 siblings, 0 replies; 15+ messages in thread
From: Ming Lei @ 2024-06-30 13:47 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Hannes Reinecke, Christoph Hellwig, Jens Axboe, Keith Busch,
	Sagi Grimberg, Frederic Weisbecker, Mel Gorman, Sridhar Balaraman,
	brookxu.cn, linux-kernel, linux-block, linux-nvme,
	Thomas Gleixner

On Tue, Jun 25, 2024 at 10:57:42AM +0200, Daniel Wagner wrote:
> On Tue, Jun 25, 2024 at 09:07:30AM GMT, Thomas Gleixner wrote:
> > On Tue, Jun 25 2024 at 08:37, Hannes Reinecke wrote:
> > > On 6/24/24 11:00, Daniel Wagner wrote:
> > >> On Mon, Jun 24, 2024 at 10:47:05AM GMT, Christoph Hellwig wrote:
> > >>>> Do you think we should introduce a new type or just use the existing
> > >>>> managed_irq for this?
> > >>>
> > >>> No idea really.  What was the reason for adding a new one?
> > >> 
> > >> I've added the new type so that the current behavior of spreading the
> > >> queues over to the isolated CPUs is still possible. I don't know if this
> > >> is a valid use case or not. I just didn't want to kill this feature
> > >> without having discussed it before.
> > >> 
> > >> But if we agree this doesn't really make sense with isolcpus, then I
> > >> think we should use the managed_irq option, as nvme-pci is using the
> > >> managed IRQ API.
> > >> 
> > > I'm in favour of expanding/modifying the managed irq case.
> > > For managed irqs the driver will be running on the housekeeping CPUs
> > > only, and has no way of even installing irq handlers for the isolated CPUs.
> > 
> > Yes, that's preferred, but please double check with the people who
> > introduced that in the first place.
> 
> The relevant code was added by Ming:
> 
> 11ea68f553e2 ("genirq, sched/isolation: Isolate from handling managed
> interrupts")
> 
>     [...] it can happen that a managed interrupt whose affinity
>     mask contains both isolated and housekeeping CPUs is routed to an isolated
>     CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
>     on the isolated CPU.
> 
>     Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
>     logic in the interrupt affinity selection code.
> 
>     The subparameter indicates to the interrupt affinity selection logic that
>     it should try to avoid the above scenario.
>     [...]
> 
> From the commit message I read that the original intent is that managed_irq
> should avoid spreading queues on isolated CPUs.
> 
> Ming, do you agree to use the managed_irq mask to limit the queue
> spreading on isolated CPUs? It would make the io_queue option obsolete.

Yes, managed_irq was introduced to avoid handling managed interrupts on
isolated CPUs, and it is supposed to work well.

The only problem with managed_irq is that the queues are still spread over
the isolated CPUs, but those CPUs are excluded from the irq effective
affinity masks.
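
For reference, that is the behaviour you get today with something like
(example CPU list only):

    isolcpus=managed_irq,2-3,6-7

The queue affinity masks still include CPUs 2-3 and 6-7, but the effective
affinity of the interrupts is steered to the housekeeping CPUs whenever the
mask contains at least one of them.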


Thanks,
Ming


^ permalink raw reply	[flat|nested] 15+ messages in thread
