* [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation.
@ 2024-10-31 7:46 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
0 siblings, 2 replies; 9+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>
We found that on systems with a large number of devices, for example
256 NVMe block devices with 2 I/O queues each, a few dozen interrupts
cannot be allocated and the allocation fails with -ENOSPC.

The reason is that the I/O queue interrupts are set up as managed
interrupts (i.e., their affinity is managed by the kernel), which leads
to an excessive number of IRQ matrix bits being reserved.

This patch series adds support for limiting the number of managed
interrupts per node per allocation to address this issue.
Thanks,
Guanjun
Guanjun (2):
genirq/affinity: add support for limiting managed interrupts
genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out
in the affinity
.../admin-guide/kernel-parameters.txt | 12 ++++
block/blk-mq-cpumap.c | 2 +-
drivers/virtio/virtio_vdpa.c | 2 +-
fs/fuse/virtio_fs.c | 2 +-
include/linux/group_cpus.h | 2 +-
include/linux/irq.h | 2 +
kernel/cpu.c | 2 +-
kernel/irq/affinity.c | 11 ++--
kernel/irq/cpuhotplug.c | 51 +++++++++++++++++
lib/group_cpus.c | 55 ++++++++++++++++++-
10 files changed, 130 insertions(+), 11 deletions(-)
--
2.43.5
^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
@ 2024-10-31  7:46 ` 'Guanjun'
  2024-10-31 10:35   ` Thomas Gleixner
  2024-11-01  7:06   ` Jiri Slaby
  2024-10-31  7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
  1 sibling, 2 replies; 9+ messages in thread
From: 'Guanjun' @ 2024-10-31  7:46 UTC (permalink / raw)
  To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

From: Guanjun <guanjun@linux.alibaba.com>

Commit c410abbbacb9 ("genirq/affinity: Add is_managed to struct irq_affinity_desc")
introduced the is_managed bit in struct irq_affinity_desc. Because queue
interrupts are treated as managed interrupts, in scenarios where a large
number of devices are present (using massive MSI-X queue interrupts), an
excessive number of IRQ matrix bits (about num_online_cpus() * nvecs) are
reserved during interrupt allocation. This subsequently leads to a
situation where interrupts for some devices cannot be allocated at all.

Add support for limiting the number of managed interrupts on every node
per allocation.
Signed-off-by: Guanjun <guanjun@linux.alibaba.com> --- .../admin-guide/kernel-parameters.txt | 9 +++ block/blk-mq-cpumap.c | 2 +- drivers/virtio/virtio_vdpa.c | 2 +- fs/fuse/virtio_fs.c | 2 +- include/linux/group_cpus.h | 2 +- kernel/irq/affinity.c | 11 ++-- lib/group_cpus.c | 55 ++++++++++++++++++- 7 files changed, 73 insertions(+), 10 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9b61097a6448..ac80f35d04c9 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3238,6 +3238,15 @@ different yeeloong laptops. Example: machtype=lemote-yeeloong-2f-7inch + managed_irqs_per_node= + [KNL,SMP] Support for limiting the number of managed + interrupts on every node to prevent the case that + interrupts cannot be properly allocated where a large + number of devices are present. The default number is 0, + that means no limit to the number of managed irqs. + Format: integer between 0 and num_possible_cpus() / num_possible_nodes() + Default: 0 + maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel will bring up during bootup. maxcpus=n : n >= 0 limits the kernel to bring up 'n' processors. 
Surely after diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c index 9638b25fd521..481c81318e00 100644 --- a/block/blk-mq-cpumap.c +++ b/block/blk-mq-cpumap.c @@ -20,7 +20,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap) const struct cpumask *masks; unsigned int queue, cpu; - masks = group_cpus_evenly(qmap->nr_queues); + masks = group_cpus_evenly(qmap->nr_queues, true); if (!masks) { for_each_possible_cpu(cpu) qmap->mq_map[cpu] = qmap->queue_offset; diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c index 7364bd53e38d..cd303ac64046 100644 --- a/drivers/virtio/virtio_vdpa.c +++ b/drivers/virtio/virtio_vdpa.c @@ -330,7 +330,7 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) { unsigned int this_vecs = affd->set_size[i]; int j; - struct cpumask *result = group_cpus_evenly(this_vecs); + struct cpumask *result = group_cpus_evenly(this_vecs, true); if (!result) { kfree(masks); diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index f68527891929..41b3bcc03f9c 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -881,7 +881,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f return; fallback: /* Attempt to map evenly in groups over the CPUs */ - masks = group_cpus_evenly(fs->num_request_queues); + masks = group_cpus_evenly(fs->num_request_queues, true); /* If even this fails we default to all CPUs use queue zero */ if (!masks) { for_each_possible_cpu(cpu) diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h index e42807ec61f6..10a12b9a7ed4 100644 --- a/include/linux/group_cpus.h +++ b/include/linux/group_cpus.h @@ -9,6 +9,6 @@ #include <linux/kernel.h> #include <linux/cpu.h> -struct cpumask *group_cpus_evenly(unsigned int numgrps); +struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed); #endif diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 
44a4eba80315..775ab8537ddc 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -64,6 +64,10 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) for (curvec = 0; curvec < affd->pre_vectors; curvec++) cpumask_copy(&masks[curvec].mask, irq_default_affinity); + /* Mark the managed interrupts */ + for (i = curvec; i < nvecs - affd->post_vectors; i++) + masks[i].is_managed = 1; + /* * Spread on present CPUs starting from affd->pre_vectors. If we * have multiple sets, build each sets affinity mask separately. @@ -71,7 +75,8 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) { unsigned int this_vecs = affd->set_size[i]; int j; - struct cpumask *result = group_cpus_evenly(this_vecs); + struct cpumask *result = group_cpus_evenly(this_vecs, + masks[curvec].is_managed); if (!result) { kfree(masks); @@ -94,10 +99,6 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd) for (; curvec < nvecs; curvec++) cpumask_copy(&masks[curvec].mask, irq_default_affinity); - /* Mark the managed interrupts */ - for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++) - masks[i].is_managed = 1; - return masks; } diff --git a/lib/group_cpus.c b/lib/group_cpus.c index ee272c4cefcc..769a139491bc 100644 --- a/lib/group_cpus.c +++ b/lib/group_cpus.c @@ -11,6 +11,30 @@ #ifdef CONFIG_SMP +static unsigned int __read_mostly managed_irqs_per_node; +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = { + [0 ... 
MAX_NUMNODES-1] = {CPU_BITS_ALL} +}; + +static int __init irq_managed_setup(char *str) +{ + int ret; + + ret = kstrtouint(str, 10, &managed_irqs_per_node); + if (ret < 0) { + pr_warn("managed_irqs_per_node= cannot parse, ignored\n"); + return 0; + } + + if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) { + managed_irqs_per_node = num_possible_cpus() / num_possible_nodes(); + pr_warn("managed_irqs_per_node= cannot be larger than %u\n", + managed_irqs_per_node); + } + return 1; +} +__setup("managed_irqs_per_node=", irq_managed_setup); + static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk, unsigned int cpus_per_grp) { @@ -246,6 +270,30 @@ static void alloc_nodes_groups(unsigned int numgrps, } } +static void __group_prepare_affinity(struct cpumask *premask, + cpumask_var_t *node_to_cpumask) +{ + nodemask_t nodemsk = NODE_MASK_NONE; + unsigned int ncpus, n; + + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk); + + for_each_node_mask(n, nodemsk) { + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask); + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]); + + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]); + if (ncpus < managed_irqs_per_node) { + /* Reset node n to current node cpumask */ + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]); + continue; + } + + grp_spread_init_one(premask, &managed_irqs_cpumsk[n], managed_irqs_per_node); + } +} + + static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps, cpumask_var_t *node_to_cpumask, const struct cpumask *cpu_mask, @@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps, /** * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality * @numgrps: number of groups + * @is_managed: if these groups managed by kernel * * Return: cpumask array if successful, NULL otherwise. 
And each element * includes CPUs assigned to this group @@ -344,7 +393,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps, * We guarantee in the resulted grouping that all CPUs are covered, and * no same CPU is assigned to multiple groups */ -struct cpumask *group_cpus_evenly(unsigned int numgrps) +struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed) { unsigned int curgrp = 0, nr_present = 0, nr_others = 0; cpumask_var_t *node_to_cpumask; @@ -382,6 +431,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps) */ cpumask_copy(npresmsk, data_race(cpu_present_mask)); + /* Limit the count of managed interrupts on every node */ + if (is_managed && managed_irqs_per_node) + __group_prepare_affinity(npresmsk, node_to_cpumask); + /* grouping present CPUs first */ ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask, npresmsk, nmsk, masks); -- 2.43.5 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 10:35   ` Thomas Gleixner
  2024-10-31 10:50     ` Ming Lei
  2024-11-01  3:03     ` mapicccy
  1 sibling, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2024-10-31 10:35 UTC (permalink / raw)
  To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo, eperezma,
	vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt,
	bp, xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> +				     cpumask_var_t *node_to_cpumask)
> +{
> +	nodemask_t nodemsk = NODE_MASK_NONE;
> +	unsigned int ncpus, n;
> +
> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> +	for_each_node_mask(n, nodemsk) {
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);

How is this managed_irqs_cpumsk array protected against concurrency?

> +	ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> +	if (ncpus < managed_irqs_per_node) {
> +		/* Reset node n to current node cpumask */
> +		cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);

This whole logic is incomprehensible, and aside from the concurrency
problem it's broken when CPUs are made present at run-time because these
cpu masks are static and represent the stale state of the last
invocation.

Given the limitations of the x86 vector space, which is not going away
anytime soon, there are only two options IMO to handle such a scenario.

  1) Tell the nvme/block layer to disable queue affinity management

  2) Restrict the devices and queues to the nodes they sit on

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 2024-10-31 10:35 ` Thomas Gleixner @ 2024-10-31 10:50 ` Ming Lei [not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com> 2024-11-01 3:03 ` mapicccy 1 sibling, 1 reply; 9+ messages in thread From: Ming Lei @ 2024-10-31 10:50 UTC (permalink / raw) To: Thomas Gleixner, Christoph Hellwig Cc: Guanjun, corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote: > > #ifdef CONFIG_SMP > > > > +static unsigned int __read_mostly managed_irqs_per_node; > > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = { > > + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL} > > +}; > > > > +static void __group_prepare_affinity(struct cpumask *premask, > > + cpumask_var_t *node_to_cpumask) > > +{ > > + nodemask_t nodemsk = NODE_MASK_NONE; > > + unsigned int ncpus, n; > > + > > + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk); > > + > > + for_each_node_mask(n, nodemsk) { > > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask); > > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]); > > How is this managed_irqs_cpumsk array protected against concurrency? > > > + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]); > > + if (ncpus < managed_irqs_per_node) { > > + /* Reset node n to current node cpumask */ > > + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]); > > This whole logic is incomprehensible and aside of the concurrency > problem it's broken when CPUs are made present at run-time because these > cpu masks are static and represent the stale state of the last > invocation. 
> > Given the limitations of the x86 vector space, which is not going away > anytime soon, there are only two options IMO to handle such a scenario. > > 1) Tell the nvme/block layer to disable queue affinity management +1 There are other use cases, such as cpu isolation, which can benefit from this way too. https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/ Thanks, ^ permalink raw reply [flat|nested] 9+ messages in thread
[parent not found: <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>]
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts [not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com> @ 2024-11-01 3:34 ` Jason Wang 0 siblings, 0 replies; 9+ messages in thread From: Jason Wang @ 2024-11-01 3:34 UTC (permalink / raw) To: mapicccy Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, corbet, axboe, mst, xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel On Fri, Nov 1, 2024 at 11:12 AM mapicccy <guanjun@linux.alibaba.com> wrote: > > > > 2024年10月31日 18:50,Ming Lei <ming.lei@redhat.com> 写道: > > On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > > On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote: > > #ifdef CONFIG_SMP > > +static unsigned int __read_mostly managed_irqs_per_node; > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = { > + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL} > +}; > > +static void __group_prepare_affinity(struct cpumask *premask, > + cpumask_var_t *node_to_cpumask) > +{ > + nodemask_t nodemsk = NODE_MASK_NONE; > + unsigned int ncpus, n; > + > + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk); > + > + for_each_node_mask(n, nodemsk) { > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask); > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]); > > > How is this managed_irqs_cpumsk array protected against concurrency? 
> > + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]); > + if (ncpus < managed_irqs_per_node) { > + /* Reset node n to current node cpumask */ > + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]); > > > This whole logic is incomprehensible and aside of the concurrency > problem it's broken when CPUs are made present at run-time because these > cpu masks are static and represent the stale state of the last > invocation. > > Given the limitations of the x86 vector space, which is not going away > anytime soon, there are only two options IMO to handle such a scenario. > > 1) Tell the nvme/block layer to disable queue affinity management > > > +1 > > There are other use cases, such as cpu isolation, which can benefit from > this way too. > > https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/ > I wonder if we need to do the same for virtio-blk. > > Thanks for your reminder. However, in this link only modified the NVMe driver, > but there is the same issue in the virtio net driver as well. I guess you meant virtio-blk actually? > > Guanjun > > > Thanks, > Thanks ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 2024-10-31 10:35 ` Thomas Gleixner 2024-10-31 10:50 ` Ming Lei @ 2024-11-01 3:03 ` mapicccy 2024-11-01 23:37 ` Thomas Gleixner 1 sibling, 1 reply; 9+ messages in thread From: mapicccy @ 2024-11-01 3:03 UTC (permalink / raw) To: Thomas Gleixner Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel > 2024年10月31日 18:35,Thomas Gleixner <tglx@linutronix.de> 写道: > > On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote: >> #ifdef CONFIG_SMP >> >> +static unsigned int __read_mostly managed_irqs_per_node; >> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = { >> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL} >> +}; >> >> +static void __group_prepare_affinity(struct cpumask *premask, >> + cpumask_var_t *node_to_cpumask) >> +{ >> + nodemask_t nodemsk = NODE_MASK_NONE; >> + unsigned int ncpus, n; >> + >> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk); >> + >> + for_each_node_mask(n, nodemsk) { >> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask); >> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]); > > How is this managed_irqs_cpumsk array protected against concurrency? My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`, even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time. 
> >> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]); >> + if (ncpus < managed_irqs_per_node) { >> + /* Reset node n to current node cpumask */ >> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]); > > This whole logic is incomprehensible and aside of the concurrency > problem it's broken when CPUs are made present at run-time because these > cpu masks are static and represent the stale state of the last > invocation. Sorry, I realize there is indeed a logic issue here (caused by developing on 5.10 LTS and rebase to the latest linux-next). > > Given the limitations of the x86 vector space, which is not going away > anytime soon, there are only two options IMO to handle such a scenario. > > 1) Tell the nvme/block layer to disable queue affinity management > > 2) Restrict the devices and queues to the nodes they sit on I have tried fixing this issue through nvme driver, but later discovered that the same issue exists with virtio net. Therefore, I want to address this with a more general solution. Thanks, Guanjun > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-11-01  3:03     ` mapicccy
@ 2024-11-01 23:37       ` Thomas Gleixner
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2024-11-01 23:37 UTC (permalink / raw)
  To: mapicccy
  Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel

On Fri, Nov 01 2024 at 11:03, mapicccy wrote:
>> On October 31, 2024 at 18:35, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>>> +
>>> +	for_each_node_mask(n, nodemsk) {
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>>
>> How is this managed_irqs_cpumsk array protected against concurrency?
>
> My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`,
> even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time.

That may have been your intention, but how is this even remotely
correct? Aside from that, if it's intentional and you think it's correct
then you should have documented that in the code and also annotated it
to not trigger sanitizers.

>> Given the limitations of the x86 vector space, which is not going away
>> anytime soon, there are only two options IMO to handle such a scenario.
>>
>> 1) Tell the nvme/block layer to disable queue affinity management
>>
>> 2) Restrict the devices and queues to the nodes they sit on
>
> I have tried fixing this issue through nvme driver, but later
> discovered that the same issue exists with virtio net. Therefore, I
> want to address this with a more general solution.

I understand, but a general solution for this problem won't exist ever.
It's very reasonable to restrict this for one particular device type or subsystem while maintaining the strict managed property for others, no? General solutions are definitely preferred, but not for the price that they break existing completely correct and working setups. Which is what your 2/2 patch does for sure. Thanks, tglx ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun' 2024-10-31 10:35 ` Thomas Gleixner @ 2024-11-01 7:06 ` Jiri Slaby 1 sibling, 0 replies; 9+ messages in thread From: Jiri Slaby @ 2024-11-01 7:06 UTC (permalink / raw) To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel Hi, On 31. 10. 24, 8:46, 'Guanjun' wrote: > From: Guanjun <guanjun@linux.alibaba.com> > > Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc) > introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts > treated as managed interrupts, in scenarios where a large number of > devices are present (using massive msix queue interrupts), an excessive number > of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during > interrupt allocation. This sequently leads to the situation where interrupts > for some devices cannot be properly allocated. > > Support for limiting the number of managed interrupts on every node per allocation. > > Signed-off-by: Guanjun <guanjun@linux.alibaba.com> > --- > .../admin-guide/kernel-parameters.txt | 9 +++ > block/blk-mq-cpumap.c | 2 +- > drivers/virtio/virtio_vdpa.c | 2 +- > fs/fuse/virtio_fs.c | 2 +- > include/linux/group_cpus.h | 2 +- > kernel/irq/affinity.c | 11 ++-- > lib/group_cpus.c | 55 ++++++++++++++++++- > 7 files changed, 73 insertions(+), 10 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 9b61097a6448..ac80f35d04c9 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -3238,6 +3238,15 @@ > different yeeloong laptops. 
> Example: machtype=lemote-yeeloong-2f-7inch > > + managed_irqs_per_node= > + [KNL,SMP] Support for limiting the number of managed > + interrupts on every node to prevent the case that > + interrupts cannot be properly allocated where a large > + number of devices are present. The default number is 0, > + that means no limit to the number of managed irqs. > + Format: integer between 0 and num_possible_cpus() / num_possible_nodes() > + Default: 0 Kernel parameters suck. Esp. here you have to guess to even properly boot. Could this be auto-tuned instead? > --- a/lib/group_cpus.c > +++ b/lib/group_cpus.c > @@ -11,6 +11,30 @@ > > #ifdef CONFIG_SMP > > +static unsigned int __read_mostly managed_irqs_per_node; > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = { This is quite excessive. On SUSE configs, this is 8192 cpu bits * 1024 nodes = 1 M. For everyone. You have to allocate this dynamically instead. See e.g. setup_node_to_cpumask_map(). > + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL} > +}; > + > +static int __init irq_managed_setup(char *str) > +{ > + int ret; > + > + ret = kstrtouint(str, 10, &managed_irqs_per_node); > + if (ret < 0) { > + pr_warn("managed_irqs_per_node= cannot parse, ignored\n"); could not be parsed > + return 0; > + } > + > + if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) { > + managed_irqs_per_node = num_possible_cpus() / num_possible_nodes(); > + pr_warn("managed_irqs_per_node= cannot be larger than %u\n", > + managed_irqs_per_node); > + } > + return 1; > +} > +__setup("managed_irqs_per_node=", irq_managed_setup); > + > static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk, > unsigned int cpus_per_grp) > { ... 
> @@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps, > /** > * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality > * @numgrps: number of groups > + * @is_managed: if these groups managed by kernel are managed by the kernel > * > * Return: cpumask array if successful, NULL otherwise. And each element > * includes CPUs assigned to this group thanks, -- js suse labs ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun' 2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun' @ 2024-10-31 7:46 ` 'Guanjun' 1 sibling, 0 replies; 9+ messages in thread From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw) To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel Cc: guanjun From: Guanjun <guanjun@linux.alibaba.com> Once we limit the number of managed interrupts, if the last online CPU in the affinity goes offline, it will result in the interrupt becoming unavailable until one of the assigned CPUs comes online again. So prevent the last online CPU in the affinity from going offline, and return -EBUSY in this situation. Signed-off-by: Guanjun <guanjun@linux.alibaba.com> --- .../admin-guide/kernel-parameters.txt | 3 ++ include/linux/irq.h | 2 + kernel/cpu.c | 2 +- kernel/irq/cpuhotplug.c | 51 +++++++++++++++++++ 4 files changed, 57 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index ac80f35d04c9..173598cbf4a6 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3244,6 +3244,9 @@ interrupts cannot be properly allocated where a large number of devices are present. The default number is 0, that means no limit to the number of managed irqs. + Once the number of managed interrupts is limited, taking + the last online CPU in the affinity offline will fail with + the error code -EBUSY.
Format: integer between 0 and num_possible_cpus() / num_possible_nodes() Default: 0 diff --git a/include/linux/irq.h b/include/linux/irq.h index fa711f80957b..68ce05a74079 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -615,8 +615,10 @@ extern int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info); #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_IRQ_MIGRATION) extern void irq_migrate_all_off_this_cpu(void); extern int irq_affinity_online_cpu(unsigned int cpu); +extern int irq_affinity_offline_cpu(unsigned int cpu); #else # define irq_affinity_online_cpu NULL +# define irq_affinity_offline_cpu NULL #endif #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ) diff --git a/kernel/cpu.c b/kernel/cpu.c index c4aaf73dec9e..672d920970b2 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -2219,7 +2219,7 @@ static struct cpuhp_step cpuhp_hp_states[] = { [CPUHP_AP_IRQ_AFFINITY_ONLINE] = { .name = "irq/affinity:online", .startup.single = irq_affinity_online_cpu, - .teardown.single = NULL, + .teardown.single = irq_affinity_offline_cpu, }, [CPUHP_AP_PERF_ONLINE] = { .name = "perf:online", diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c index 15a7654eff68..e6f068198e4a 100644 --- a/kernel/irq/cpuhotplug.c +++ b/kernel/irq/cpuhotplug.c @@ -232,6 +232,31 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu) irq_set_affinity_locked(data, affinity, false); } +static int irq_check_affinity_of_irq(struct irq_desc *desc, unsigned int cpu) +{ + struct irq_data *data = irq_desc_get_irq_data(desc); + const struct cpumask *affinity = irq_data_get_affinity_mask(data); + unsigned int cur; + + if (!irqd_affinity_is_managed(data) || !desc->action || + !irq_data_get_irq_chip(data) || !cpumask_test_cpu(cpu, affinity)) + return 0; + + for_each_cpu(cur, affinity) + if (cur != cpu && cpumask_test_cpu(cur, cpu_online_mask)) + return 0; + + /* + * If the CPU going offline is the last one in the affinity, + * the managed
interrupts will be unavailable until one of + * the assigned CPUs comes online. To prevent this unavailability, + * return -EBUSY directly in this case. + */ + pr_warn("Affinity %*pbl of managed IRQ%u contains only one CPU%u that online\n", + cpumask_pr_args(affinity), data->irq, cpu); + return -EBUSY; +} + /** * irq_affinity_online_cpu - Restore affinity for managed interrupts * @cpu: Upcoming CPU for which interrupts should be restored @@ -252,3 +277,29 @@ int irq_affinity_online_cpu(unsigned int cpu) return 0; } + +/** + * irq_affinity_offline_cpu - Check affinity for managed interrupts + * to prevent the unavailability caused by taking the last CPU in the + * affinity offline. + * @cpu: Upcoming CPU for which interrupts should be checked + */ +int irq_affinity_offline_cpu(unsigned int cpu) +{ + struct irq_desc *desc; + unsigned int irq; + int ret = 0; + + irq_lock_sparse(); + for_each_active_irq(irq) { + desc = irq_to_desc(irq); + raw_spin_lock_irq(&desc->lock); + ret = irq_check_affinity_of_irq(desc, cpu); + raw_spin_unlock_irq(&desc->lock); + if (ret < 0) + break; + } + irq_unlock_sparse(); + + return ret; +} -- 2.43.5 ^ permalink raw reply related [flat|nested] 9+ messages in thread