[PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation.


* [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation.
@ 2024-10-31  7:46 'Guanjun'
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
  2024-10-31  7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
  0 siblings, 2 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31  7:46 UTC (permalink / raw)
  To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

From: Guanjun <guanjun@linux.alibaba.com>

We found that on systems with a large number of devices, for example
256 NVMe block devices with 2 I/O queues each, a few dozen interrupts
cannot be allocated and the allocation fails with -ENOSPC. The reason is
that the I/O queue interrupts are set up as managed interrupts (i.e.,
their affinity is managed by the kernel), which leads to an excessive
number of IRQ matrix bits being reserved.

This patch series adds support for limiting the number of managed
interrupts per allocation to address this issue.
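
As a rough illustration of why the allocations fail, the sketch below counts how many vectors a single CPU has to set aside when every managed queue interrupt reserves one vector on each CPU in its affinity mask. It assumes that the queue masks of one device split the CPUs into equal contiguous ranges, which is simpler than the real NUMA-aware spreading. The figure of 200 allocatable vectors per CPU is an assumption made for the example, not a value taken from this series, and all names are hypothetical.

```c
#include <stdbool.h>

/*
 * Assumed round number of vectors one CPU can hand out to devices.
 * The real value depends on the architecture and kernel configuration.
 */
#define ASSUMED_VECTORS_PER_CPU 200u

/*
 * Count the managed reservations that land on 'cpu' when each of
 * 'nr_devices' devices spreads 'queues_per_dev' queues over 'nr_cpus'
 * CPUs. Queue q is assumed to cover the contiguous CPU range
 * [q * nr_cpus / queues_per_dev, (q + 1) * nr_cpus / queues_per_dev).
 */
static unsigned int reservations_on_cpu(unsigned int cpu,
					unsigned int nr_cpus,
					unsigned int nr_devices,
					unsigned int queues_per_dev)
{
	unsigned int dev, q, count = 0;

	for (dev = 0; dev < nr_devices; dev++) {
		for (q = 0; q < queues_per_dev; q++) {
			unsigned int first = q * nr_cpus / queues_per_dev;
			unsigned int end = (q + 1) * nr_cpus / queues_per_dev;

			/* A managed interrupt reserves a vector on every CPU in its mask. */
			if (cpu >= first && cpu < end)
				count++;
		}
	}
	return count;
}

/* True when at least one CPU needs more vectors than assumed to be available. */
static bool vectors_exhausted(unsigned int nr_cpus,
			      unsigned int nr_devices,
			      unsigned int queues_per_dev)
{
	unsigned int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (reservations_on_cpu(cpu, nr_cpus, nr_devices,
					queues_per_dev) > ASSUMED_VECTORS_PER_CPU)
			return true;
	}
	return false;
}
```

Under these assumptions every CPU appears in exactly one queue mask per device, so 256 devices put 256 reservations on each CPU, which is more than the assumed budget regardless of how many CPUs the system has.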

Thanks,
Guanjun


Guanjun (2):
  genirq/affinity: add support for limiting managed interrupts
  genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out
    in the affinity

 .../admin-guide/kernel-parameters.txt         | 12 ++++
 block/blk-mq-cpumap.c                         |  2 +-
 drivers/virtio/virtio_vdpa.c                  |  2 +-
 fs/fuse/virtio_fs.c                           |  2 +-
 include/linux/group_cpus.h                    |  2 +-
 include/linux/irq.h                           |  2 +
 kernel/cpu.c                                  |  2 +-
 kernel/irq/affinity.c                         | 11 ++--
 kernel/irq/cpuhotplug.c                       | 51 +++++++++++++++++
 lib/group_cpus.c                              | 55 ++++++++++++++++++-
 10 files changed, 130 insertions(+), 11 deletions(-)

-- 
2.43.5



* [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
@ 2024-10-31  7:46 ` 'Guanjun'
  2024-10-31 10:35   ` Thomas Gleixner
                     ` (3 more replies)
  2024-10-31  7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
  1 sibling, 4 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31  7:46 UTC (permalink / raw)
  To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

From: Guanjun <guanjun@linux.alibaba.com>

Commit c410abbbacb9 ("genirq/affinity: Add is_managed to struct irq_affinity_desc")
introduced the is_managed bit in struct irq_affinity_desc. Because queue
interrupts are treated as managed interrupts, in scenarios where a large
number of devices are present (using a massive number of MSI-X queue
interrupts), an excessive number of IRQ matrix bits (about
num_online_cpus() * nvecs) are reserved during interrupt allocation. This
subsequently leads to a situation where interrupts for some devices cannot
be allocated.

Add support for limiting the number of managed interrupts on every node per
allocation.

Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
 .../admin-guide/kernel-parameters.txt         |  9 +++
 block/blk-mq-cpumap.c                         |  2 +-
 drivers/virtio/virtio_vdpa.c                  |  2 +-
 fs/fuse/virtio_fs.c                           |  2 +-
 include/linux/group_cpus.h                    |  2 +-
 kernel/irq/affinity.c                         | 11 ++--
 lib/group_cpus.c                              | 55 ++++++++++++++++++-
 7 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b61097a6448..ac80f35d04c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3238,6 +3238,15 @@
 			different yeeloong laptops.
 			Example: machtype=lemote-yeeloong-2f-7inch
 
+	managed_irqs_per_node=
+			[KNL,SMP] Support for limiting the number of managed
+			interrupts on every node to prevent the case that
+			interrupts cannot be properly allocated where a large
+			number of devices are present. The default number is 0,
+			that means no limit to the number of managed irqs.
+			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
+			Default: 0
+
 	maxcpus=	[SMP,EARLY] Maximum number of processors that an SMP kernel
 			will bring up during bootup.  maxcpus=n : n >= 0 limits
 			the kernel to bring up 'n' processors. Surely after
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9638b25fd521..481c81318e00 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -20,7 +20,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 	const struct cpumask *masks;
 	unsigned int queue, cpu;
 
-	masks = group_cpus_evenly(qmap->nr_queues);
+	masks = group_cpus_evenly(qmap->nr_queues, true);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
 			qmap->mq_map[cpu] = qmap->queue_offset;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 7364bd53e38d..cd303ac64046 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -330,7 +330,7 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs, true);
 
 		if (!result) {
 			kfree(masks);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f68527891929..41b3bcc03f9c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -881,7 +881,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 	return;
 fallback:
 	/* Attempt to map evenly in groups over the CPUs */
-	masks = group_cpus_evenly(fs->num_request_queues);
+	masks = group_cpus_evenly(fs->num_request_queues, true);
 	/* If even this fails we default to all CPUs use queue zero */
 	if (!masks) {
 		for_each_possible_cpu(cpu)
diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6..10a12b9a7ed4 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,6 @@
 #include <linux/kernel.h>
 #include <linux/cpu.h>
 
-struct cpumask *group_cpus_evenly(unsigned int numgrps);
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
 
 #endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315..775ab8537ddc 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -64,6 +64,10 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
+	/* Mark the managed interrupts */
+	for (i = curvec; i < nvecs - affd->post_vectors; i++)
+		masks[i].is_managed = 1;
+
 	/*
 	 * Spread on present CPUs starting from affd->pre_vectors. If we
 	 * have multiple sets, build each sets affinity mask separately.
@@ -71,7 +75,8 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs,
+				masks[curvec].is_managed);
 
 		if (!result) {
 			kfree(masks);
@@ -94,10 +99,6 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
-	/* Mark the managed interrupts */
-	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
-		masks[i].is_managed = 1;
-
 	return masks;
 }
 
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..769a139491bc 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -11,6 +11,30 @@
 
 #ifdef CONFIG_SMP
 
+static unsigned int __read_mostly managed_irqs_per_node;
+static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
+	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
+};
+
+static int __init irq_managed_setup(char *str)
+{
+	int ret;
+
+	ret = kstrtouint(str, 10, &managed_irqs_per_node);
+	if (ret < 0) {
+		pr_warn("managed_irqs_per_node= cannot parse, ignored\n");
+		return 0;
+	}
+
+	if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
+		managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
+		pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
+			managed_irqs_per_node);
+	}
+	return 1;
+}
+__setup("managed_irqs_per_node=", irq_managed_setup);
+
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
@@ -246,6 +270,30 @@ static void alloc_nodes_groups(unsigned int numgrps,
 	}
 }
 
+static void __group_prepare_affinity(struct cpumask *premask,
+				     cpumask_var_t *node_to_cpumask)
+{
+	nodemask_t nodemsk = NODE_MASK_NONE;
+	unsigned int ncpus, n;
+
+	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
+
+	for_each_node_mask(n, nodemsk) {
+		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
+		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
+
+		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
+		if (ncpus < managed_irqs_per_node) {
+			/* Reset node n to current node cpumask */
+			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
+			continue;
+		}
+
+		grp_spread_init_one(premask, &managed_irqs_cpumsk[n], managed_irqs_per_node);
+	}
+}
+
+
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			       cpumask_var_t *node_to_cpumask,
 			       const struct cpumask *cpu_mask,
@@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 /**
  * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
+ * @is_managed: if these groups managed by kernel
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
  * includes CPUs assigned to this group
@@ -344,7 +393,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
@@ -382,6 +431,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	 */
 	cpumask_copy(npresmsk, data_race(cpu_present_mask));
 
+	/* Limit the count of managed interrupts on every node */
+	if (is_managed && managed_irqs_per_node)
+		__group_prepare_affinity(npresmsk, node_to_cpumask);
+
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
-- 
2.43.5



* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 10:35   ` Thomas Gleixner
  2024-10-31 10:50     ` Ming Lei
  2024-11-01  3:03     ` mapicccy
  2024-11-01  7:06   ` Jiri Slaby
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-10-31 10:35 UTC (permalink / raw)
  To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo,
	eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth,
	rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>  #ifdef CONFIG_SMP
>  
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>  
> +static void __group_prepare_affinity(struct cpumask *premask,
> +				     cpumask_var_t *node_to_cpumask)
> +{
> +	nodemask_t nodemsk = NODE_MASK_NONE;
> +	unsigned int ncpus, n;
> +
> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> +	for_each_node_mask(n, nodemsk) {
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);

How is this managed_irqs_cpumsk array protected against concurrency?

> +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> +		if (ncpus < managed_irqs_per_node) {
> +			/* Reset node n to current node cpumask */
> +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);

This whole logic is incomprehensible and aside of the concurrency
problem it's broken when CPUs are made present at run-time because these
cpu masks are static and represent the stale state of the last
invocation.
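
One way to sidestep both the concurrency question and the stale-state problem described above is to keep no static mask at all and to derive the restricted mask from the caller's current view on every invocation. The following is a userspace sketch of that idea, not kernel code and not what the patch does: a `uint64_t` stands in for `struct cpumask`, and `limit_cpus_per_node()` and its parameters are hypothetical names.

```c
#include <stdint.h>

/*
 * Keep at most 'limit' CPUs of each node from 'present'.
 * node_masks[n] holds the CPUs that belong to node n. All state lives in
 * the arguments and in locals, so concurrent callers cannot interfere
 * with each other, and a CPU that became present since the previous call
 * is simply part of the new 'present' value.
 * A limit of 0 means "no limit".
 */
static uint64_t limit_cpus_per_node(uint64_t present,
				    const uint64_t *node_masks,
				    unsigned int nr_nodes,
				    unsigned int limit)
{
	uint64_t result = 0;
	unsigned int n;

	if (!limit)
		return present;

	for (n = 0; n < nr_nodes; n++) {
		uint64_t candidates = present & node_masks[n];
		unsigned int taken = 0;

		while (candidates && taken < limit) {
			/* Isolate the lowest set bit, i.e. the lowest-numbered CPU. */
			uint64_t lowest = candidates & (~candidates + 1);

			result |= lowest;
			candidates &= ~lowest;
			taken++;
		}
	}
	return result;
}
```

Always choosing the lowest-numbered CPUs concentrates the interrupts on the same CPUs for every allocation. Rotating the selection between allocations, which the static array appears to be aiming for, needs shared state and therefore explicit locking.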

Given the limitations of the x86 vector space, which is not going away
anytime soon, there are only two options IMO to handle such a scenario.

   1) Tell the nvme/block layer to disable queue affinity management

   2) Restrict the devices and queues to the nodes they sit on

Thanks,

        tglx


* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31 10:35   ` Thomas Gleixner
@ 2024-10-31 10:50     ` Ming Lei
       [not found]       ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
  2024-11-01  3:03     ` mapicccy
  1 sibling, 1 reply; 11+ messages in thread
From: Ming Lei @ 2024-10-31 10:50 UTC (permalink / raw)
  To: Thomas Gleixner, Christoph Hellwig
  Cc: Guanjun, corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel

On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> >  #ifdef CONFIG_SMP
> >
> > +static unsigned int __read_mostly managed_irqs_per_node;
> > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> > +     [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> > +};
> >
> > +static void __group_prepare_affinity(struct cpumask *premask,
> > +                                  cpumask_var_t *node_to_cpumask)
> > +{
> > +     nodemask_t nodemsk = NODE_MASK_NONE;
> > +     unsigned int ncpus, n;
> > +
> > +     get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> > +
> > +     for_each_node_mask(n, nodemsk) {
> > +             cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> > +             cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> > +             ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> > +             if (ncpus < managed_irqs_per_node) {
> > +                     /* Reset node n to current node cpumask */
> > +                     cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
>    1) Tell the nvme/block layer to disable queue affinity management

+1

There are other use cases, such as cpu isolation, which can benefit from
this way too.

https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/

Thanks,



[parent not found: <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>]

* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
       [not found]       ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
@ 2024-11-01  3:34         ` Jason Wang
  0 siblings, 0 replies; 11+ messages in thread
From: Jason Wang @ 2024-11-01  3:34 UTC (permalink / raw)
  To: mapicccy
  Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, corbet, axboe, mst,
	xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm,
	paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc,
	linux-kernel, linux-block, virtualization, linux-fsdevel

On Fri, Nov 1, 2024 at 11:12 AM mapicccy <guanjun@linux.alibaba.com> wrote:
>
>
>
> On Oct 31, 2024, at 18:50, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> +     [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> +                                  cpumask_var_t *node_to_cpumask)
> +{
> +     nodemask_t nodemsk = NODE_MASK_NONE;
> +     unsigned int ncpus, n;
> +
> +     get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> +     for_each_node_mask(n, nodemsk) {
> +             cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> +             cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> +             ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> +             if (ncpus < managed_irqs_per_node) {
> +                     /* Reset node n to current node cpumask */
> +                     cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
>   1) Tell the nvme/block layer to disable queue affinity management
>
>
> +1
>
> There are other use cases, such as cpu isolation, which can benefit from
> this way too.
>
> https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/
>

I wonder if we need to do the same for virtio-blk.

>
> Thanks for the reminder. However, the series at this link only modifies the NVMe driver,
> but the same issue exists in the virtio net driver as well.

I guess you meant virtio-blk actually?

>
> Guanjun
>
>
> Thanks,
>

Thanks



* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31 10:35   ` Thomas Gleixner
  2024-10-31 10:50     ` Ming Lei
@ 2024-11-01  3:03     ` mapicccy
  2024-11-01 23:37       ` Thomas Gleixner
  1 sibling, 1 reply; 11+ messages in thread
From: mapicccy @ 2024-11-01  3:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel



> On Oct 31, 2024, at 18:35, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>> #ifdef CONFIG_SMP
>> 
>> +static unsigned int __read_mostly managed_irqs_per_node;
>> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
>> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
>> +};
>> 
>> +static void __group_prepare_affinity(struct cpumask *premask,
>> +				     cpumask_var_t *node_to_cpumask)
>> +{
>> +	nodemask_t nodemsk = NODE_MASK_NONE;
>> +	unsigned int ncpus, n;
>> +
>> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>> +
>> +	for_each_node_mask(n, nodemsk) {
>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
> 
> How is this managed_irqs_cpumsk array protected against concurrency?

My intention was to allocate up to `managed_irqs_per_node` CPU bits from `managed_irqs_cpumsk[n]`,
even if another task modifies some of the bits in `managed_irqs_cpumsk[n]` at the same time.

> 
>> +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
>> +		if (ncpus < managed_irqs_per_node) {
>> +			/* Reset node n to current node cpumask */
>> +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
> 
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.

Sorry, I realize there is indeed a logic issue here (caused by developing on 5.10 LTS and then rebasing onto the latest linux-next).

> 
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
> 
>   1) Tell the nvme/block layer to disable queue affinity management
> 
>   2) Restrict the devices and queues to the nodes they sit on

I tried fixing this issue in the nvme driver, but later discovered that the same issue exists with virtio net.
Therefore, I want to address this with a more general solution.

Thanks,
Guanjun

> 
> Thanks,
> 
>        tglx



* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-11-01  3:03     ` mapicccy
@ 2024-11-01 23:37       ` Thomas Gleixner
  0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-11-01 23:37 UTC (permalink / raw)
  To: mapicccy
  Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel

On Fri, Nov 01 2024 at 11:03, mapicccy wrote:
>> On Oct 31, 2024, at 18:35, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>>> +
>>> +	for_each_node_mask(n, nodemsk) {
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>> 
>> How is this managed_irqs_cpumsk array protected against concurrency?
>
> My intention was to allocate up to `managed_irqs_per_node` CPU bits from `managed_irqs_cpumsk[n]`,
> even if another task modifies some of the bits in `managed_irqs_cpumsk[n]` at the same time.

That may have been your intention, but how is this even remotely
correct?

Aside of that. If it's intentional and you think it's correct then you
should have documented that in the code and also annotated it to not
trigger sanitizers.

>> Given the limitations of the x86 vector space, which is not going away
>> anytime soon, there are only two options IMO to handle such a scenario.
>> 
>>   1) Tell the nvme/block layer to disable queue affinity management
>> 
>>   2) Restrict the devices and queues to the nodes they sit on
>
> I have tried fixing this issue through nvme driver, but later
> discovered that the same issue exists with virtio net.  Therefore, I
> want to address this with a more general solution.

I understand, but a general solution for this problem won't exist
ever.

It's very reasonable to restrict this for one particular device type or
subsystem while maintaining the strict managed property for others, no?

General solutions are definitely preferred, but not for the price that
they break existing completely correct and working setups. Which is what
your 2/2 patch does for sure.

Thanks,

        tglx


* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
  2024-10-31 10:35   ` Thomas Gleixner
@ 2024-11-01  7:06   ` Jiri Slaby
  2024-11-02 16:30   ` kernel test robot
  2024-11-02 16:41   ` kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: Jiri Slaby @ 2024-11-01  7:06 UTC (permalink / raw)
  To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo,
	eperezma, vgoyal, stefanha, miklos, tglx, peterz, akpm, paulmck,
	thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel,
	linux-block, virtualization, linux-fsdevel

Hi,

On 31. 10. 24, 8:46, 'Guanjun' wrote:
> From: Guanjun <guanjun@linux.alibaba.com>
> 
> Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
> introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts
> treated as managed interrupts, in scenarios where a large number of
> devices are present (using massive msix queue interrupts), an excessive number
> of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
> interrupt allocation. This sequently leads to the situation where interrupts
> for some devices cannot be properly allocated.
> 
> Support for limiting the number of managed interrupts on every node per allocation.
> 
> Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
> ---
>   .../admin-guide/kernel-parameters.txt         |  9 +++
>   block/blk-mq-cpumap.c                         |  2 +-
>   drivers/virtio/virtio_vdpa.c                  |  2 +-
>   fs/fuse/virtio_fs.c                           |  2 +-
>   include/linux/group_cpus.h                    |  2 +-
>   kernel/irq/affinity.c                         | 11 ++--
>   lib/group_cpus.c                              | 55 ++++++++++++++++++-
>   7 files changed, 73 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9b61097a6448..ac80f35d04c9 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3238,6 +3238,15 @@
>   			different yeeloong laptops.
>   			Example: machtype=lemote-yeeloong-2f-7inch
>   
> +	managed_irqs_per_node=
> +			[KNL,SMP] Support for limiting the number of managed
> +			interrupts on every node to prevent the case that
> +			interrupts cannot be properly allocated where a large
> +			number of devices are present. The default number is 0,
> +			that means no limit to the number of managed irqs.
> +			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
> +			Default: 0

Kernel parameters suck. Esp. here you have to guess to even properly 
boot. Could this be auto-tuned instead?

> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -11,6 +11,30 @@
>   
>   #ifdef CONFIG_SMP
>   
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {

This is quite excessive. On SUSE configs, this is 8192 CPU bits * 1024 
nodes = 1 MiB. For everyone. You have to allocate this dynamically 
instead. See e.g. setup_node_to_cpumask_map().
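
To put numbers on the size concern, the sketch below compares a statically sized array with one sized for the nodes that actually exist. It is plain userspace C with hypothetical names (`struct demo_cpumask`, `static_array_bytes()`, `alloc_node_masks()`); the 8192 and 1024 limits are the ones quoted above, and kernel code would use the kernel allocators rather than `calloc()`.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NR_CPUS      8192	/* CPU bits per mask, as quoted above */
#define DEMO_MAX_NUMNODES 1024	/* node limit, as quoted above */

/* Stand-in for struct cpumask: one bit per possible CPU. */
struct demo_cpumask {
	unsigned char bits[DEMO_NR_CPUS / 8];
};

/* Bytes consumed by a static array that covers every possible node. */
static size_t static_array_bytes(void)
{
	return sizeof(struct demo_cpumask) * DEMO_MAX_NUMNODES;
}

/*
 * Allocate zeroed masks only for the nodes that actually exist.
 * The caller releases the array with free().
 */
static struct demo_cpumask *alloc_node_masks(unsigned int nr_nodes)
{
	return calloc(nr_nodes, sizeof(struct demo_cpumask));
}
```

With these limits the static array costs 1 MiB on every system, while a two-node machine would need only 2 KiB when the array is sized at run time.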

> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
> +
> +static int __init irq_managed_setup(char *str)
> +{
> +	int ret;
> +
> +	ret = kstrtouint(str, 10, &managed_irqs_per_node);
> +	if (ret < 0) {
> +		pr_warn("managed_irqs_per_node= cannot parse, ignored\n");

could not be parsed

> +		return 0;
> +	}
> +
> +	if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
> +		managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
> +		pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
> +			managed_irqs_per_node);
> +	}
> +	return 1;
> +}
> +__setup("managed_irqs_per_node=", irq_managed_setup);
> +
>   static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>   				unsigned int cpus_per_grp)
>   {
...
> @@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
>   /**
>    * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
>    * @numgrps: number of groups
> + * @is_managed: if these groups managed by kernel

are managed by the kernel

>    *
>    * Return: cpumask array if successful, NULL otherwise. And each element
>    * includes CPUs assigned to this group

thanks,
-- 
js
suse labs



* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
  2024-10-31 10:35   ` Thomas Gleixner
  2024-11-01  7:06   ` Jiri Slaby
@ 2024-11-02 16:30   ` kernel test robot
  2024-11-02 16:41   ` kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-11-02 16:30 UTC (permalink / raw)
  To: 'Guanjun'; +Cc: oe-kbuild-all

Hi 'Guanjun',

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com
patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030019.Ho11gGvG-lkp@intel.com/

All errors (new ones prefixed by >>):

>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
     480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
         |                 ^~~~~~~~~~~~~~~~~
   In file included from lib/group_cpus.c:10:
   include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int,  bool)' {aka 'struct cpumask *(unsigned int,  _Bool)'}
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^~~~~~~~~~~~~~~~~
   In file included from include/linux/linkage.h:7,
                    from include/linux/kernel.h:18,
                    from lib/group_cpus.c:6:
   lib/group_cpus.c:492:19: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
     492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
         |                   ^~~~~~~~~~~~~~~~~
   include/linux/export.h:56:28: note: in definition of macro '__EXPORT_SYMBOL'
      56 |         extern typeof(sym) sym;                                 \
         |                            ^~~
   include/linux/export.h:69:41: note: in expansion of macro '_EXPORT_SYMBOL'
      69 | #define EXPORT_SYMBOL_GPL(sym)          _EXPORT_SYMBOL(sym, "GPL")
         |                                         ^~~~~~~~~~~~~~
   lib/group_cpus.c:492:1: note: in expansion of macro 'EXPORT_SYMBOL_GPL'
     492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
         | ^~~~~~~~~~~~~~~~~
   include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int,  bool)' {aka 'struct cpumask *(unsigned int,  _Bool)'}
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^~~~~~~~~~~~~~~~~


vim +480 lib/group_cpus.c

f7b3ea8cf72f3d Ming Lei    2022-12-27  379  
f7b3ea8cf72f3d Ming Lei    2022-12-27  380  /**
f7b3ea8cf72f3d Ming Lei    2022-12-27  381   * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei    2022-12-27  382   * @numgrps: number of groups
59cfa36232b76f Guanjun     2024-10-31  383   * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei    2022-12-27  384   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  385   * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei    2022-12-27  386   * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei    2022-12-27  387   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  388   * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei    2022-12-27  389   * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei    2022-12-27  390   *	1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei    2022-12-27  391   *	2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei    2022-12-27  392   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  393   * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei    2022-12-27  394   * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei    2022-12-27  395   */
59cfa36232b76f Guanjun     2024-10-31  396  struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei    2022-12-27  397  {
f7b3ea8cf72f3d Ming Lei    2022-12-27  398  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  399  	cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  400  	cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  401  	int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei    2022-12-27  402  	struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  403  
f7b3ea8cf72f3d Ming Lei    2022-12-27  404  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  405  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  406  
f7b3ea8cf72f3d Ming Lei    2022-12-27  407  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  408  		goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  409  
f7b3ea8cf72f3d Ming Lei    2022-12-27  410  	node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei    2022-12-27  411  	if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei    2022-12-27  412  		goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  413  
f7b3ea8cf72f3d Ming Lei    2022-12-27  414  	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei    2022-12-27  415  	if (!masks)
f7b3ea8cf72f3d Ming Lei    2022-12-27  416  		goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  417  
f7b3ea8cf72f3d Ming Lei    2022-12-27  418  	build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  419  
0263f92fadbb9d Ming Lei    2023-11-20  420  	/*
0263f92fadbb9d Ming Lei    2023-11-20  421  	 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei    2023-11-20  422  	 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei    2023-11-20  423  	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei    2023-11-20  424  	 * hotplug code.
0263f92fadbb9d Ming Lei    2023-11-20  425  	 *
0263f92fadbb9d Ming Lei    2023-11-20  426  	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei    2023-11-20  427  	 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei    2023-11-20  428  	 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei    2023-11-20  429  	 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei    2023-11-20  430  	 * optimization.
0263f92fadbb9d Ming Lei    2023-11-20  431  	 */
0263f92fadbb9d Ming Lei    2023-11-20  432  	cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei    2023-11-20  433  
59cfa36232b76f Guanjun     2024-10-31  434  	/* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun     2024-10-31  435  	if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun     2024-10-31  436  		__group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun     2024-10-31  437  
f7b3ea8cf72f3d Ming Lei    2022-12-27  438  	/* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei    2022-12-27  439  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei    2023-11-20  440  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  441  	if (ret < 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  442  		goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei    2022-12-27  443  	nr_present = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  444  
f7b3ea8cf72f3d Ming Lei    2022-12-27  445  	/*
f7b3ea8cf72f3d Ming Lei    2022-12-27  446  	 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei    2022-12-27  447  	 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei    2022-12-27  448  	 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei    2022-12-27  449  	 * allocated out groups.
f7b3ea8cf72f3d Ming Lei    2022-12-27  450  	 */
f7b3ea8cf72f3d Ming Lei    2022-12-27  451  	if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei    2022-12-27  452  		curgrp = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  453  	else
f7b3ea8cf72f3d Ming Lei    2022-12-27  454  		curgrp = nr_present;
0263f92fadbb9d Ming Lei    2023-11-20  455  	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  456  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei    2022-12-27  457  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  458  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  459  		nr_others = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  460  
f7b3ea8cf72f3d Ming Lei    2022-12-27  461   fail_build_affinity:
f7b3ea8cf72f3d Ming Lei    2022-12-27  462  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  463  		WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei    2022-12-27  464  
f7b3ea8cf72f3d Ming Lei    2022-12-27  465   fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei    2022-12-27  466  	free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  467  
f7b3ea8cf72f3d Ming Lei    2022-12-27  468   fail_npresmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  469  	free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  470  
f7b3ea8cf72f3d Ming Lei    2022-12-27  471   fail_nmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  472  	free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  473  	if (ret < 0) {
f7b3ea8cf72f3d Ming Lei    2022-12-27  474  		kfree(masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  475  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  476  	}
f7b3ea8cf72f3d Ming Lei    2022-12-27  477  	return masks;
f7b3ea8cf72f3d Ming Lei    2022-12-27  478  }
188a569658584e Ingo Molnar 2023-01-18  479  #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei    2022-12-27 @480  struct cpumask *group_cpus_evenly(unsigned int numgrps)
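
The error above says that the !CONFIG_SMP definition at line 480 still takes
one argument while the patched prototype in include/linux/group_cpus.h takes
two. A minimal user-space sketch of a stub whose parameter list matches the
prototype follows. It is an illustration only: the struct cpumask stand-in,
the calloc() call and the numgrps == 0 guard are assumptions made for the
sketch, and the real uniprocessor body is not shown in this report.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct cpumask: one word of CPU bits. */
struct cpumask {
	unsigned long bits[1];
};

/* Mirrors the patched prototype quoted in the compiler note above. */
struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);

/*
 * Uniprocessor variant. The parameter list now matches the prototype,
 * so the "conflicting types" error no longer triggers.
 */
struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
{
	struct cpumask *masks;

	(void)is_managed;	/* nothing to limit when there is a single CPU */

	if (numgrps == 0)	/* guard added for this sketch only */
		return NULL;

	masks = calloc(numgrps, sizeof(*masks));
	if (!masks)
		return NULL;

	masks[0].bits[0] = 1UL;	/* CPU 0 is assigned to the first group */
	return masks;
}
```

The unused parameter is kept on purpose: callers pass the same arguments
regardless of CONFIG_SMP, so both variants need the same signature.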

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
                     ` (2 preceding siblings ...)
  2024-11-02 16:30   ` kernel test robot
@ 2024-11-02 16:41   ` kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-11-02 16:41 UTC (permalink / raw)
  To: 'Guanjun'; +Cc: llvm, oe-kbuild-all

Hi 'Guanjun',

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
When submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824
base:   tip/irq/core
patch link:    https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com
patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030024.eIDPlX3p-lkp@intel.com/

All errors (new ones prefixed by >>):

>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'
     480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
         |                 ^
   include/linux/group_cpus.h:12:17: note: previous declaration is here
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^
   1 error generated.


vim +/group_cpus_evenly +480 lib/group_cpus.c

f7b3ea8cf72f3d Ming Lei    2022-12-27  379  
f7b3ea8cf72f3d Ming Lei    2022-12-27  380  /**
f7b3ea8cf72f3d Ming Lei    2022-12-27  381   * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei    2022-12-27  382   * @numgrps: number of groups
59cfa36232b76f Guanjun     2024-10-31  383   * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei    2022-12-27  384   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  385   * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei    2022-12-27  386   * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei    2022-12-27  387   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  388   * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei    2022-12-27  389   * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei    2022-12-27  390   *	1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei    2022-12-27  391   *	2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei    2022-12-27  392   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  393   * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei    2022-12-27  394   * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei    2022-12-27  395   */
59cfa36232b76f Guanjun     2024-10-31  396  struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei    2022-12-27  397  {
f7b3ea8cf72f3d Ming Lei    2022-12-27  398  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  399  	cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  400  	cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  401  	int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei    2022-12-27  402  	struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  403  
f7b3ea8cf72f3d Ming Lei    2022-12-27  404  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  405  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  406  
f7b3ea8cf72f3d Ming Lei    2022-12-27  407  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  408  		goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  409  
f7b3ea8cf72f3d Ming Lei    2022-12-27  410  	node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei    2022-12-27  411  	if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei    2022-12-27  412  		goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  413  
f7b3ea8cf72f3d Ming Lei    2022-12-27  414  	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei    2022-12-27  415  	if (!masks)
f7b3ea8cf72f3d Ming Lei    2022-12-27  416  		goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  417  
f7b3ea8cf72f3d Ming Lei    2022-12-27  418  	build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  419  
0263f92fadbb9d Ming Lei    2023-11-20  420  	/*
0263f92fadbb9d Ming Lei    2023-11-20  421  	 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei    2023-11-20  422  	 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei    2023-11-20  423  	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei    2023-11-20  424  	 * hotplug code.
0263f92fadbb9d Ming Lei    2023-11-20  425  	 *
0263f92fadbb9d Ming Lei    2023-11-20  426  	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei    2023-11-20  427  	 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei    2023-11-20  428  	 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei    2023-11-20  429  	 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei    2023-11-20  430  	 * optimization.
0263f92fadbb9d Ming Lei    2023-11-20  431  	 */
0263f92fadbb9d Ming Lei    2023-11-20  432  	cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei    2023-11-20  433  
59cfa36232b76f Guanjun     2024-10-31  434  	/* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun     2024-10-31  435  	if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun     2024-10-31  436  		__group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun     2024-10-31  437  
f7b3ea8cf72f3d Ming Lei    2022-12-27  438  	/* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei    2022-12-27  439  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei    2023-11-20  440  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  441  	if (ret < 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  442  		goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei    2022-12-27  443  	nr_present = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  444  
f7b3ea8cf72f3d Ming Lei    2022-12-27  445  	/*
f7b3ea8cf72f3d Ming Lei    2022-12-27  446  	 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei    2022-12-27  447  	 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei    2022-12-27  448  	 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei    2022-12-27  449  	 * allocated out groups.
f7b3ea8cf72f3d Ming Lei    2022-12-27  450  	 */
f7b3ea8cf72f3d Ming Lei    2022-12-27  451  	if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei    2022-12-27  452  		curgrp = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  453  	else
f7b3ea8cf72f3d Ming Lei    2022-12-27  454  		curgrp = nr_present;
0263f92fadbb9d Ming Lei    2023-11-20  455  	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  456  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei    2022-12-27  457  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  458  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  459  		nr_others = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  460  
f7b3ea8cf72f3d Ming Lei    2022-12-27  461   fail_build_affinity:
f7b3ea8cf72f3d Ming Lei    2022-12-27  462  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  463  		WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei    2022-12-27  464  
f7b3ea8cf72f3d Ming Lei    2022-12-27  465   fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei    2022-12-27  466  	free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  467  
f7b3ea8cf72f3d Ming Lei    2022-12-27  468   fail_npresmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  469  	free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  470  
f7b3ea8cf72f3d Ming Lei    2022-12-27  471   fail_nmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  472  	free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  473  	if (ret < 0) {
f7b3ea8cf72f3d Ming Lei    2022-12-27  474  		kfree(masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  475  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  476  	}
f7b3ea8cf72f3d Ming Lei    2022-12-27  477  	return masks;
f7b3ea8cf72f3d Ming Lei    2022-12-27  478  }
188a569658584e Ingo Molnar 2023-01-18  479  #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei    2022-12-27 @480  struct cpumask *group_cpus_evenly(unsigned int numgrps)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity
  2024-10-31  7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
  2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31  7:46 ` 'Guanjun'
  1 sibling, 0 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31  7:46 UTC (permalink / raw)
  To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
  Cc: guanjun

From: Guanjun <guanjun@linux.alibaba.com>

Once the number of managed interrupts is limited, taking the last online CPU
in an interrupt's affinity mask offline leaves that interrupt unavailable
until one of its assigned CPUs comes online again. Prevent the last online
CPU in the affinity mask from going offline by returning -EBUSY in this case.
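
The rule described above can be modelled in user space as follows. This is
a sketch under stated assumptions, not the kernel code: CPU masks are plain
64-bit words (so cpu must be below 64), and the names cpumask_model_t and
check_last_online_cpu() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Return 0 if @cpu may go offline without leaving the interrupt
 * unreachable, or -EBUSY if @cpu is the last online CPU in @affinity.
 */
int check_last_online_cpu(cpumask_model_t affinity,
			  cpumask_model_t online, unsigned int cpu)
{
	cpumask_model_t self = (cpumask_model_t)1 << cpu;

	/* The CPU is not in this interrupt's affinity: nothing to protect. */
	if (!(affinity & self))
		return 0;

	/* Some other CPU in the affinity is still online. */
	if (affinity & online & ~self)
		return 0;

	return -EBUSY;
}
```

With affinity 0x6 (CPUs 1 and 2) and online mask 0x3 (CPUs 0 and 1),
offlining CPU 1 is refused, while offlining CPU 0 is allowed because CPU 0
is not part of the affinity.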

Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
 .../admin-guide/kernel-parameters.txt         |  3 ++
 include/linux/irq.h                           |  2 +
 kernel/cpu.c                                  |  2 +-
 kernel/irq/cpuhotplug.c                       | 51 +++++++++++++++++++
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ac80f35d04c9..173598cbf4a6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3244,6 +3244,9 @@
 			interrupts cannot be properly allocated where a large
 			number of devices are present. The default number is 0,
 			that means no limit to the number of managed irqs.
+			Once the number of managed interrupts is limited, an
+			attempt to take the last online CPU in an affinity
+			mask offline fails with the error code -EBUSY.
 			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
 			Default: 0
 
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fa711f80957b..68ce05a74079 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -615,8 +615,10 @@ extern int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info);
 #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_IRQ_MIGRATION)
 extern void irq_migrate_all_off_this_cpu(void);
 extern int irq_affinity_online_cpu(unsigned int cpu);
+extern int irq_affinity_offline_cpu(unsigned int cpu);
 #else
 # define irq_affinity_online_cpu	NULL
+# define irq_affinity_offline_cpu	NULL
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index c4aaf73dec9e..672d920970b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2219,7 +2219,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
 		.name			= "irq/affinity:online",
 		.startup.single		= irq_affinity_online_cpu,
-		.teardown.single	= NULL,
+		.teardown.single	= irq_affinity_offline_cpu,
 	},
 	[CPUHP_AP_PERF_ONLINE] = {
 		.name			= "perf:online",
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 15a7654eff68..e6f068198e4a 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -232,6 +232,31 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
 		irq_set_affinity_locked(data, affinity, false);
 }
 
+static int irq_check_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
+{
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	const struct cpumask *affinity = irq_data_get_affinity_mask(data);
+	unsigned int cur;
+
+	if (!irqd_affinity_is_managed(data) || !desc->action ||
+	    !irq_data_get_irq_chip(data) || !cpumask_test_cpu(cpu, affinity))
+		return 0;
+
+	for_each_cpu(cur, affinity)
+		if (cur != cpu && cpumask_test_cpu(cur, cpu_online_mask))
+			return 0;
+
+	/*
+	 * If the CPU going offline is the last online CPU in the affinity,
+	 * the managed interrupts will be unavailable until one of
+	 * the assigned CPUs comes online. To prevent this unavailability,
+	 * return -EBUSY directly in this case.
+	 */
+	pr_warn("Affinity %*pbl of managed IRQ%u has CPU%u as its only online CPU\n",
+		cpumask_pr_args(affinity), data->irq, cpu);
+	return -EBUSY;
+}
+
 /**
  * irq_affinity_online_cpu - Restore affinity for managed interrupts
  * @cpu:	Upcoming CPU for which interrupts should be restored
@@ -252,3 +277,29 @@ int irq_affinity_online_cpu(unsigned int cpu)
 
 	return 0;
 }
+
+/**
+ * irq_affinity_offline_cpu - Check affinity for managed interrupts
+ * to prevent the unavailability caused by taking the last CPU in the
+ * affinity offline.
+ * @cpu:	CPU going offline for which interrupts should be checked
+ */
+int irq_affinity_offline_cpu(unsigned int cpu)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+	int ret = 0;
+
+	irq_lock_sparse();
+	for_each_active_irq(irq) {
+		desc = irq_to_desc(irq);
+		raw_spin_lock_irq(&desc->lock);
+		ret = irq_check_affinity_of_irq(desc, cpu);
+		raw_spin_unlock_irq(&desc->lock);
+		if (ret < 0)
+			break;
+	}
+	irq_unlock_sparse();
+
+	return ret;
+}
-- 
2.43.5



Thread overview: 11+ messages
2024-10-31  7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
2024-10-31  7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35   ` Thomas Gleixner
2024-10-31 10:50     ` Ming Lei
     [not found]       ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
2024-11-01  3:34         ` Jason Wang
2024-11-01  3:03     ` mapicccy
2024-11-01 23:37       ` Thomas Gleixner
2024-11-01  7:06   ` Jiri Slaby
2024-11-02 16:30   ` kernel test robot
2024-11-02 16:41   ` kernel test robot
2024-10-31  7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
