* [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation.
@ 2024-10-31 7:46 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
0 siblings, 2 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>
We found that in scenarios with a large number of devices on the system,
for example, 256 NVMe block devices, each with 2 I/O queues, about a few
dozen of interrupts cannot be allocated, and get the error code -ENOSPC.
The reason for this issue is that the io queue interrupts are set to managed
interrupts (i.e., affinity is managed by the kernel), which leads to a
excessive number of the IRQ matrix bits being reserved.
This patch series support for limiting the number of managed interrupt
per allocation to address this issue.
Thanks,
Guanjun
Guanjun (2):
genirq/affinity: add support for limiting managed interrupts
genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out
in the affinity
.../admin-guide/kernel-parameters.txt | 12 ++++
block/blk-mq-cpumap.c | 2 +-
drivers/virtio/virtio_vdpa.c | 2 +-
fs/fuse/virtio_fs.c | 2 +-
include/linux/group_cpus.h | 2 +-
include/linux/irq.h | 2 +
kernel/cpu.c | 2 +-
kernel/irq/affinity.c | 11 ++--
kernel/irq/cpuhotplug.c | 51 +++++++++++++++++
lib/group_cpus.c | 55 ++++++++++++++++++-
10 files changed, 130 insertions(+), 11 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
@ 2024-10-31 7:46 ` 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
` (3 more replies)
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
1 sibling, 4 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>
Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts
treated as managed interrupts, in scenarios where a large number of
devices are present (using massive msix queue interrupts), an excessive number
of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
interrupt allocation. This sequently leads to the situation where interrupts
for some devices cannot be properly allocated.
Support for limiting the number of managed interrupts on every node per allocation.
Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
.../admin-guide/kernel-parameters.txt | 9 +++
block/blk-mq-cpumap.c | 2 +-
drivers/virtio/virtio_vdpa.c | 2 +-
fs/fuse/virtio_fs.c | 2 +-
include/linux/group_cpus.h | 2 +-
kernel/irq/affinity.c | 11 ++--
lib/group_cpus.c | 55 ++++++++++++++++++-
7 files changed, 73 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b61097a6448..ac80f35d04c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3238,6 +3238,15 @@
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
+ managed_irqs_per_node=
+ [KNL,SMP] Support for limiting the number of managed
+ interrupts on every node to prevent the case that
+ interrupts cannot be properly allocated where a large
+ number of devices are present. The default number is 0,
+ that means no limit to the number of managed irqs.
+ Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
+ Default: 0
+
maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel
will bring up during bootup. maxcpus=n : n >= 0 limits
the kernel to bring up 'n' processors. Surely after
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9638b25fd521..481c81318e00 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -20,7 +20,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
const struct cpumask *masks;
unsigned int queue, cpu;
- masks = group_cpus_evenly(qmap->nr_queues);
+ masks = group_cpus_evenly(qmap->nr_queues, true);
if (!masks) {
for_each_possible_cpu(cpu)
qmap->mq_map[cpu] = qmap->queue_offset;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 7364bd53e38d..cd303ac64046 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -330,7 +330,7 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
unsigned int this_vecs = affd->set_size[i];
int j;
- struct cpumask *result = group_cpus_evenly(this_vecs);
+ struct cpumask *result = group_cpus_evenly(this_vecs, true);
if (!result) {
kfree(masks);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f68527891929..41b3bcc03f9c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -881,7 +881,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
return;
fallback:
/* Attempt to map evenly in groups over the CPUs */
- masks = group_cpus_evenly(fs->num_request_queues);
+ masks = group_cpus_evenly(fs->num_request_queues, true);
/* If even this fails we default to all CPUs use queue zero */
if (!masks) {
for_each_possible_cpu(cpu)
diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6..10a12b9a7ed4 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,6 @@
#include <linux/kernel.h>
#include <linux/cpu.h>
-struct cpumask *group_cpus_evenly(unsigned int numgrps);
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
#endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315..775ab8537ddc 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -64,6 +64,10 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
+ /* Mark the managed interrupts */
+ for (i = curvec; i < nvecs - affd->post_vectors; i++)
+ masks[i].is_managed = 1;
+
/*
* Spread on present CPUs starting from affd->pre_vectors. If we
* have multiple sets, build each sets affinity mask separately.
@@ -71,7 +75,8 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
unsigned int this_vecs = affd->set_size[i];
int j;
- struct cpumask *result = group_cpus_evenly(this_vecs);
+ struct cpumask *result = group_cpus_evenly(this_vecs,
+ masks[curvec].is_managed);
if (!result) {
kfree(masks);
@@ -94,10 +99,6 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
for (; curvec < nvecs; curvec++)
cpumask_copy(&masks[curvec].mask, irq_default_affinity);
- /* Mark the managed interrupts */
- for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
- masks[i].is_managed = 1;
-
return masks;
}
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..769a139491bc 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -11,6 +11,30 @@
#ifdef CONFIG_SMP
+static unsigned int __read_mostly managed_irqs_per_node;
+static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
+ [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
+};
+
+static int __init irq_managed_setup(char *str)
+{
+ int ret;
+
+ ret = kstrtouint(str, 10, &managed_irqs_per_node);
+ if (ret < 0) {
+ pr_warn("managed_irqs_per_node= cannot parse, ignored\n");
+ return 0;
+ }
+
+ if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
+ managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
+ pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
+ managed_irqs_per_node);
+ }
+ return 1;
+}
+__setup("managed_irqs_per_node=", irq_managed_setup);
+
static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
unsigned int cpus_per_grp)
{
@@ -246,6 +270,30 @@ static void alloc_nodes_groups(unsigned int numgrps,
}
}
+static void __group_prepare_affinity(struct cpumask *premask,
+ cpumask_var_t *node_to_cpumask)
+{
+ nodemask_t nodemsk = NODE_MASK_NONE;
+ unsigned int ncpus, n;
+
+ get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
+
+ for_each_node_mask(n, nodemsk) {
+ cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
+ cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
+
+ ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
+ if (ncpus < managed_irqs_per_node) {
+ /* Reset node n to current node cpumask */
+ cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
+ continue;
+ }
+
+ grp_spread_init_one(premask, &managed_irqs_cpumsk[n], managed_irqs_per_node);
+ }
+}
+
+
static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
cpumask_var_t *node_to_cpumask,
const struct cpumask *cpu_mask,
@@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
/**
* group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
* @numgrps: number of groups
+ * @is_managed: if these groups managed by kernel
*
* Return: cpumask array if successful, NULL otherwise. And each element
* includes CPUs assigned to this group
@@ -344,7 +393,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
* We guarantee in the resulted grouping that all CPUs are covered, and
* no same CPU is assigned to multiple groups
*/
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
{
unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
cpumask_var_t *node_to_cpumask;
@@ -382,6 +431,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
*/
cpumask_copy(npresmsk, data_race(cpu_present_mask));
+ /* Limit the count of managed interrupts on every node */
+ if (is_managed && managed_irqs_per_node)
+ __group_prepare_affinity(npresmsk, node_to_cpumask);
+
/* grouping present CPUs first */
ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
npresmsk, nmsk, masks);
--
2.43.5
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity
2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 7:46 ` 'Guanjun'
1 sibling, 0 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>
Once we limit the number of managed interrupts, if the last online CPU in
the affinity goes offline, it will result in the interrupt becoming unavailable
util one of the assigned CPUs comes online again. So prevent the last online
CPU in the affinity from going offline, and return -EBUSY in this situation.
Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
.../admin-guide/kernel-parameters.txt | 3 ++
include/linux/irq.h | 2 +
kernel/cpu.c | 2 +-
kernel/irq/cpuhotplug.c | 51 +++++++++++++++++++
4 files changed, 57 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ac80f35d04c9..173598cbf4a6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3244,6 +3244,9 @@
interrupts cannot be properly allocated where a large
number of devices are present. The default number is 0,
that means no limit to the number of managed irqs.
+ Once we limit the number of managed interrupts, the last
+ online CPU in the affinity goes offline will fail with
+ the error code -EBUSY.
Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
Default: 0
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fa711f80957b..68ce05a74079 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -615,8 +615,10 @@ extern int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info);
#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_IRQ_MIGRATION)
extern void irq_migrate_all_off_this_cpu(void);
extern int irq_affinity_online_cpu(unsigned int cpu);
+extern int irq_affinity_offline_cpu(unsigned int cpu);
#else
# define irq_affinity_online_cpu NULL
+# define irq_affinity_offline_cpu NULL
#endif
#if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index c4aaf73dec9e..672d920970b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2219,7 +2219,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
.name = "irq/affinity:online",
.startup.single = irq_affinity_online_cpu,
- .teardown.single = NULL,
+ .teardown.single = irq_affinity_offline_cpu,
},
[CPUHP_AP_PERF_ONLINE] = {
.name = "perf:online",
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 15a7654eff68..e6f068198e4a 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -232,6 +232,31 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
irq_set_affinity_locked(data, affinity, false);
}
+static int irq_check_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
+{
+ struct irq_data *data = irq_desc_get_irq_data(desc);
+ const struct cpumask *affinity = irq_data_get_affinity_mask(data);
+ unsigned int cur;
+
+ if (!irqd_affinity_is_managed(data) || !desc->action ||
+ !irq_data_get_irq_chip(data) || !cpumask_test_cpu(cpu, affinity))
+ return 0;
+
+ for_each_cpu(cur, affinity)
+ if (cur != cpu && cpumask_test_cpu(cur, cpu_online_mask))
+ return 0;
+
+ /*
+ * If the onging offline CPU is the last one in the affinity,
+ * the managed interrupts will be unavailable until one of
+ * the assigned CPUs comes online. To prevent this unavailability,
+ * return -EBUSY directly in this case.
+ */
+ pr_warn("Affinity %*pbl of managed IRQ%u contains only one CPU%u that online\n",
+ cpumask_pr_args(affinity), data->irq, cpu);
+ return -EBUSY;
+}
+
/**
* irq_affinity_online_cpu - Restore affinity for managed interrupts
* @cpu: Upcoming CPU for which interrupts should be restored
@@ -252,3 +277,29 @@ int irq_affinity_online_cpu(unsigned int cpu)
return 0;
}
+
+/**
+ * irq_affinity_offline_cpu - Check affinity for managed interrupts
+ * to prevent the unavailability caused by taking the last CPU in the
+ * affinity offline.
+ * @cpu: Upcoming CPU for which interrupts should be checked
+ */
+int irq_affinity_offline_cpu(unsigned int cpu)
+{
+ struct irq_desc *desc;
+ unsigned int irq;
+ int ret = 0;
+
+ irq_lock_sparse();
+ for_each_active_irq(irq) {
+ desc = irq_to_desc(irq);
+ raw_spin_lock_irq(&desc->lock);
+ ret = irq_check_affinity_of_irq(desc, cpu);
+ raw_spin_unlock_irq(&desc->lock);
+ if (ret < 0)
+ break;
+ }
+ irq_unlock_sparse();
+
+ return ret;
+}
--
2.43.5
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
2024-11-01 3:03 ` mapicccy
2024-11-01 7:06 ` Jiri Slaby
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-10-31 10:35 UTC (permalink / raw)
To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo,
eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth,
rostedt, bp, xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> + cpumask_var_t *node_to_cpumask)
> +{
> + nodemask_t nodemsk = NODE_MASK_NONE;
> + unsigned int ncpus, n;
> +
> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> + for_each_node_mask(n, nodemsk) {
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
How is this managed_irqs_cpumsk array protected against concurrency?
> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> + if (ncpus < managed_irqs_per_node) {
> + /* Reset node n to current node cpumask */
> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
This whole logic is incomprehensible and aside of the concurrency
problem it's broken when CPUs are made present at run-time because these
cpu masks are static and represent the stale state of the last
invocation.
Given the limitations of the x86 vector space, which is not going away
anytime soon, there are only two options IMO to handle such a scenario.
1) Tell the nvme/block layer to disable queue affinity management
2) Restrict the devices and queues to the nodes they sit on
Thanks,
tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 10:35 ` Thomas Gleixner
@ 2024-10-31 10:50 ` Ming Lei
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
2024-11-01 3:03 ` mapicccy
1 sibling, 1 reply; 11+ messages in thread
From: Ming Lei @ 2024-10-31 10:50 UTC (permalink / raw)
To: Thomas Gleixner, Christoph Hellwig
Cc: Guanjun, corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> > #ifdef CONFIG_SMP
> >
> > +static unsigned int __read_mostly managed_irqs_per_node;
> > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> > + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> > +};
> >
> > +static void __group_prepare_affinity(struct cpumask *premask,
> > + cpumask_var_t *node_to_cpumask)
> > +{
> > + nodemask_t nodemsk = NODE_MASK_NONE;
> > + unsigned int ncpus, n;
> > +
> > + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> > +
> > + for_each_node_mask(n, nodemsk) {
> > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> > + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> > + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> > + if (ncpus < managed_irqs_per_node) {
> > + /* Reset node n to current node cpumask */
> > + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management
+1
There are other use cases, such as cpu isolation, which can benefit from
this way too.
https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/
Thanks,
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
@ 2024-11-01 3:03 ` mapicccy
2024-11-01 23:37 ` Thomas Gleixner
1 sibling, 1 reply; 11+ messages in thread
From: mapicccy @ 2024-11-01 3:03 UTC (permalink / raw)
To: Thomas Gleixner
Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
> 2024年10月31日 18:35,Thomas Gleixner <tglx@linutronix.de> 写道:
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>> #ifdef CONFIG_SMP
>>
>> +static unsigned int __read_mostly managed_irqs_per_node;
>> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
>> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
>> +};
>>
>> +static void __group_prepare_affinity(struct cpumask *premask,
>> + cpumask_var_t *node_to_cpumask)
>> +{
>> + nodemask_t nodemsk = NODE_MASK_NONE;
>> + unsigned int ncpus, n;
>> +
>> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>> +
>> + for_each_node_mask(n, nodemsk) {
>> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?
My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`,
even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time.
>
>> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
>> + if (ncpus < managed_irqs_per_node) {
>> + /* Reset node n to current node cpumask */
>> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
Sorry, I realize there is indeed a logic issue here (caused by developing on 5.10 LTS and rebase to the latest linux-next).
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management
>
> 2) Restrict the devices and queues to the nodes they sit on
I have tried fixing this issue through nvme driver, but later discovered that the same issue exists with virtio net.
Therefore, I want to address this with a more general solution.
Thanks,
Guanjun
>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
@ 2024-11-01 3:34 ` Jason Wang
0 siblings, 0 replies; 11+ messages in thread
From: Jason Wang @ 2024-11-01 3:34 UTC (permalink / raw)
To: mapicccy
Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, corbet, axboe, mst,
xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm,
paulmck, thuth, rostedt, bp, xiongwei.song, linux-doc,
linux-kernel, linux-block, virtualization, linux-fsdevel
On Fri, Nov 1, 2024 at 11:12 AM mapicccy <guanjun@linux.alibaba.com> wrote:
>
>
>
> 2024年10月31日 18:50,Ming Lei <ming.lei@redhat.com> 写道:
>
> On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> + cpumask_var_t *node_to_cpumask)
> +{
> + nodemask_t nodemsk = NODE_MASK_NONE;
> + unsigned int ncpus, n;
> +
> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> + for_each_node_mask(n, nodemsk) {
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> + ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> + if (ncpus < managed_irqs_per_node) {
> + /* Reset node n to current node cpumask */
> + cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management
>
>
> +1
>
> There are other use cases, such as cpu isolation, which can benefit from
> this way too.
>
> https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/
>
I wonder if we need to do the same for virtio-blk.
>
> Thanks for your reminder. However, in this link only modified the NVMe driver,
> but there is the same issue in the virtio net driver as well.
I guess you meant virtio-blk actually?
>
> Guanjun
>
>
> Thanks,
>
Thanks
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
@ 2024-11-01 7:06 ` Jiri Slaby
2024-11-02 16:30 ` kernel test robot
2024-11-02 16:41 ` kernel test robot
3 siblings, 0 replies; 11+ messages in thread
From: Jiri Slaby @ 2024-11-01 7:06 UTC (permalink / raw)
To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo,
eperezma, vgoyal, stefanha, miklos, tglx, peterz, akpm, paulmck,
thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel,
linux-block, virtualization, linux-fsdevel
Hi,
On 31. 10. 24, 8:46, 'Guanjun' wrote:
> From: Guanjun <guanjun@linux.alibaba.com>
>
> Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
> introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts
> treated as managed interrupts, in scenarios where a large number of
> devices are present (using massive msix queue interrupts), an excessive number
> of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
> interrupt allocation. This sequently leads to the situation where interrupts
> for some devices cannot be properly allocated.
>
> Support for limiting the number of managed interrupts on every node per allocation.
>
> Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
> ---
> .../admin-guide/kernel-parameters.txt | 9 +++
> block/blk-mq-cpumap.c | 2 +-
> drivers/virtio/virtio_vdpa.c | 2 +-
> fs/fuse/virtio_fs.c | 2 +-
> include/linux/group_cpus.h | 2 +-
> kernel/irq/affinity.c | 11 ++--
> lib/group_cpus.c | 55 ++++++++++++++++++-
> 7 files changed, 73 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9b61097a6448..ac80f35d04c9 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3238,6 +3238,15 @@
> different yeeloong laptops.
> Example: machtype=lemote-yeeloong-2f-7inch
>
> + managed_irqs_per_node=
> + [KNL,SMP] Support for limiting the number of managed
> + interrupts on every node to prevent the case that
> + interrupts cannot be properly allocated where a large
> + number of devices are present. The default number is 0,
> + that means no limit to the number of managed irqs.
> + Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
> + Default: 0
Kernel parameters suck. Esp. here you have to guess to even properly
boot. Could this be auto-tuned instead?
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -11,6 +11,30 @@
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
This is quite excessive. On SUSE configs, this is 8192 cpu bits * 1024
nodes = 1 M. For everyone. You have to allocate this dynamically
instead. See e.g. setup_node_to_cpumask_map().
> + [0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
> +
> +static int __init irq_managed_setup(char *str)
> +{
> + int ret;
> +
> + ret = kstrtouint(str, 10, &managed_irqs_per_node);
> + if (ret < 0) {
> + pr_warn("managed_irqs_per_node= cannot parse, ignored\n");
could not be parsed
> + return 0;
> + }
> +
> + if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
> + managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
> + pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
> + managed_irqs_per_node);
> + }
> + return 1;
> +}
> +__setup("managed_irqs_per_node=", irq_managed_setup);
> +
> static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> unsigned int cpus_per_grp)
> {
...
> @@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
> /**
> * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
> * @numgrps: number of groups
> + * @is_managed: if these groups managed by kernel
are managed by the kernel
> *
> * Return: cpumask array if successful, NULL otherwise. And each element
> * includes CPUs assigned to this group
thanks,
--
js
suse labs
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-11-01 3:03 ` mapicccy
@ 2024-11-01 23:37 ` Thomas Gleixner
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-11-01 23:37 UTC (permalink / raw)
To: mapicccy
Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
On Fri, Nov 01 2024 at 11:03, mapicccy wrote:
>> 2024年10月31日 18:35,Thomas Gleixner <tglx@linutronix.de> 写道:
>>> + get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>>> +
>>> + for_each_node_mask(n, nodemsk) {
>>> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>>> + cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>>
>> How is this managed_irqs_cpumsk array protected against concurrency?
>
> My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`,
> even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time.
That may have been your intention, but how is this even remotely
correct?
Aside of that. If it's intentional and you think it's correct then you
should have documented that in the code and also annotated it to not
trigger santiziers.
>> Given the limitations of the x86 vector space, which is not going away
>> anytime soon, there are only two options IMO to handle such a scenario.
>>
>> 1) Tell the nvme/block layer to disable queue affinity management
>>
>> 2) Restrict the devices and queues to the nodes they sit on
>
> I have tried fixing this issue through nvme driver, but later
> discovered that the same issue exists with virtio net. Therefore, I
> want to address this with a more general solution.
I understand, but a general solution for this problem won't exist
ever.
It's very reasonable to restrict this for one particular device type or
subsystem while maintaining the strict managed property for others, no?
General solutions are definitely preferred, but not for the price that
they break existing completely correct and working setups. Which is what
your 2/2 patch does for sure.
Thanks,
tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
2024-11-01 7:06 ` Jiri Slaby
@ 2024-11-02 16:30 ` kernel test robot
2024-11-02 16:41 ` kernel test robot
3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-11-02 16:30 UTC (permalink / raw)
To: 'Guanjun'; +Cc: oe-kbuild-all
Hi 'Guanjun',
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/irq/core]
[also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824
base: tip/irq/core
patch link: https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com
patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030019.Ho11gGvG-lkp@intel.com/
All errors (new ones prefixed by >>):
>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
| ^~~~~~~~~~~~~~~~~
In file included from lib/group_cpus.c:10:
include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int, bool)' {aka 'struct cpumask *(unsigned int, _Bool)'}
12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
| ^~~~~~~~~~~~~~~~~
In file included from include/linux/linkage.h:7,
from include/linux/kernel.h:18,
from lib/group_cpus.c:6:
lib/group_cpus.c:492:19: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
| ^~~~~~~~~~~~~~~~~
include/linux/export.h:56:28: note: in definition of macro '__EXPORT_SYMBOL'
56 | extern typeof(sym) sym; \
| ^~~
include/linux/export.h:69:41: note: in expansion of macro '_EXPORT_SYMBOL'
69 | #define EXPORT_SYMBOL_GPL(sym) _EXPORT_SYMBOL(sym, "GPL")
| ^~~~~~~~~~~~~~
lib/group_cpus.c:492:1: note: in expansion of macro 'EXPORT_SYMBOL_GPL'
492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
| ^~~~~~~~~~~~~~~~~
include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int, bool)' {aka 'struct cpumask *(unsigned int, _Bool)'}
12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
| ^~~~~~~~~~~~~~~~~
vim +480 lib/group_cpus.c
f7b3ea8cf72f3d Ming Lei 2022-12-27 379
f7b3ea8cf72f3d Ming Lei 2022-12-27 380 /**
f7b3ea8cf72f3d Ming Lei 2022-12-27 381 * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei 2022-12-27 382 * @numgrps: number of groups
59cfa36232b76f Guanjun 2024-10-31 383 * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei 2022-12-27 384 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 385 * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei 2022-12-27 386 * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei 2022-12-27 387 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 388 * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei 2022-12-27 389 * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei 2022-12-27 390 * 1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei 2022-12-27 391 * 2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei 2022-12-27 392 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 393 * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei 2022-12-27 394 * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei 2022-12-27 395 */
59cfa36232b76f Guanjun 2024-10-31 396 struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei 2022-12-27 397 {
f7b3ea8cf72f3d Ming Lei 2022-12-27 398 unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei 2022-12-27 399 cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei 2022-12-27 400 cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 401 int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei 2022-12-27 402 struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 403
f7b3ea8cf72f3d Ming Lei 2022-12-27 404 if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei 2022-12-27 405 return NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 406
f7b3ea8cf72f3d Ming Lei 2022-12-27 407 if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei 2022-12-27 408 goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 409
f7b3ea8cf72f3d Ming Lei 2022-12-27 410 node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei 2022-12-27 411 if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei 2022-12-27 412 goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 413
f7b3ea8cf72f3d Ming Lei 2022-12-27 414 masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei 2022-12-27 415 if (!masks)
f7b3ea8cf72f3d Ming Lei 2022-12-27 416 goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei 2022-12-27 417
f7b3ea8cf72f3d Ming Lei 2022-12-27 418 build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei 2022-12-27 419
0263f92fadbb9d Ming Lei 2023-11-20 420 /*
0263f92fadbb9d Ming Lei 2023-11-20 421 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei 2023-11-20 422 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei 2023-11-20 423 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei 2023-11-20 424 * hotplug code.
0263f92fadbb9d Ming Lei 2023-11-20 425 *
0263f92fadbb9d Ming Lei 2023-11-20 426 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei 2023-11-20 427 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei 2023-11-20 428 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei 2023-11-20 429 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei 2023-11-20 430 * optimization.
0263f92fadbb9d Ming Lei 2023-11-20 431 */
0263f92fadbb9d Ming Lei 2023-11-20 432 cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei 2023-11-20 433
59cfa36232b76f Guanjun 2024-10-31 434 /* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun 2024-10-31 435 if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun 2024-10-31 436 __group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun 2024-10-31 437
f7b3ea8cf72f3d Ming Lei 2022-12-27 438 /* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei 2022-12-27 439 ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei 2023-11-20 440 npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 441 if (ret < 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 442 goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei 2022-12-27 443 nr_present = ret;
f7b3ea8cf72f3d Ming Lei 2022-12-27 444
f7b3ea8cf72f3d Ming Lei 2022-12-27 445 /*
f7b3ea8cf72f3d Ming Lei 2022-12-27 446 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei 2022-12-27 447 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei 2022-12-27 448 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei 2022-12-27 449 * allocated out groups.
f7b3ea8cf72f3d Ming Lei 2022-12-27 450 */
f7b3ea8cf72f3d Ming Lei 2022-12-27 451 if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei 2022-12-27 452 curgrp = 0;
f7b3ea8cf72f3d Ming Lei 2022-12-27 453 else
f7b3ea8cf72f3d Ming Lei 2022-12-27 454 curgrp = nr_present;
0263f92fadbb9d Ming Lei 2023-11-20 455 cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 456 ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei 2022-12-27 457 npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 458 if (ret >= 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 459 nr_others = ret;
f7b3ea8cf72f3d Ming Lei 2022-12-27 460
f7b3ea8cf72f3d Ming Lei 2022-12-27 461 fail_build_affinity:
f7b3ea8cf72f3d Ming Lei 2022-12-27 462 if (ret >= 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 463 WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei 2022-12-27 464
f7b3ea8cf72f3d Ming Lei 2022-12-27 465 fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei 2022-12-27 466 free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei 2022-12-27 467
f7b3ea8cf72f3d Ming Lei 2022-12-27 468 fail_npresmsk:
f7b3ea8cf72f3d Ming Lei 2022-12-27 469 free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 470
f7b3ea8cf72f3d Ming Lei 2022-12-27 471 fail_nmsk:
f7b3ea8cf72f3d Ming Lei 2022-12-27 472 free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 473 if (ret < 0) {
f7b3ea8cf72f3d Ming Lei 2022-12-27 474 kfree(masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 475 return NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 476 }
f7b3ea8cf72f3d Ming Lei 2022-12-27 477 return masks;
f7b3ea8cf72f3d Ming Lei 2022-12-27 478 }
188a569658584e Ingo Molnar 2023-01-18 479 #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei 2022-12-27 @480 struct cpumask *group_cpus_evenly(unsigned int numgrps)
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
` (2 preceding siblings ...)
2024-11-02 16:30 ` kernel test robot
@ 2024-11-02 16:41 ` kernel test robot
3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-11-02 16:41 UTC (permalink / raw)
To: 'Guanjun'; +Cc: llvm, oe-kbuild-all
Hi 'Guanjun',
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on tip/irq/core]
[also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824
base: tip/irq/core
patch link: https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com
patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030024.eIDPlX3p-lkp@intel.com/
All errors (new ones prefixed by >>):
>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'
480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
| ^
include/linux/group_cpus.h:12:17: note: previous declaration is here
12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
| ^
1 error generated.
vim +/group_cpus_evenly +480 lib/group_cpus.c
f7b3ea8cf72f3d Ming Lei 2022-12-27 379
f7b3ea8cf72f3d Ming Lei 2022-12-27 380 /**
f7b3ea8cf72f3d Ming Lei 2022-12-27 381 * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei 2022-12-27 382 * @numgrps: number of groups
59cfa36232b76f Guanjun 2024-10-31 383 * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei 2022-12-27 384 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 385 * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei 2022-12-27 386 * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei 2022-12-27 387 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 388 * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei 2022-12-27 389 * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei 2022-12-27 390 * 1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei 2022-12-27 391 * 2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei 2022-12-27 392 *
f7b3ea8cf72f3d Ming Lei 2022-12-27 393 * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei 2022-12-27 394 * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei 2022-12-27 395 */
59cfa36232b76f Guanjun 2024-10-31 396 struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei 2022-12-27 397 {
f7b3ea8cf72f3d Ming Lei 2022-12-27 398 unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei 2022-12-27 399 cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei 2022-12-27 400 cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 401 int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei 2022-12-27 402 struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 403
f7b3ea8cf72f3d Ming Lei 2022-12-27 404 if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei 2022-12-27 405 return NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 406
f7b3ea8cf72f3d Ming Lei 2022-12-27 407 if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei 2022-12-27 408 goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 409
f7b3ea8cf72f3d Ming Lei 2022-12-27 410 node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei 2022-12-27 411 if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei 2022-12-27 412 goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei 2022-12-27 413
f7b3ea8cf72f3d Ming Lei 2022-12-27 414 masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei 2022-12-27 415 if (!masks)
f7b3ea8cf72f3d Ming Lei 2022-12-27 416 goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei 2022-12-27 417
f7b3ea8cf72f3d Ming Lei 2022-12-27 418 build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei 2022-12-27 419
0263f92fadbb9d Ming Lei 2023-11-20 420 /*
0263f92fadbb9d Ming Lei 2023-11-20 421 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei 2023-11-20 422 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei 2023-11-20 423 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei 2023-11-20 424 * hotplug code.
0263f92fadbb9d Ming Lei 2023-11-20 425 *
0263f92fadbb9d Ming Lei 2023-11-20 426 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei 2023-11-20 427 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei 2023-11-20 428 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei 2023-11-20 429 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei 2023-11-20 430 * optimization.
0263f92fadbb9d Ming Lei 2023-11-20 431 */
0263f92fadbb9d Ming Lei 2023-11-20 432 cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei 2023-11-20 433
59cfa36232b76f Guanjun 2024-10-31 434 /* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun 2024-10-31 435 if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun 2024-10-31 436 __group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun 2024-10-31 437
f7b3ea8cf72f3d Ming Lei 2022-12-27 438 /* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei 2022-12-27 439 ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei 2023-11-20 440 npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 441 if (ret < 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 442 goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei 2022-12-27 443 nr_present = ret;
f7b3ea8cf72f3d Ming Lei 2022-12-27 444
f7b3ea8cf72f3d Ming Lei 2022-12-27 445 /*
f7b3ea8cf72f3d Ming Lei 2022-12-27 446 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei 2022-12-27 447 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei 2022-12-27 448 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei 2022-12-27 449 * allocated out groups.
f7b3ea8cf72f3d Ming Lei 2022-12-27 450 */
f7b3ea8cf72f3d Ming Lei 2022-12-27 451 if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei 2022-12-27 452 curgrp = 0;
f7b3ea8cf72f3d Ming Lei 2022-12-27 453 else
f7b3ea8cf72f3d Ming Lei 2022-12-27 454 curgrp = nr_present;
0263f92fadbb9d Ming Lei 2023-11-20 455 cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 456 ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei 2022-12-27 457 npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 458 if (ret >= 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 459 nr_others = ret;
f7b3ea8cf72f3d Ming Lei 2022-12-27 460
f7b3ea8cf72f3d Ming Lei 2022-12-27 461 fail_build_affinity:
f7b3ea8cf72f3d Ming Lei 2022-12-27 462 if (ret >= 0)
f7b3ea8cf72f3d Ming Lei 2022-12-27 463 WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei 2022-12-27 464
f7b3ea8cf72f3d Ming Lei 2022-12-27 465 fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei 2022-12-27 466 free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei 2022-12-27 467
f7b3ea8cf72f3d Ming Lei 2022-12-27 468 fail_npresmsk:
f7b3ea8cf72f3d Ming Lei 2022-12-27 469 free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 470
f7b3ea8cf72f3d Ming Lei 2022-12-27 471 fail_nmsk:
f7b3ea8cf72f3d Ming Lei 2022-12-27 472 free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei 2022-12-27 473 if (ret < 0) {
f7b3ea8cf72f3d Ming Lei 2022-12-27 474 kfree(masks);
f7b3ea8cf72f3d Ming Lei 2022-12-27 475 return NULL;
f7b3ea8cf72f3d Ming Lei 2022-12-27 476 }
f7b3ea8cf72f3d Ming Lei 2022-12-27 477 return masks;
f7b3ea8cf72f3d Ming Lei 2022-12-27 478 }
188a569658584e Ingo Molnar 2023-01-18 479 #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei 2022-12-27 @480 struct cpumask *group_cpus_evenly(unsigned int numgrps)
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-11-02 16:41 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
2024-11-01 3:34 ` Jason Wang
2024-11-01 3:03 ` mapicccy
2024-11-01 23:37 ` Thomas Gleixner
2024-11-01 7:06 ` Jiri Slaby
2024-11-02 16:30 ` kernel test robot
2024-11-02 16:41 ` kernel test robot
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.