* [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation.
@ 2024-10-31 7:46 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
0 siblings, 2 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>
On systems with a large number of devices, for example 256 NVMe block
devices with 2 I/O queues each, a few dozen interrupts cannot be
allocated and the allocation fails with -ENOSPC.
The reason is that the I/O queue interrupts are set up as managed
interrupts (i.e., their affinity is managed by the kernel), which leads to
an excessive number of IRQ matrix bits being reserved.
This patch series adds support for limiting the number of managed
interrupts per allocation to address this issue.
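As a rough illustration of why the reservations run out, the following userspace sketch models the bookkeeping. It assumes that a managed interrupt reserves one vector on every CPU in its affinity mask, that each device spreads its queues over all CPUs so every CPU holds one reservation per device, and that about 200 vectors per CPU are usable for devices on x86. The `toy_*` names and the 200 figure are illustrative assumptions, not taken from the patches.

```c
/* Assumed number of device-usable vectors per x86 CPU (illustrative only). */
#define TOY_VECTORS_PER_CPU 200UL

/*
 * Matrix bits reserved system-wide: every device reserves one bit on
 * each CPU, because its queues together cover all CPUs exactly once.
 */
static unsigned long toy_reserved_bits(unsigned long devices, unsigned long cpus)
{
	return devices * cpus;
}

/* Matrix bits available system-wide under the assumed per-CPU budget. */
static unsigned long toy_capacity_bits(unsigned long cpus)
{
	return cpus * TOY_VECTORS_PER_CPU;
}

/* Returns 1 if all managed reservations fit, 0 otherwise. */
static int toy_fits(unsigned long devices, unsigned long cpus)
{
	return toy_reserved_bits(devices, cpus) <= toy_capacity_bits(cpus);
}
```

Under these assumptions, 256 devices need 256 reservations on each CPU, which exceeds the assumed budget regardless of the CPU count, so `toy_fits(256, 64)` is false while `toy_fits(128, 64)` is true.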
Thanks,
Guanjun
Guanjun (2):
genirq/affinity: add support for limiting managed interrupts
genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out
in the affinity
.../admin-guide/kernel-parameters.txt | 12 ++++
block/blk-mq-cpumap.c | 2 +-
drivers/virtio/virtio_vdpa.c | 2 +-
fs/fuse/virtio_fs.c | 2 +-
include/linux/group_cpus.h | 2 +-
include/linux/irq.h | 2 +
kernel/cpu.c | 2 +-
kernel/irq/affinity.c | 11 ++--
kernel/irq/cpuhotplug.c | 51 +++++++++++++++++
lib/group_cpus.c | 55 ++++++++++++++++++-
10 files changed, 130 insertions(+), 11 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
@ 2024-10-31 7:46 ` 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
` (3 more replies)
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'
1 sibling, 4 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun
From: Guanjun <guanjun@linux.alibaba.com>

Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts
treated as managed interrupts, in scenarios where a large number of
devices are present (using massive msix queue interrupts), an excessive number
of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
interrupt allocation. This sequently leads to the situation where interrupts
for some devices cannot be properly allocated.

Support for limiting the number of managed interrupts on every node per allocation.

Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
 .../admin-guide/kernel-parameters.txt         |  9 +++
 block/blk-mq-cpumap.c                         |  2 +-
 drivers/virtio/virtio_vdpa.c                  |  2 +-
 fs/fuse/virtio_fs.c                           |  2 +-
 include/linux/group_cpus.h                    |  2 +-
 kernel/irq/affinity.c                         | 11 ++--
 lib/group_cpus.c                              | 55 ++++++++++++++++++-
 7 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9b61097a6448..ac80f35d04c9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3238,6 +3238,15 @@
 			different yeeloong laptops.
 			Example: machtype=lemote-yeeloong-2f-7inch
 
+	managed_irqs_per_node=
+			[KNL,SMP] Support for limiting the number of managed
+			interrupts on every node to prevent the case that
+			interrupts cannot be properly allocated where a large
+			number of devices are present. The default number is 0,
+			that means no limit to the number of managed irqs.
+			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
+			Default: 0
+
 	maxcpus=	[SMP,EARLY] Maximum number of processors that an SMP kernel
 			will bring up during bootup.
 			maxcpus=n : n >= 0 limits the kernel to bring up 'n' processors. Surely after
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9638b25fd521..481c81318e00 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -20,7 +20,7 @@ void blk_mq_map_queues(struct blk_mq_queue_map *qmap)
 	const struct cpumask *masks;
 	unsigned int queue, cpu;
 
-	masks = group_cpus_evenly(qmap->nr_queues);
+	masks = group_cpus_evenly(qmap->nr_queues, true);
 	if (!masks) {
 		for_each_possible_cpu(cpu)
 			qmap->mq_map[cpu] = qmap->queue_offset;
diff --git a/drivers/virtio/virtio_vdpa.c b/drivers/virtio/virtio_vdpa.c
index 7364bd53e38d..cd303ac64046 100644
--- a/drivers/virtio/virtio_vdpa.c
+++ b/drivers/virtio/virtio_vdpa.c
@@ -330,7 +330,7 @@ create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs, true);
 
 		if (!result) {
 			kfree(masks);
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index f68527891929..41b3bcc03f9c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -881,7 +881,7 @@ static void virtio_fs_map_queues(struct virtio_device *vdev, struct virtio_fs *f
 		return;
 fallback:
 	/* Attempt to map evenly in groups over the CPUs */
-	masks = group_cpus_evenly(fs->num_request_queues);
+	masks = group_cpus_evenly(fs->num_request_queues, true);
 	/* If even this fails we default to all CPUs use queue zero */
 	if (!masks) {
 		for_each_possible_cpu(cpu)
diff --git a/include/linux/group_cpus.h b/include/linux/group_cpus.h
index e42807ec61f6..10a12b9a7ed4 100644
--- a/include/linux/group_cpus.h
+++ b/include/linux/group_cpus.h
@@ -9,6 +9,6 @@
 #include <linux/kernel.h>
 #include <linux/cpu.h>
 
-struct cpumask *group_cpus_evenly(unsigned int numgrps);
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
 
 #endif
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 44a4eba80315..775ab8537ddc 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -64,6 +64,10 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (curvec = 0; curvec < affd->pre_vectors; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
+	/* Mark the managed interrupts */
+	for (i = curvec; i < nvecs - affd->post_vectors; i++)
+		masks[i].is_managed = 1;
+
 	/*
 	 * Spread on present CPUs starting from affd->pre_vectors. If we
 	 * have multiple sets, build each sets affinity mask separately.
@@ -71,7 +75,8 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (i = 0, usedvecs = 0; i < affd->nr_sets; i++) {
 		unsigned int this_vecs = affd->set_size[i];
 		int j;
-		struct cpumask *result = group_cpus_evenly(this_vecs);
+		struct cpumask *result = group_cpus_evenly(this_vecs,
+				masks[curvec].is_managed);
 
 		if (!result) {
 			kfree(masks);
@@ -94,10 +99,6 @@ irq_create_affinity_masks(unsigned int nvecs, struct irq_affinity *affd)
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(&masks[curvec].mask, irq_default_affinity);
 
-	/* Mark the managed interrupts */
-	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
-		masks[i].is_managed = 1;
-
 	return masks;
 }
 
diff --git a/lib/group_cpus.c b/lib/group_cpus.c
index ee272c4cefcc..769a139491bc 100644
--- a/lib/group_cpus.c
+++ b/lib/group_cpus.c
@@ -11,6 +11,30 @@
 
 #ifdef CONFIG_SMP
 
+static unsigned int __read_mostly managed_irqs_per_node;
+static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
+	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
+};
+
+static int __init irq_managed_setup(char *str)
+{
+	int ret;
+
+	ret = kstrtouint(str, 10, &managed_irqs_per_node);
+	if (ret < 0) {
+		pr_warn("managed_irqs_per_node= cannot parse, ignored\n");
+		return 0;
+	}
+
+	if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
+		managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
+		pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
+			managed_irqs_per_node);
+	}
+	return 1;
+}
+__setup("managed_irqs_per_node=", irq_managed_setup);
+
 static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 				unsigned int cpus_per_grp)
 {
@@ -246,6 +270,30 @@ static void alloc_nodes_groups(unsigned int numgrps,
 	}
 }
 
+static void __group_prepare_affinity(struct cpumask *premask,
+				     cpumask_var_t *node_to_cpumask)
+{
+	nodemask_t nodemsk = NODE_MASK_NONE;
+	unsigned int ncpus, n;
+
+	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
+
+	for_each_node_mask(n, nodemsk) {
+		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
+		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
+
+		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
+		if (ncpus < managed_irqs_per_node) {
+			/* Reset node n to current node cpumask */
+			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
+			continue;
+		}
+
+		grp_spread_init_one(premask, &managed_irqs_cpumsk[n], managed_irqs_per_node);
+	}
+}
+
+
 static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 			       cpumask_var_t *node_to_cpumask,
 			       const struct cpumask *cpu_mask,
@@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
 /**
  * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
  * @numgrps: number of groups
+ * @is_managed: if these groups managed by kernel
  *
  * Return: cpumask array if successful, NULL otherwise. And each element
  * includes CPUs assigned to this group
@@ -344,7 +393,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
  * We guarantee in the resulted grouping that all CPUs are covered, and
  * no same CPU is assigned to multiple groups
  */
-struct cpumask *group_cpus_evenly(unsigned int numgrps)
+struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
 {
 	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
 	cpumask_var_t *node_to_cpumask;
@@ -382,6 +431,10 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
 	 */
 	cpumask_copy(npresmsk, data_race(cpu_present_mask));
 
+	/* Limit the count of managed interrupts on every node */
+	if (is_managed && managed_irqs_per_node)
+		__group_prepare_affinity(npresmsk, node_to_cpumask);
+
 	/* grouping present CPUs first */
 	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
 				  npresmsk, nmsk, masks);
--
2.43.5
^ permalink raw reply related [flat|nested] 11+ messages in thread
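The parse-and-clamp step in irq_managed_setup() can be modelled in userspace as below. This is only a sketch: `toy_parse_limit()` is a hypothetical stand-in that uses `strtoul()` instead of the kernel's `kstrtouint()`, and it takes the CPU and node counts as plain arguments. It compares against the quotient `cpus / nodes`, which for unsigned integers gives the same result as the patch's `limit * nodes > cpus` test but cannot wrap for large inputs.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse `str` as a decimal per-node limit and clamp it to cpus / nodes.
 * Returns 0 on success and -EINVAL if the string is not a plain
 * decimal number. Hypothetical userspace helper, not a kernel API.
 */
static int toy_parse_limit(const char *str, unsigned int cpus,
			   unsigned int nodes, unsigned int *limit)
{
	char *end;
	unsigned long val;

	if (nodes == 0)
		return -EINVAL;

	/* strtoul() would accept whitespace and signs; require a digit first. */
	if (*str < '0' || *str > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(str, &end, 10);
	if (errno || *end != '\0' || val > UINT_MAX)
		return -EINVAL;

	/* Compare against the quotient so no multiplication can overflow. */
	if (val > cpus / nodes)
		val = cpus / nodes;

	*limit = (unsigned int)val;
	return 0;
}
```

With 64 CPUs on 2 nodes, an input of "4" is kept as 4, while "100" is clamped to 32.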
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
2024-11-01 3:03 ` mapicccy
2024-11-01 7:06 ` Jiri Slaby
` (2 subsequent siblings)
3 siblings, 2 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-10-31 10:35 UTC (permalink / raw)
To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo, eperezma,
vgoyal, stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel
Cc: guanjun

On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> +				     cpumask_var_t *node_to_cpumask)
> +{
> +	nodemask_t nodemsk = NODE_MASK_NONE;
> +	unsigned int ncpus, n;
> +
> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> +	for_each_node_mask(n, nodemsk) {
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);

How is this managed_irqs_cpumsk array protected against concurrency?

> +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> +		if (ncpus < managed_irqs_per_node) {
> +			/* Reset node n to current node cpumask */
> +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);

This whole logic is incomprehensible and aside of the concurrency
problem it's broken when CPUs are made present at run-time because these
cpu masks are static and represent the stale state of the last
invocation.

Given the limitations of the x86 vector space, which is not going away
anytime soon, there are only two options IMO to handle such a scenario.

   1) Tell the nvme/block layer to disable queue affinity management

   2) Restrict the devices and queues to the nodes they sit on

Thanks,

        tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
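To make the two objections concrete, here is a toy userspace model of one way shared per-node selection state could be kept race-free and independent of earlier CPU masks. It is neither what the patch does nor something proposed in this thread: a 64-bit word stands in for `struct cpumask`, C11 atomics stand in for kernel primitives, and every `toy_*` name is hypothetical. The only shared state is a rotating per-node cursor advanced with an atomic fetch-and-add, and the node's CPU mask is passed in fresh on each call instead of being cached in a static array.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_MAX_NODES 4

/* Per-node rotating cursor; the only state shared between callers. */
static atomic_uint toy_cursor[TOY_MAX_NODES];

/* Count the set bits in a 64-bit CPU mask. */
static unsigned int toy_weight(uint64_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Pick up to `limit` CPUs out of the node's current mask, starting at
 * the node's cursor. Each caller claims its own window of the cursor
 * atomically, so concurrent callers never modify a shared mask.
 */
static uint64_t toy_pick_cpus(unsigned int node, uint64_t node_mask,
			      unsigned int limit)
{
	unsigned int ncpus = toy_weight(node_mask);
	unsigned int start, cpu, idx = 0;
	uint64_t picked = 0;

	if (node >= TOY_MAX_NODES || ncpus == 0 || limit == 0)
		return 0;
	if (limit > ncpus)
		limit = ncpus;

	/* Claim the window [start, start + limit) modulo ncpus. */
	start = atomic_fetch_add(&toy_cursor[node], limit) % ncpus;

	for (cpu = 0; cpu < 64; cpu++) {
		unsigned int off;

		if (!(node_mask & (1ULL << cpu)))
			continue;
		/* Distance of this CPU's ordinal from the window start. */
		off = (idx + ncpus - start) % ncpus;
		if (off < limit)
			picked |= 1ULL << cpu;
		idx++;
	}
	return picked;
}
```

For a node with CPUs 0-3 and a limit of 2, successive calls select CPUs {0,1}, then {2,3}, then {0,1} again.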
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 10:35 ` Thomas Gleixner
@ 2024-10-31 10:50 ` Ming Lei
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
2024-11-01 3:03 ` mapicccy
1 sibling, 1 reply; 11+ messages in thread
From: Ming Lei @ 2024-10-31 10:50 UTC (permalink / raw)
To: Thomas Gleixner, Christoph Hellwig
Cc: Guanjun, corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
stefanha, miklos, peterz, akpm, paulmck, thuth, rostedt, bp,
xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel

On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
> > #ifdef CONFIG_SMP
> >
> > +static unsigned int __read_mostly managed_irqs_per_node;
> > +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> > +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> > +};
> >
> > +static void __group_prepare_affinity(struct cpumask *premask,
> > +				     cpumask_var_t *node_to_cpumask)
> > +{
> > +	nodemask_t nodemsk = NODE_MASK_NONE;
> > +	unsigned int ncpus, n;
> > +
> > +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> > +
> > +	for_each_node_mask(n, nodemsk) {
> > +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> > +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> > +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> > +		if (ncpus < managed_irqs_per_node) {
> > +			/* Reset node n to current node cpumask */
> > +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management

+1

There are other use cases, such as cpu isolation, which can benefit from
this way too.

https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/

Thanks,
^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>]
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
@ 2024-11-01 3:34 ` Jason Wang
0 siblings, 0 replies; 11+ messages in thread
From: Jason Wang @ 2024-11-01 3:34 UTC (permalink / raw)
To: mapicccy
Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, corbet, axboe, mst,
xuanzhuo, eperezma, vgoyal, stefanha, miklos, peterz, akpm, paulmck,
thuth, rostedt, bp, xiongwei.song, linux-doc, linux-kernel,
linux-block, virtualization, linux-fsdevel

On Fri, Nov 1, 2024 at 11:12 AM mapicccy <guanjun@linux.alibaba.com> wrote:
>
>
>
> On Oct 31, 2024, at 18:50, Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Oct 31, 2024 at 6:35 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
>
> +static void __group_prepare_affinity(struct cpumask *premask,
> +				     cpumask_var_t *node_to_cpumask)
> +{
> +	nodemask_t nodemsk = NODE_MASK_NONE;
> +	unsigned int ncpus, n;
> +
> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
> +
> +	for_each_node_mask(n, nodemsk) {
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> How is this managed_irqs_cpumsk array protected against concurrency?
>
> +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
> +		if (ncpus < managed_irqs_per_node) {
> +			/* Reset node n to current node cpumask */
> +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.
>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management
>
>
> +1
>
> There are other use cases, such as cpu isolation, which can benefit from
> this way too.
>
> https://lore.kernel.org/linux-nvme/20240702104112.4123810-1-ming.lei@redhat.com/
>

I wonder if we need to do the same for virtio-blk.

>
> Thanks for your reminder. However, in this link only modified the NVMe driver,
> but there is the same issue in the virtio net driver as well.

I guess you meant virtio-blk actually?

>
> Guanjun
>
>
> Thanks,
>

Thanks
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
@ 2024-11-01 3:03 ` mapicccy
2024-11-01 23:37 ` Thomas Gleixner
1 sibling, 1 reply; 11+ messages in thread
From: mapicccy @ 2024-11-01 3:03 UTC (permalink / raw)
To: Thomas Gleixner
Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha,
miklos, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song,
linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel

> On Oct 31, 2024, at 18:35, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Oct 31 2024 at 15:46, guanjun@linux.alibaba.com wrote:
>> #ifdef CONFIG_SMP
>>
>> +static unsigned int __read_mostly managed_irqs_per_node;
>> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {
>> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
>> +};
>>
>> +static void __group_prepare_affinity(struct cpumask *premask,
>> +				     cpumask_var_t *node_to_cpumask)
>> +{
>> +	nodemask_t nodemsk = NODE_MASK_NONE;
>> +	unsigned int ncpus, n;
>> +
>> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>> +
>> +	for_each_node_mask(n, nodemsk) {
>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> How is this managed_irqs_cpumsk array protected against concurrency?

My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`,
even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time.

>
>> +		ncpus = cpumask_weight(&managed_irqs_cpumsk[n]);
>> +		if (ncpus < managed_irqs_per_node) {
>> +			/* Reset node n to current node cpumask */
>> +			cpumask_copy(&managed_irqs_cpumsk[n], node_to_cpumask[n]);
>
> This whole logic is incomprehensible and aside of the concurrency
> problem it's broken when CPUs are made present at run-time because these
> cpu masks are static and represent the stale state of the last
> invocation.

Sorry, I realize there is indeed a logic issue here (caused by developing on 5.10 LTS and
rebasing onto the latest linux-next).

>
> Given the limitations of the x86 vector space, which is not going away
> anytime soon, there are only two options IMO to handle such a scenario.
>
> 1) Tell the nvme/block layer to disable queue affinity management
>
> 2) Restrict the devices and queues to the nodes they sit on

I have tried fixing this issue through nvme driver, but later discovered that the same issue exists with virtio net.
Therefore, I want to address this with a more general solution.

Thanks,
Guanjun

>
> Thanks,
>
> tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-11-01 3:03 ` mapicccy
@ 2024-11-01 23:37 ` Thomas Gleixner
0 siblings, 0 replies; 11+ messages in thread
From: Thomas Gleixner @ 2024-11-01 23:37 UTC (permalink / raw)
To: mapicccy
Cc: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal, stefanha,
miklos, peterz, akpm, paulmck, thuth, rostedt, bp, xiongwei.song,
linux-doc, linux-kernel, linux-block, virtualization, linux-fsdevel

On Fri, Nov 01 2024 at 11:03, mapicccy wrote:
>> On Oct 31, 2024, at 18:35, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> +	get_nodes_in_cpumask(node_to_cpumask, premask, &nodemsk);
>>> +
>>> +	for_each_node_mask(n, nodemsk) {
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], premask);
>>> +		cpumask_and(&managed_irqs_cpumsk[n], &managed_irqs_cpumsk[n], node_to_cpumask[n]);
>>
>> How is this managed_irqs_cpumsk array protected against concurrency?
>
> My intention was to allocate up to `managed_irq_per_node` cpu bits from `managed_irqs_cpumask[n]`,
> even if another task modifies some of the bits in the `managed_irqs_cpumask[n]` at the same time.

That may have been your intention, but how is this even remotely
correct?

Aside of that. If it's intentional and you think it's correct then you
should have documented that in the code and also annotated it to not
trigger sanitizers.

>> Given the limitations of the x86 vector space, which is not going away
>> anytime soon, there are only two options IMO to handle such a scenario.
>>
>> 1) Tell the nvme/block layer to disable queue affinity management
>>
>> 2) Restrict the devices and queues to the nodes they sit on
>
> I have tried fixing this issue through nvme driver, but later
> discovered that the same issue exists with virtio net. Therefore, I
> want to address this with a more general solution.

I understand, but a general solution for this problem won't exist ever.

It's very reasonable to restrict this for one particular device type or
subsystem while maintaining the strict managed property for others, no?

General solutions are definitely preferred, but not for the price that
they break existing completely correct and working setups. Which is what
your 2/2 patch does for sure.

Thanks,

        tglx
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
@ 2024-11-01 7:06 ` Jiri Slaby
2024-11-02 16:30 ` kernel test robot
2024-11-02 16:41 ` kernel test robot
3 siblings, 0 replies; 11+ messages in thread
From: Jiri Slaby @ 2024-11-01 7:06 UTC (permalink / raw)
To: 'Guanjun', corbet, axboe, mst, jasowang, xuanzhuo, eperezma,
vgoyal, stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt,
bp, xiongwei.song, linux-doc, linux-kernel, linux-block,
virtualization, linux-fsdevel

Hi,

On 31. 10. 24, 8:46, 'Guanjun' wrote:
> From: Guanjun <guanjun@linux.alibaba.com>
>
> Commit c410abbbacb9 (genirq/affinity: Add is_managed to struct irq_affinity_desc)
> introduced is_managed bit to struct irq_affinity_desc. Due to queue interrupts
> treated as managed interrupts, in scenarios where a large number of
> devices are present (using massive msix queue interrupts), an excessive number
> of IRQ matrix bits (about num_online_cpus() * nvecs) are reserved during
> interrupt allocation. This sequently leads to the situation where interrupts
> for some devices cannot be properly allocated.
>
> Support for limiting the number of managed interrupts on every node per allocation.
>
> Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
> ---
> .../admin-guide/kernel-parameters.txt | 9 +++
> block/blk-mq-cpumap.c | 2 +-
> drivers/virtio/virtio_vdpa.c | 2 +-
> fs/fuse/virtio_fs.c | 2 +-
> include/linux/group_cpus.h | 2 +-
> kernel/irq/affinity.c | 11 ++--
> lib/group_cpus.c | 55 ++++++++++++++++++-
> 7 files changed, 73 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9b61097a6448..ac80f35d04c9 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3238,6 +3238,15 @@
> 			different yeeloong laptops.
> 			Example: machtype=lemote-yeeloong-2f-7inch
>
> +	managed_irqs_per_node=
> +			[KNL,SMP] Support for limiting the number of managed
> +			interrupts on every node to prevent the case that
> +			interrupts cannot be properly allocated where a large
> +			number of devices are present. The default number is 0,
> +			that means no limit to the number of managed irqs.
> +			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
> +			Default: 0

Kernel parameters suck. Esp. here you have to guess to even properly
boot. Could this be auto-tuned instead?

> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -11,6 +11,30 @@
>
> #ifdef CONFIG_SMP
>
> +static unsigned int __read_mostly managed_irqs_per_node;
> +static struct cpumask managed_irqs_cpumsk[MAX_NUMNODES] __cacheline_aligned_in_smp = {

This is quite excessive. On SUSE configs, this is 8192 cpu bits * 1024
nodes = 1 M. For everyone. You have to allocate this dynamically
instead. See e.g. setup_node_to_cpumask_map().

> +	[0 ... MAX_NUMNODES-1] = {CPU_BITS_ALL}
> +};
> +
> +static int __init irq_managed_setup(char *str)
> +{
> +	int ret;
> +
> +	ret = kstrtouint(str, 10, &managed_irqs_per_node);
> +	if (ret < 0) {
> +		pr_warn("managed_irqs_per_node= cannot parse, ignored\n");

could not be parsed

> +		return 0;
> +	}
> +
> +	if (managed_irqs_per_node * num_possible_nodes() > num_possible_cpus()) {
> +		managed_irqs_per_node = num_possible_cpus() / num_possible_nodes();
> +		pr_warn("managed_irqs_per_node= cannot be larger than %u\n",
> +			managed_irqs_per_node);
> +	}
> +	return 1;
> +}
> +__setup("managed_irqs_per_node=", irq_managed_setup);
> +
> static void grp_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
> 				unsigned int cpus_per_grp)
> {

...

> @@ -332,6 +380,7 @@ static int __group_cpus_evenly(unsigned int startgrp, unsigned int numgrps,
> /**
>  * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
>  * @numgrps: number of groups
> + * @is_managed: if these groups managed by kernel

are managed by the kernel

>  *
>  * Return: cpumask array if successful, NULL otherwise. And each element
>  * includes CPUs assigned to this group

thanks,
--
js
suse labs
^ permalink raw reply [flat|nested] 11+ messages in thread
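The size estimate in the review can be checked with a little arithmetic. The sketch below assumes a CPU mask is a plain bitmap stored in whole `unsigned long` words and ignores any alignment padding; the `toy_*` helpers are hypothetical, the 8192 and 1024 figures are the ones quoted in the review, and the 2-node case is only an illustrative comparison for an array sized by the nodes actually present.

```c
#include <stddef.h>

#define TOY_BITS_PER_BYTE 8u

/* Bytes needed for one CPU bitmap of nr_cpu_bits bits, rounded up to whole longs. */
static size_t toy_cpumask_bytes(size_t nr_cpu_bits)
{
	size_t bits_per_long = sizeof(unsigned long) * TOY_BITS_PER_BYTE;
	size_t longs = (nr_cpu_bits + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

/* Footprint of an array holding one CPU bitmap per node slot. */
static size_t toy_array_bytes(size_t nr_cpu_bits, size_t nr_nodes)
{
	return toy_cpumask_bytes(nr_cpu_bits) * nr_nodes;
}
```

With 8192 CPU bits and 1024 node slots the static array takes 1 MiB, whereas an array sized for a 2-node machine would take 2 KiB in this model.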
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun' 2024-10-31 10:35 ` Thomas Gleixner 2024-11-01 7:06 ` Jiri Slaby @ 2024-11-02 16:30 ` kernel test robot 2024-11-02 16:41 ` kernel test robot 3 siblings, 0 replies; 11+ messages in thread From: kernel test robot @ 2024-11-02 16:30 UTC (permalink / raw) To: 'Guanjun'; +Cc: oe-kbuild-all Hi 'Guanjun', [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on tip/irq/core] [also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824 base: tip/irq/core patch link: https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/config) compiler: alpha-linux-gcc (GCC) 13.3.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030019.Ho11gGvG-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030019.Ho11gGvG-lkp@intel.com/

All errors (new ones prefixed by >>):

>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
     480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
         |                 ^~~~~~~~~~~~~~~~~
   In file included from lib/group_cpus.c:10:
   include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int, bool)' {aka 'struct cpumask *(unsigned int, _Bool)'}
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^~~~~~~~~~~~~~~~~
   In file included from include/linux/linkage.h:7,
                    from include/linux/kernel.h:18,
                    from lib/group_cpus.c:6:
   lib/group_cpus.c:492:19: error: conflicting types for 'group_cpus_evenly'; have 'struct cpumask *(unsigned int)'
     492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
         |                   ^~~~~~~~~~~~~~~~~
   include/linux/export.h:56:28: note: in definition of macro '__EXPORT_SYMBOL'
      56 |         extern typeof(sym) sym;                                 \
         |                            ^~~
   include/linux/export.h:69:41: note: in expansion of macro '_EXPORT_SYMBOL'
      69 | #define EXPORT_SYMBOL_GPL(sym)          _EXPORT_SYMBOL(sym, "GPL")
         |                                         ^~~~~~~~~~~~~~
   lib/group_cpus.c:492:1: note: in expansion of macro 'EXPORT_SYMBOL_GPL'
     492 | EXPORT_SYMBOL_GPL(group_cpus_evenly);
         | ^~~~~~~~~~~~~~~~~
   include/linux/group_cpus.h:12:17: note: previous declaration of 'group_cpus_evenly' with type 'struct cpumask *(unsigned int, bool)' {aka 'struct cpumask *(unsigned int, _Bool)'}
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^~~~~~~~~~~~~~~~~


vim +480 lib/group_cpus.c

f7b3ea8cf72f3d Ming Lei    2022-12-27  379
f7b3ea8cf72f3d Ming Lei    2022-12-27  380  /**
f7b3ea8cf72f3d Ming Lei    2022-12-27  381   * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei    2022-12-27  382   * @numgrps: number of groups
59cfa36232b76f Guanjun     2024-10-31  383   * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei    2022-12-27  384   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  385   * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei    2022-12-27  386   * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei    2022-12-27  387   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  388   * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei    2022-12-27  389   * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei    2022-12-27  390   *	1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei    2022-12-27  391   *	2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei    2022-12-27  392   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  393   * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei    2022-12-27  394   * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei    2022-12-27  395   */
59cfa36232b76f Guanjun     2024-10-31  396  struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei    2022-12-27  397  {
f7b3ea8cf72f3d Ming Lei    2022-12-27  398  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  399  	cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  400  	cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  401  	int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei    2022-12-27  402  	struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  403
f7b3ea8cf72f3d Ming Lei    2022-12-27  404  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  405  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  406
f7b3ea8cf72f3d Ming Lei    2022-12-27  407  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  408  		goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  409
f7b3ea8cf72f3d Ming Lei    2022-12-27  410  	node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei    2022-12-27  411  	if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei    2022-12-27  412  		goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  413
f7b3ea8cf72f3d Ming Lei    2022-12-27  414  	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei    2022-12-27  415  	if (!masks)
f7b3ea8cf72f3d Ming Lei    2022-12-27  416  		goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  417
f7b3ea8cf72f3d Ming Lei    2022-12-27  418  	build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  419
0263f92fadbb9d Ming Lei    2023-11-20  420  	/*
0263f92fadbb9d Ming Lei    2023-11-20  421  	 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei    2023-11-20  422  	 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei    2023-11-20  423  	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei    2023-11-20  424  	 * hotplug code.
0263f92fadbb9d Ming Lei    2023-11-20  425  	 *
0263f92fadbb9d Ming Lei    2023-11-20  426  	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei    2023-11-20  427  	 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei    2023-11-20  428  	 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei    2023-11-20  429  	 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei    2023-11-20  430  	 * optimization.
0263f92fadbb9d Ming Lei    2023-11-20  431  	 */
0263f92fadbb9d Ming Lei    2023-11-20  432  	cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei    2023-11-20  433
59cfa36232b76f Guanjun     2024-10-31  434  	/* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun     2024-10-31  435  	if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun     2024-10-31  436  		__group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun     2024-10-31  437
f7b3ea8cf72f3d Ming Lei    2022-12-27  438  	/* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei    2022-12-27  439  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei    2023-11-20  440  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  441  	if (ret < 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  442  		goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei    2022-12-27  443  	nr_present = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  444
f7b3ea8cf72f3d Ming Lei    2022-12-27  445  	/*
f7b3ea8cf72f3d Ming Lei    2022-12-27  446  	 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei    2022-12-27  447  	 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei    2022-12-27  448  	 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei    2022-12-27  449  	 * allocated out groups.
f7b3ea8cf72f3d Ming Lei    2022-12-27  450  	 */
f7b3ea8cf72f3d Ming Lei    2022-12-27  451  	if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei    2022-12-27  452  		curgrp = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  453  	else
f7b3ea8cf72f3d Ming Lei    2022-12-27  454  		curgrp = nr_present;
0263f92fadbb9d Ming Lei    2023-11-20  455  	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  456  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei    2022-12-27  457  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  458  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  459  		nr_others = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  460
f7b3ea8cf72f3d Ming Lei    2022-12-27  461   fail_build_affinity:
f7b3ea8cf72f3d Ming Lei    2022-12-27  462  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  463  		WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei    2022-12-27  464
f7b3ea8cf72f3d Ming Lei    2022-12-27  465   fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei    2022-12-27  466  	free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  467
f7b3ea8cf72f3d Ming Lei    2022-12-27  468   fail_npresmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  469  	free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  470
f7b3ea8cf72f3d Ming Lei    2022-12-27  471   fail_nmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  472  	free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  473  	if (ret < 0) {
f7b3ea8cf72f3d Ming Lei    2022-12-27  474  		kfree(masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  475  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  476  	}
f7b3ea8cf72f3d Ming Lei    2022-12-27  477  	return masks;
f7b3ea8cf72f3d Ming Lei    2022-12-27  478  }
188a569658584e Ingo Molnar 2023-01-18  479  #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei    2022-12-27 @480  struct cpumask *group_cpus_evenly(unsigned int numgrps)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
  2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
  ` (2 preceding siblings ...)
  2024-11-02 16:30 ` kernel test robot
@ 2024-11-02 16:41 ` kernel test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2024-11-02 16:41 UTC (permalink / raw)
To: 'Guanjun'; +Cc: llvm, oe-kbuild-all

Hi 'Guanjun',

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on tip/irq/core]
[also build test ERROR on axboe-block/for-next mszeredi-fuse/for-next tip/smp/core linus/master v6.12-rc5 next-20241101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Guanjun/genirq-affinity-add-support-for-limiting-managed-interrupts/20241031-154824
base: tip/irq/core
patch link: https://lore.kernel.org/r/20241031074618.3585491-2-guanjun%40linux.alibaba.com
patch subject: [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 639a7ac648f1e50ccd2556e17d401c04f9cce625)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241103/202411030024.eIDPlX3p-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202411030024.eIDPlX3p-lkp@intel.com/

All errors (new ones prefixed by >>):

>> lib/group_cpus.c:480:17: error: conflicting types for 'group_cpus_evenly'
     480 | struct cpumask *group_cpus_evenly(unsigned int numgrps)
         |                 ^
   include/linux/group_cpus.h:12:17: note: previous declaration is here
      12 | struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);
         |                 ^
   1 error generated.


vim +/group_cpus_evenly +480 lib/group_cpus.c

f7b3ea8cf72f3d Ming Lei    2022-12-27  379
f7b3ea8cf72f3d Ming Lei    2022-12-27  380  /**
f7b3ea8cf72f3d Ming Lei    2022-12-27  381   * group_cpus_evenly - Group all CPUs evenly per NUMA/CPU locality
f7b3ea8cf72f3d Ming Lei    2022-12-27  382   * @numgrps: number of groups
59cfa36232b76f Guanjun     2024-10-31  383   * @is_managed: if these groups managed by kernel
f7b3ea8cf72f3d Ming Lei    2022-12-27  384   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  385   * Return: cpumask array if successful, NULL otherwise. And each element
f7b3ea8cf72f3d Ming Lei    2022-12-27  386   * includes CPUs assigned to this group
f7b3ea8cf72f3d Ming Lei    2022-12-27  387   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  388   * Try to put close CPUs from viewpoint of CPU and NUMA locality into
f7b3ea8cf72f3d Ming Lei    2022-12-27  389   * same group, and run two-stage grouping:
f7b3ea8cf72f3d Ming Lei    2022-12-27  390   *	1) allocate present CPUs on these groups evenly first
f7b3ea8cf72f3d Ming Lei    2022-12-27  391   *	2) allocate other possible CPUs on these groups evenly
f7b3ea8cf72f3d Ming Lei    2022-12-27  392   *
f7b3ea8cf72f3d Ming Lei    2022-12-27  393   * We guarantee in the resulted grouping that all CPUs are covered, and
f7b3ea8cf72f3d Ming Lei    2022-12-27  394   * no same CPU is assigned to multiple groups
f7b3ea8cf72f3d Ming Lei    2022-12-27  395   */
59cfa36232b76f Guanjun     2024-10-31  396  struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
f7b3ea8cf72f3d Ming Lei    2022-12-27  397  {
f7b3ea8cf72f3d Ming Lei    2022-12-27  398  	unsigned int curgrp = 0, nr_present = 0, nr_others = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  399  	cpumask_var_t *node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  400  	cpumask_var_t nmsk, npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  401  	int ret = -ENOMEM;
f7b3ea8cf72f3d Ming Lei    2022-12-27  402  	struct cpumask *masks = NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  403
f7b3ea8cf72f3d Ming Lei    2022-12-27  404  	if (!zalloc_cpumask_var(&nmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  405  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  406
f7b3ea8cf72f3d Ming Lei    2022-12-27  407  	if (!zalloc_cpumask_var(&npresmsk, GFP_KERNEL))
f7b3ea8cf72f3d Ming Lei    2022-12-27  408  		goto fail_nmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  409
f7b3ea8cf72f3d Ming Lei    2022-12-27  410  	node_to_cpumask = alloc_node_to_cpumask();
f7b3ea8cf72f3d Ming Lei    2022-12-27  411  	if (!node_to_cpumask)
f7b3ea8cf72f3d Ming Lei    2022-12-27  412  		goto fail_npresmsk;
f7b3ea8cf72f3d Ming Lei    2022-12-27  413
f7b3ea8cf72f3d Ming Lei    2022-12-27  414  	masks = kcalloc(numgrps, sizeof(*masks), GFP_KERNEL);
f7b3ea8cf72f3d Ming Lei    2022-12-27  415  	if (!masks)
f7b3ea8cf72f3d Ming Lei    2022-12-27  416  		goto fail_node_to_cpumask;
f7b3ea8cf72f3d Ming Lei    2022-12-27  417
f7b3ea8cf72f3d Ming Lei    2022-12-27  418  	build_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  419
0263f92fadbb9d Ming Lei    2023-11-20  420  	/*
0263f92fadbb9d Ming Lei    2023-11-20  421  	 * Make a local cache of 'cpu_present_mask', so the two stages
0263f92fadbb9d Ming Lei    2023-11-20  422  	 * spread can observe consistent 'cpu_present_mask' without holding
0263f92fadbb9d Ming Lei    2023-11-20  423  	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
0263f92fadbb9d Ming Lei    2023-11-20  424  	 * hotplug code.
0263f92fadbb9d Ming Lei    2023-11-20  425  	 *
0263f92fadbb9d Ming Lei    2023-11-20  426  	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
0263f92fadbb9d Ming Lei    2023-11-20  427  	 * we can live with the case because it only affects that hotplug
0263f92fadbb9d Ming Lei    2023-11-20  428  	 * CPU is handled in the 1st or 2nd stage, and either way is correct
0263f92fadbb9d Ming Lei    2023-11-20  429  	 * from API user viewpoint since 2-stage spread is sort of
0263f92fadbb9d Ming Lei    2023-11-20  430  	 * optimization.
0263f92fadbb9d Ming Lei    2023-11-20  431  	 */
0263f92fadbb9d Ming Lei    2023-11-20  432  	cpumask_copy(npresmsk, data_race(cpu_present_mask));
0263f92fadbb9d Ming Lei    2023-11-20  433
59cfa36232b76f Guanjun     2024-10-31  434  	/* Limit the count of managed interrupts on every node */
59cfa36232b76f Guanjun     2024-10-31  435  	if (is_managed && managed_irqs_per_node)
59cfa36232b76f Guanjun     2024-10-31  436  		__group_prepare_affinity(npresmsk, node_to_cpumask);
59cfa36232b76f Guanjun     2024-10-31  437
f7b3ea8cf72f3d Ming Lei    2022-12-27  438  	/* grouping present CPUs first */
f7b3ea8cf72f3d Ming Lei    2022-12-27  439  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
0263f92fadbb9d Ming Lei    2023-11-20  440  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  441  	if (ret < 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  442  		goto fail_build_affinity;
f7b3ea8cf72f3d Ming Lei    2022-12-27  443  	nr_present = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  444
f7b3ea8cf72f3d Ming Lei    2022-12-27  445  	/*
f7b3ea8cf72f3d Ming Lei    2022-12-27  446  	 * Allocate non present CPUs starting from the next group to be
f7b3ea8cf72f3d Ming Lei    2022-12-27  447  	 * handled. If the grouping of present CPUs already exhausted the
f7b3ea8cf72f3d Ming Lei    2022-12-27  448  	 * group space, assign the non present CPUs to the already
f7b3ea8cf72f3d Ming Lei    2022-12-27  449  	 * allocated out groups.
f7b3ea8cf72f3d Ming Lei    2022-12-27  450  	 */
f7b3ea8cf72f3d Ming Lei    2022-12-27  451  	if (nr_present >= numgrps)
f7b3ea8cf72f3d Ming Lei    2022-12-27  452  		curgrp = 0;
f7b3ea8cf72f3d Ming Lei    2022-12-27  453  	else
f7b3ea8cf72f3d Ming Lei    2022-12-27  454  		curgrp = nr_present;
0263f92fadbb9d Ming Lei    2023-11-20  455  	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  456  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
f7b3ea8cf72f3d Ming Lei    2022-12-27  457  				  npresmsk, nmsk, masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  458  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  459  		nr_others = ret;
f7b3ea8cf72f3d Ming Lei    2022-12-27  460
f7b3ea8cf72f3d Ming Lei    2022-12-27  461   fail_build_affinity:
f7b3ea8cf72f3d Ming Lei    2022-12-27  462  	if (ret >= 0)
f7b3ea8cf72f3d Ming Lei    2022-12-27  463  		WARN_ON(nr_present + nr_others < numgrps);
f7b3ea8cf72f3d Ming Lei    2022-12-27  464
f7b3ea8cf72f3d Ming Lei    2022-12-27  465   fail_node_to_cpumask:
f7b3ea8cf72f3d Ming Lei    2022-12-27  466  	free_node_to_cpumask(node_to_cpumask);
f7b3ea8cf72f3d Ming Lei    2022-12-27  467
f7b3ea8cf72f3d Ming Lei    2022-12-27  468   fail_npresmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  469  	free_cpumask_var(npresmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  470
f7b3ea8cf72f3d Ming Lei    2022-12-27  471   fail_nmsk:
f7b3ea8cf72f3d Ming Lei    2022-12-27  472  	free_cpumask_var(nmsk);
f7b3ea8cf72f3d Ming Lei    2022-12-27  473  	if (ret < 0) {
f7b3ea8cf72f3d Ming Lei    2022-12-27  474  		kfree(masks);
f7b3ea8cf72f3d Ming Lei    2022-12-27  475  		return NULL;
f7b3ea8cf72f3d Ming Lei    2022-12-27  476  	}
f7b3ea8cf72f3d Ming Lei    2022-12-27  477  	return masks;
f7b3ea8cf72f3d Ming Lei    2022-12-27  478  }
188a569658584e Ingo Molnar 2023-01-18  479  #else /* CONFIG_SMP */
f7b3ea8cf72f3d Ming Lei    2022-12-27 @480  struct cpumask *group_cpus_evenly(unsigned int numgrps)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply [flat|nested] 11+ messages in thread
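Both build reports come down to one mismatch: the prototype in
include/linux/group_cpus.h now takes (unsigned int numgrps, bool is_managed),
while the definition at lib/group_cpus.c:480, under the #else /* CONFIG_SMP */
branch, still takes only numgrps. A minimal userspace sketch of a stub whose
signature agrees with the prototype is shown here. It is not the author's fix:
struct cpumask is a hypothetical one-word stand-in, calloc() replaces
kcalloc(), and both the body (CPU 0 goes to the first group) and the
numgrps == 0 guard are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct cpumask. */
struct cpumask {
	unsigned long bits;
};

/* Prototype with the two-argument signature from the patched header. */
struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed);

/*
 * Uniprocessor-style stub. It takes the same parameters as the prototype,
 * so declaration and definition agree. is_managed is ignored because there
 * is only one CPU to hand out (assumption for this sketch).
 */
struct cpumask *group_cpus_evenly(unsigned int numgrps, bool is_managed)
{
	struct cpumask *masks;

	(void)is_managed;

	/* Guard added for the sketch: no groups means nothing to return. */
	if (numgrps == 0)
		return NULL;

	masks = calloc(numgrps, sizeof(*masks));
	if (!masks)
		return NULL;

	/* Assign the single CPU (CPU 0) to the first group only. */
	masks[0].bits = 1UL;
	return masks;
}
```

With the parameter lists aligned, the EXPORT_SYMBOL_GPL() expansion reported
by gcc at line 492 would also see a single consistent type, since it
redeclares the symbol through typeof(sym).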
* [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity
  2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
  2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
@ 2024-10-31 7:46 ` 'Guanjun'
  1 sibling, 0 replies; 11+ messages in thread
From: 'Guanjun' @ 2024-10-31 7:46 UTC (permalink / raw)
To: corbet, axboe, mst, jasowang, xuanzhuo, eperezma, vgoyal,
	stefanha, miklos, tglx, peterz, akpm, paulmck, thuth, rostedt, bp,
	xiongwei.song, linux-doc, linux-kernel, linux-block,
	virtualization, linux-fsdevel
Cc: guanjun

From: Guanjun <guanjun@linux.alibaba.com>

Once we limit the number of managed interrupts, if the last online CPU in
the affinity goes offline, it will result in the interrupt becoming
unavailable until one of the assigned CPUs comes online again.

So prevent the last online CPU in the affinity from going offline,
and return -EBUSY in this situation.

Signed-off-by: Guanjun <guanjun@linux.alibaba.com>
---
 .../admin-guide/kernel-parameters.txt         |  3 ++
 include/linux/irq.h                           |  2 +
 kernel/cpu.c                                  |  2 +-
 kernel/irq/cpuhotplug.c                       | 51 +++++++++++++++++++
 4 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ac80f35d04c9..173598cbf4a6 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3244,6 +3244,9 @@
 			interrupts cannot be properly allocated where a large
 			number of devices are present. The default number is 0,
 			that means no limit to the number of managed irqs.
+			Once we limit the number of managed interrupts, taking
+			the last online CPU in the affinity offline will fail
+			with the error code -EBUSY.
 			Format: integer between 0 and num_possible_cpus() / num_possible_nodes()
 			Default: 0
 
diff --git a/include/linux/irq.h b/include/linux/irq.h
index fa711f80957b..68ce05a74079 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -615,8 +615,10 @@ extern int irq_set_vcpu_affinity(unsigned int irq, void *vcpu_info);
 #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_IRQ_MIGRATION)
 extern void irq_migrate_all_off_this_cpu(void);
 extern int irq_affinity_online_cpu(unsigned int cpu);
+extern int irq_affinity_offline_cpu(unsigned int cpu);
 #else
 # define irq_affinity_online_cpu	NULL
+# define irq_affinity_offline_cpu	NULL
 #endif
 
 #if defined(CONFIG_SMP) && defined(CONFIG_GENERIC_PENDING_IRQ)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index c4aaf73dec9e..672d920970b2 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2219,7 +2219,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
 	[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
 		.name			= "irq/affinity:online",
 		.startup.single		= irq_affinity_online_cpu,
-		.teardown.single	= NULL,
+		.teardown.single	= irq_affinity_offline_cpu,
 	},
 	[CPUHP_AP_PERF_ONLINE] = {
 		.name			= "perf:online",
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 15a7654eff68..e6f068198e4a 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -232,6 +232,31 @@ static void irq_restore_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
 		irq_set_affinity_locked(data, affinity, false);
 }
 
+static int irq_check_affinity_of_irq(struct irq_desc *desc, unsigned int cpu)
+{
+	struct irq_data *data = irq_desc_get_irq_data(desc);
+	const struct cpumask *affinity = irq_data_get_affinity_mask(data);
+	unsigned int cur;
+
+	if (!irqd_affinity_is_managed(data) || !desc->action ||
+	    !irq_data_get_irq_chip(data) || !cpumask_test_cpu(cpu, affinity))
+		return 0;
+
+	for_each_cpu(cur, affinity)
+		if (cur != cpu && cpumask_test_cpu(cur, cpu_online_mask))
+			return 0;
+
+	/*
+	 * If the ongoing offline CPU is the last one in the affinity,
+	 * the managed interrupts will be unavailable until one of
+	 * the assigned CPUs comes online. To prevent this unavailability,
+	 * return -EBUSY directly in this case.
+	 */
+	pr_warn("Affinity %*pbl of managed IRQ%u contains only one CPU%u that is online\n",
+		cpumask_pr_args(affinity), data->irq, cpu);
+	return -EBUSY;
+}
+
 /**
  * irq_affinity_online_cpu - Restore affinity for managed interrupts
  * @cpu:	Upcoming CPU for which interrupts should be restored
@@ -252,3 +277,29 @@ int irq_affinity_online_cpu(unsigned int cpu)
 
 	return 0;
 }
+
+/**
+ * irq_affinity_offline_cpu - Check affinity for managed interrupts
+ * to prevent the unavailability caused by taking the last CPU in the
+ * affinity offline.
+ * @cpu:	Upcoming CPU for which interrupts should be checked
+ */
+int irq_affinity_offline_cpu(unsigned int cpu)
+{
+	struct irq_desc *desc;
+	unsigned int irq;
+	int ret = 0;
+
+	irq_lock_sparse();
+	for_each_active_irq(irq) {
+		desc = irq_to_desc(irq);
+		raw_spin_lock_irq(&desc->lock);
+		ret = irq_check_affinity_of_irq(desc, cpu);
+		raw_spin_unlock_irq(&desc->lock);
+		if (ret < 0)
+			break;
+	}
+	irq_unlock_sparse();
+
+	return ret;
+}
--
2.43.5

^ permalink raw reply related [flat|nested] 11+ messages in thread
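The check in irq_check_affinity_of_irq() can be read as a mask operation: the
offline request is refused only when the interrupt is managed, the outgoing
CPU is in its affinity, and no other CPU in that affinity is online. A rough
userspace sketch of that decision follows, assuming at most one machine word
of CPUs. The names cpumask_bits, mask_test_cpu() and check_last_online_cpu()
are hypothetical, and the desc->action and irq chip checks from the patch are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* One bit per CPU: a hypothetical stand-in for struct cpumask. */
typedef unsigned long cpumask_bits;

/* True if @cpu is set in @mask. Assumes cpu is below the word size in bits. */
bool mask_test_cpu(unsigned int cpu, cpumask_bits mask)
{
	return (mask >> cpu) & 1UL;
}

/*
 * Return -EBUSY if taking @cpu offline would leave @affinity without any
 * online CPU, and 0 otherwise. Non-managed interrupts, and interrupts whose
 * affinity does not contain @cpu, are never a reason to refuse.
 */
int check_last_online_cpu(bool is_managed, cpumask_bits affinity,
			  cpumask_bits online, unsigned int cpu)
{
	cpumask_bits others;

	if (!is_managed || !mask_test_cpu(cpu, affinity))
		return 0;

	/* Online CPUs in the affinity, excluding the outgoing one. */
	others = affinity & online & ~(1UL << cpu);

	return others ? 0 : -EBUSY;
}
```

For example, with affinity 0x0C (CPUs 2-3) and online mask 0x07 (CPUs 0-2),
offlining CPU 2 yields -EBUSY, while the same request with CPU 3 still online
yields 0. In the patch, irq_affinity_offline_cpu() runs the per-interrupt
check over all active interrupts and stops at the first negative result.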
end of thread, other threads:[~2024-11-02 16:41 UTC | newest]
Thread overview: 11+ messages
2024-10-31 7:46 [PATCH RFC v1 0/2] Support for limiting the number of managed interrupts on every node per allocation 'Guanjun'
2024-10-31 7:46 ` [PATCH RFC v1 1/2] genirq/affinity: add support for limiting managed interrupts 'Guanjun'
2024-10-31 10:35 ` Thomas Gleixner
2024-10-31 10:50 ` Ming Lei
[not found] ` <43FD1116-C188-4729-A3AB-C2A0F5A087D2@linux.alibaba.com>
2024-11-01 3:34 ` Jason Wang
2024-11-01 3:03 ` mapicccy
2024-11-01 23:37 ` Thomas Gleixner
2024-11-01 7:06 ` Jiri Slaby
2024-11-02 16:30 ` kernel test robot
2024-11-02 16:41 ` kernel test robot
2024-10-31 7:46 ` [PATCH RFC v1 2/2] genirq/cpuhotplug: Handle managed IRQs when the last CPU hotplug out in the affinity 'Guanjun'