* [PATCH RESEND] x86/irq: assign vectors from numa_node
@ 2010-12-09 22:48 Arthur Kepner
2010-12-09 23:41 ` Jesper Juhl
2010-12-10 1:18 ` Thomas Gleixner
0 siblings, 2 replies; 3+ messages in thread
From: Arthur Kepner @ 2010-12-09 22:48 UTC (permalink / raw)
To: linux-kernel; +Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86
(Resending with expanded cc list.)

Several drivers (e.g., mlx4_core) do something similar to:

    err = pci_enable_msix(pdev, entries, num_possible_cpus());

which takes us down this code path:

    pci_enable_msix
      native_setup_msi_irqs
        create_irq_nr
          __assign_irq_vector

__assign_irq_vector() preferentially uses vectors from low-numbered
CPUs. On a system with a large number (>256) of CPUs this can result
in a low-numbered CPU running out of vectors, and subsequent attempts
to assign an interrupt to that CPU will fail.
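
To make the failure mode concrete, here is a small user-space model (not
kernel code). It assumes a pure first-fit policy over CPUs, which is a
simplification of what __assign_irq_vector() does. The names sim_used,
sim_assign_first_fit and sim_assign_on_cpu, and the tiny sizes, are
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <string.h>

#define SIM_NR_CPUS         4	/* hypothetical: tiny system for illustration */
#define SIM_VECTORS_PER_CPU 3	/* hypothetical: real CPUs have far more vectors */

/* Number of vectors already handed out on each simulated CPU. */
static int sim_used[SIM_NR_CPUS];

static void sim_reset(void)
{
	memset(sim_used, 0, sizeof(sim_used));
}

/*
 * First-fit model: walk CPUs from 0 upward and take a vector from the
 * first CPU that still has a free slot. Returns the CPU, or -1 if every
 * CPU is full.
 */
static int sim_assign_first_fit(void)
{
	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++) {
		if (sim_used[cpu] < SIM_VECTORS_PER_CPU) {
			sim_used[cpu]++;
			return cpu;
		}
	}
	return -1;
}

/* Request a vector on one specific CPU; fails if that CPU is full. */
static int sim_assign_on_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	if (sim_used[cpu] >= SIM_VECTORS_PER_CPU)
		return -1;
	sim_used[cpu]++;
	return cpu;
}
```

In this model, three first-fit requests all land on CPU 0, after which a
request pinned to CPU 0 fails even though CPUs 1 to 3 are still empty.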
The following patch prefers vectors from the node associated with the
device (if the device is associated with a node). This should make it
far less likely that a single CPU's vectors will be exhausted.
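
The selection step can be sketched in user space as follows. This is only
a model of the idea, assuming CPU sets are plain bitmasks and per-CPU
usage is a simple counter array; demo_pick_least_loaded and DEMO_NR_CPUS
are hypothetical names, not kernel interfaces.

```c
#include <assert.h>

#define DEMO_NR_CPUS 8	/* hypothetical: small enough for an unsigned bitmask */

/*
 * Among the CPUs present in both node_mask and allowed_mask, return the
 * one with the fewest vectors in use, or -1 if there is no candidate
 * with a free vector. 'capacity' is the number of vectors per CPU.
 */
static int demo_pick_least_loaded(const int used[DEMO_NR_CPUS],
				  unsigned int node_mask,
				  unsigned int allowed_mask,
				  int capacity)
{
	unsigned int candidates = node_mask & allowed_mask;
	int best_cpu = -1;
	int min_used = capacity;	/* a full CPU can never beat this */

	assert(capacity > 0);

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (!(candidates & (1u << cpu)))
			continue;
		if (used[cpu] < min_used) {
			min_used = used[cpu];
			best_cpu = cpu;
		}
	}
	return best_cpu;
}
```

When the function returns -1 (node full, or no overlap between the node
and the allowed CPUs), a caller would fall back to the node-agnostic
assignment.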
Signed-off-by: Arthur Kepner <akepner@sgi.com>
---
io_apic.c | 81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 77 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7cc0a72..af5f9d8 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1117,6 +1117,49 @@ next:
 	return err;
 }
+static int
+__assign_irq_vector_node(int irq, struct irq_cfg *cfg,
+			 const struct cpumask *mask, int node)
+{
+	int err = -EAGAIN;
+	int cpu, best_cpu = -1, min_vector_count = NR_VECTORS;
+
+	for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
+		/* find the 'best' CPU to take this vector -
+		 * the one with the fewest assigned vectors is
+		 * considered 'best' */
+		int i, vector_count = 0;
+
+		if (!cpu_online(cpu))
+			continue;
+
+		for (i = FIRST_EXTERNAL_VECTOR + VECTOR_OFFSET_START;
+		     i < NR_VECTORS ; i++)
+			if (per_cpu(vector_irq, cpu)[i] != -1)
+				vector_count++;
+
+		if (vector_count < min_vector_count) {
+			min_vector_count = vector_count;
+			best_cpu = cpu;
+		}
+	}
+
+	if (best_cpu >= 0) {
+		cpumask_var_t tmp_mask;
+
+		if (!alloc_cpumask_var(&tmp_mask, GFP_ATOMIC))
+			return -ENOMEM;
+
+		cpumask_clear(tmp_mask);
+		cpumask_set_cpu(best_cpu, tmp_mask);
+		err = __assign_irq_vector(irq, cfg, tmp_mask);
+
+		free_cpumask_var(tmp_mask);
+	}
+
+	return err;
+}
+
 int assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 {
 	int err;
@@ -1128,6 +1171,39 @@ int assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
 	return err;
 }
+static int
+assign_irq_vector_node(int irq, struct irq_cfg *cfg,
+		       const struct cpumask *mask, int node)
+{
+	int err;
+	unsigned long flags;
+
+	if (node == NUMA_NO_NODE)
+		return assign_irq_vector(irq, cfg, mask);
+
+	raw_spin_lock_irqsave(&vector_lock, flags);
+	err = __assign_irq_vector_node(irq, cfg, mask, node);
+	raw_spin_unlock_irqrestore(&vector_lock, flags);
+
+	if (err != 0)
+		/* uh oh - try again w/o specifying a node */
+		return assign_irq_vector(irq, cfg, mask);
+	else {
+		/* and set the affinity mask so that only
+		 * CPUs on 'node' will be used */
+		struct irq_desc *desc = irq_to_desc(irq);
+		unsigned long flags;
+
+		raw_spin_lock_irqsave(&desc->lock, flags);
+		cpumask_and(desc->irq_data.affinity, cpu_online_mask,
+			    cpumask_of_node(node));
+		desc->status |= IRQ_AFFINITY_SET;
+		raw_spin_unlock_irqrestore(&desc->lock, flags);
+	}
+
+	return err;
+}
+
 static void __clear_irq_vector(int irq, struct irq_cfg *cfg)
 {
 	int cpu, vector;
@@ -3057,7 +3133,6 @@ device_initcall(ioapic_init_sysfs);
 unsigned int create_irq_nr(unsigned int from, int node)
 {
 	struct irq_cfg *cfg;
-	unsigned long flags;
 	unsigned int ret = 0;
 	int irq;
@@ -3073,10 +3148,8 @@ unsigned int create_irq_nr(unsigned int from, int node)
 		return 0;
 	}
-	raw_spin_lock_irqsave(&vector_lock, flags);
-	if (!__assign_irq_vector(irq, cfg, apic->target_cpus()))
+	if (!assign_irq_vector_node(irq, cfg, apic->target_cpus(), node))
 		ret = irq;
-	raw_spin_unlock_irqrestore(&vector_lock, flags);
 	if (ret) {
 		set_irq_chip_data(irq, cfg);
* Re: [PATCH RESEND] x86/irq: assign vectors from numa_node
From: Jesper Juhl @ 2010-12-09 23:41 UTC (permalink / raw)
To: Arthur Kepner
Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86
On Thu, 9 Dec 2010, Arthur Kepner wrote:
>
> (Resending with expanded cc list.)
>
> Several drivers (e.g., mlx4_core) do something similar to:
>
> err = pci_enable_msix(pdev, entries, num_possible_cpus());
>
> which takes us down this code path:
>
> pci_enable_msix
> native_setup_msi_irqs
> create_irq_nr
> __assign_irq_vector
>
> __assign_irq_vector() preferentially uses vectors from low-numbered
> CPUs. On a system with a large number (>256) of CPUs this can result
> in a low-numbered CPU running out of vectors, and subsequent attempts
> to assign an interrupt to that CPU will fail.
>
> The following patch prefers vectors from the node associated with the
> device (if the device is associated with a node). This should make it
> far less likely that a single CPU's vectors will be exhausted.
>
I'm not going to pretend that I know this code *at all*, but what you
wrote made me think, and I want to share my thoughts. Perhaps they are
useful, perhaps not.
Assigning to the CPU associated with a device sounds sane and sounds like
it will distribute things more evenly. So far so good. But I can't help
wondering whether it would also be sane (in addition to this) to simply
fall back to the next higher CPU if the chosen one is exhausted (or wrap
around to the first one if we are already at the highest one), so that we
only fail completely if *all* CPUs are exhausted?
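
A minimal sketch of that fallback, written as a user-space model rather
than kernel code. It assumes per-CPU usage is a simple counter array with
a fixed capacity per CPU; wrap_pick_cpu and WRAP_NR_CPUS are hypothetical
names used only for illustration.

```c
#include <assert.h>

#define WRAP_NR_CPUS 4	/* hypothetical: tiny system for illustration */

/*
 * Start at the preferred CPU and probe each CPU once, moving to the next
 * higher CPU and wrapping to CPU 0 after the last one. Returns the first
 * CPU with a free vector, or -1 only if every CPU is exhausted.
 */
static int wrap_pick_cpu(const int used[WRAP_NR_CPUS], int capacity,
			 int preferred)
{
	assert(preferred >= 0 && preferred < WRAP_NR_CPUS);

	for (int step = 0; step < WRAP_NR_CPUS; step++) {
		int cpu = (preferred + step) % WRAP_NR_CPUS;

		if (used[cpu] < capacity)
			return cpu;
	}
	return -1;
}
```

With a capacity of 2 and usage {0, 2, 2, 2}, a request that prefers CPU 2
skips the full CPUs 2 and 3 and wraps around to CPU 0.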
--
Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please.
* Re: [PATCH RESEND] x86/irq: assign vectors from numa_node
From: Thomas Gleixner @ 2010-12-10 1:18 UTC (permalink / raw)
To: Arthur Kepner; +Cc: linux-kernel, Ingo Molnar, H. Peter Anvin, x86
On Thu, 9 Dec 2010, Arthur Kepner wrote:
It's in my list of patches to go through already.
> (Resending with expanded cc list.)
FYI, you expanded the cc list only in your mail client. In reality it
grew by exactly zero: x86@kernel.org is a mail exploder to hpa, mingo
and tglx :)
Thanks,
tglx