public inbox for linux-kernel@vger.kernel.org
* [PATCH] x86: Add check for number of available vectors before CPU down [v2]
@ 2013-12-18 19:29 Prarit Bhargava
  2013-12-18 19:50 ` Tony Luck
  0 siblings, 1 reply; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-18 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Prarit Bhargava, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86, Michel Lespinasse, Andi Kleen, Seiji Aguchi, Yang Zhang,
	Paul Gortmaker, janet.morgan, tony.luck, fenghua.yu

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64791

When a cpu is downed on a system, the irqs on the cpu are reassigned to other cpus.  It is possible, however, that when a cpu is downed there aren't enough free vectors on the remaining cpus to account for the vectors from the cpu that is being downed.

This results in an interesting "overflow" condition where irqs are "assigned" to a CPU but are not handled.

For example, when downing cpus on a 64 logical processor system:

<snip>
[  232.021745] smpboot: CPU 61 is now offline
[  238.480275] smpboot: CPU 62 is now offline
[  245.991080] ------------[ cut here ]------------
[  245.996270] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x246/0x250()
[  246.005688] NETDEV WATCHDOG: p786p1 (ixgbe): transmit queue 0 timed out
[  246.013070] Modules linked in: lockd sunrpc iTCO_wdt iTCO_vendor_support sb_edac ixgbe microcode e1000e pcspkr joydev edac_core lpc_ich ioatdma ptp mdio mfd_core i2c_i801 dca pps_core i2c_core wmi acpi_cpufreq isci libsas scsi_transport_sas
[  246.037633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0+ #14
[  246.044451] Hardware name: Intel Corporation S4600LH ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[  246.057371]  0000000000000009 ffff88081fa03d40 ffffffff8164fbf6 ffff88081fa0ee48
[  246.065728]  ffff88081fa03d90 ffff88081fa03d80 ffffffff81054ecc ffff88081fa13040
[  246.074073]  0000000000000000 ffff88200cce0000 0000000000000040 0000000000000000
[  246.082430] Call Trace:
[  246.085174]  <IRQ>  [<ffffffff8164fbf6>] dump_stack+0x46/0x58
[  246.091633]  [<ffffffff81054ecc>] warn_slowpath_common+0x8c/0xc0
[  246.098352]  [<ffffffff81054fb6>] warn_slowpath_fmt+0x46/0x50
[  246.104786]  [<ffffffff815710d6>] dev_watchdog+0x246/0x250
[  246.110923]  [<ffffffff81570e90>] ? dev_deactivate_queue.constprop.31+0x80/0x80
[  246.119097]  [<ffffffff8106092a>] call_timer_fn+0x3a/0x110
[  246.125224]  [<ffffffff8106280f>] ? update_process_times+0x6f/0x80
[  246.132137]  [<ffffffff81570e90>] ? dev_deactivate_queue.constprop.31+0x80/0x80
[  246.140308]  [<ffffffff81061db0>] run_timer_softirq+0x1f0/0x2a0
[  246.146933]  [<ffffffff81059a80>] __do_softirq+0xe0/0x220
[  246.152976]  [<ffffffff8165fedc>] call_softirq+0x1c/0x30
[  246.158920]  [<ffffffff810045f5>] do_softirq+0x55/0x90
[  246.164670]  [<ffffffff81059d35>] irq_exit+0xa5/0xb0
[  246.170227]  [<ffffffff8166062a>] smp_apic_timer_interrupt+0x4a/0x60
[  246.177324]  [<ffffffff8165f40a>] apic_timer_interrupt+0x6a/0x70
[  246.184041]  <EOI>  [<ffffffff81505a1b>] ? cpuidle_enter_state+0x5b/0xe0
[  246.191559]  [<ffffffff81505a17>] ? cpuidle_enter_state+0x57/0xe0
[  246.198374]  [<ffffffff81505b5d>] cpuidle_idle_call+0xbd/0x200
[  246.204900]  [<ffffffff8100b7ae>] arch_cpu_idle+0xe/0x30
[  246.210846]  [<ffffffff810a47b0>] cpu_startup_entry+0xd0/0x250
[  246.217371]  [<ffffffff81646b47>] rest_init+0x77/0x80
[  246.223028]  [<ffffffff81d09e8e>] start_kernel+0x3ee/0x3fb
[  246.229165]  [<ffffffff81d0989f>] ? repair_env_string+0x5e/0x5e
[  246.235787]  [<ffffffff81d095a5>] x86_64_start_reservations+0x2a/0x2c
[  246.242990]  [<ffffffff81d0969f>] x86_64_start_kernel+0xf8/0xfc
[  246.249610] ---[ end trace fb74fdef54d79039 ]---
[  246.254807] ixgbe 0000:c2:00.0 p786p1: initiating reset due to tx timeout
[  246.262489] ixgbe 0000:c2:00.0 p786p1: Reset adapter
Last login: Mon Nov 11 08:35:14 from 10.18.17.119
[root@(none) ~]# [  246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
[  249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[  246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
[  249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

(last lines keep repeating.  ixgbe driver is dead until module reload.)

If the downed cpu has more vectors than are free on the remaining cpus on the
system, it is possible that some vectors are "orphaned" even though they are
assigned to a cpu.  In this case, since the ixgbe driver had a watchdog, the
watchdog fired and notified that something was wrong.
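
To put rough numbers on this (purely illustrative, not part of the patch):
with FIRST_EXTERNAL_VECTOR = 0x20 and NR_VECTORS = 0x100, each cpu has at
most 224 externally usable vector slots, so the overflow is easy to model
in a few lines of user-space C (all of the loads below are made up):

/* back-of-the-envelope model of the vector overflow, made-up numbers */
#include <stdio.h>

int main(void)
{
        int slots_per_cpu = 0x100 - 0x20; /* NR_VECTORS - FIRST_EXTERNAL_VECTOR */
        int remaining_cpus = 1;           /* cpus left online after the down */
        int used_on_dying_cpu = 200;      /* hypothetical vector load */
        int used_per_remaining_cpu = 180; /* hypothetical vector load */
        int free_elsewhere = remaining_cpus *
                             (slots_per_cpu - used_per_remaining_cpu);

        /* every vector on the dying cpu must find a free slot elsewhere */
        printf("need %d, have %d -> %s\n", used_on_dying_cpu, free_elsewhere,
               used_on_dying_cpu <= free_elsewhere ? "ok" : "orphaned irqs");
        return 0;
}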

This patch adds a function, check_vectors(), which compares the number of
vectors in use on the CPU going down with the number of free vectors
available on the remaining online CPUs.  If there aren't enough free
vectors for the CPU to go down, an error is returned and propagated back
to userspace.

v2: Do not look at percpu irqs

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Michel Lespinasse <walken@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: janet.morgan@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
---
 arch/x86/include/asm/irq.h |    1 +
 arch/x86/kernel/irq.c      |   41 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/smpboot.c  |    6 ++++++
 3 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index 0ea10f27..dfd7372 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -25,6 +25,7 @@ extern void irq_ctx_init(int cpu);
 
 #ifdef CONFIG_HOTPLUG_CPU
 #include <linux/cpumask.h>
+extern int check_vectors(void);
 extern void fixup_irqs(void);
 extern void irq_force_complete_move(int);
 #endif
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 22d0687..582639f 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -262,6 +262,47 @@ __visible void smp_trace_x86_platform_ipi(struct pt_regs *regs)
 EXPORT_SYMBOL_GPL(vector_used_by_percpu_irq);
 
 #ifdef CONFIG_HOTPLUG_CPU
+/*
+ * This cpu is going to be removed and its vectors migrated to the remaining
+ * online cpus.  Check to see if there are enough vectors in the remaining cpus.
+ */
+int check_vectors(void)
+{
+	int irq, cpu;
+	unsigned int vector, this_count, count;
+	struct irq_desc *desc;
+	struct irq_data *data;
+
+	this_count = 0;
+	for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
+		irq = __this_cpu_read(vector_irq[vector]);
+		if (irq >= 0) {
+			desc = irq_to_desc(irq);
+			data = irq_desc_get_irq_data(desc);
+			if (!irqd_is_per_cpu(data))
+				this_count++;
+		}
+	}
+
+	count = 0;
+	for_each_online_cpu(cpu) {
+		if (cpu == smp_processor_id())
+			continue;
+		for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS;
+		     vector++) {
+			if (per_cpu(vector_irq, cpu)[vector] < 0)
+				count++;
+		}
+	}
+
+	if (count < this_count) {
+		pr_warn("CPU %d disable failed: CPU has %u vectors assigned and there are only %u available.\n",
+			smp_processor_id(), this_count, count);
+		return -ERANGE;
+	}
+	return 0;
+}
+
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
 void fixup_irqs(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85dc05a..7ac6654 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1312,6 +1312,12 @@ void cpu_disable_common(void)
 
 int native_cpu_disable(void)
 {
+	int ret;
+
+	ret = check_vectors();
+	if (ret)
+		return ret;
+
 	clear_local_APIC();
 
 	cpu_disable_common();
-- 
1.7.9.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-18 19:29 [PATCH] x86: Add check for number of available vectors before CPU down [v2] Prarit Bhargava
@ 2013-12-18 19:50 ` Tony Luck
  2013-12-19 18:05   ` Tony Luck
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Luck @ 2013-12-18 19:50 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86-ML, Michel Lespinasse, Andi Kleen,
	Seiji Aguchi, Yang Zhang, Paul Gortmaker, janet.morgan,
	Yu, Fenghua

On Wed, Dec 18, 2013 at 11:29 AM, Prarit Bhargava <prarit@redhat.com> wrote:
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64791
>
> When a cpu is downed on a system, the irqs on the cpu are reassigned to other cpus.  It is possible, however, that when a cpu is downed there aren't enough free vectors on the remaining cpus to account for the vectors from the cpu that is being downed.
>
> This results in an interesting "overflow" condition where irqs are "assigned" to a CPU but are not handled.

Looks good to me.

Acked-by: Tony Luck <tony.luck@intel.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-18 19:50 ` Tony Luck
@ 2013-12-19 18:05   ` Tony Luck
  2013-12-19 18:11     ` Prarit Bhargava
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Luck @ 2013-12-19 18:05 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86-ML, Michel Lespinasse, Andi Kleen,
	Seiji Aguchi, Yang Zhang, Paul Gortmaker, janet.morgan,
	Yu, Fenghua

On Wed, Dec 18, 2013 at 11:50 AM, Tony Luck <tony.luck@intel.com> wrote:
> Looks good to me.

Though now I've been confused by an offline question about affinity.

Suppose we have some interrupt that has affinity to multiple cpus. E.g.
(real example from one of my machines):

# cat /proc/irq/94/smp_affinity_list
26,54

Now if I want to take either cpu26 or cpu54 offline - I'm guessing that I don't
really need to find a new home for vector 94 - because the other one of that
pair already has that set up.  But your check_vectors code doesn't look like
it accounts for that - if we take cpu26 offline - it would see that cpu54
doesn't have 94 free - but doesn't check that it is for the same interrupt.

But I may be mixing "vectors" and "irqs" here.

-Tony

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-19 18:05   ` Tony Luck
@ 2013-12-19 18:11     ` Prarit Bhargava
  2013-12-20  7:18       ` Chen, Gong
  2013-12-20  9:41       ` rui wang
  0 siblings, 2 replies; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-19 18:11 UTC (permalink / raw)
  To: Tony Luck
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin, X86-ML, Michel Lespinasse, Andi Kleen,
	Seiji Aguchi, Yang Zhang, Paul Gortmaker, janet.morgan,
	Yu, Fenghua



On 12/19/2013 01:05 PM, Tony Luck wrote:
> On Wed, Dec 18, 2013 at 11:50 AM, Tony Luck <tony.luck@intel.com> wrote:
>> Looks good to me.
> 
> Though now I've been confused by an offline question about affinity.

Heh :)  I'm pursuing it now.  Rui has asked a pretty good question that I don't
know the answer to off the top of my head.  I'm still looking at the code.

> 
> Suppose we have some interrupt that has affinity to multiple cpus. E.g.
> (real example from one of my machines):
> 
> # cat /proc/irq/94/smp_affinity_list
> 26,54
> 
> Now if I want to take either cpu26 or cpu54 offline - I'm guessing that I don't
> really need to find a new home for vector 94 - because the other one of that
> pair already has that set up.  But your check_vectors code doesn't look like
> it accounts for that - if we take cpu26 offline - it would see that
> cpu54 doesn't
> have 94 free - but doesn't check that it is for the same interrupt.
> 
> But I may be mixing "vectors" and "irqs" here.

Yep.  The question really is this: is the irq mapped to a single vector or
multiple vectors? (I think)

P.

> 
> -Tony

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-19 18:11     ` Prarit Bhargava
@ 2013-12-20  7:18       ` Chen, Gong
  2013-12-20  9:41       ` rui wang
  1 sibling, 0 replies; 18+ messages in thread
From: Chen, Gong @ 2013-12-20  7:18 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua

[-- Attachment #1: Type: text/plain, Size: 589 bytes --]

On Thu, Dec 19, 2013 at 01:11:32PM -0500, Prarit Bhargava wrote:
> 
> Yep.  The question really is this: is the irq mapped to a single vector or
> multiple vectors? (I think)
> 
> P.
> 
Yes, this is the original point I wanted to make in the bugzilla. In an
ideal environment, without irq balancing, all vector_irq arrays would hold
the same values, say, 0xffffffff. So when one CPU is off-lined, your patch
would conclude there is no room for the to-be-migrated irqs, but in fact
they are shared across different CPUs. So if we can answer this question,
the answer will be clear :-).

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-19 18:11     ` Prarit Bhargava
  2013-12-20  7:18       ` Chen, Gong
@ 2013-12-20  9:41       ` rui wang
  2013-12-20 10:49         ` Prarit Bhargava
  2013-12-28 17:10         ` Prarit Bhargava
  1 sibling, 2 replies; 18+ messages in thread
From: rui wang @ 2013-12-20  9:41 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua

On 12/20/13, Prarit Bhargava <prarit@redhat.com> wrote:
>
>
> On 12/19/2013 01:05 PM, Tony Luck wrote:
>> On Wed, Dec 18, 2013 at 11:50 AM, Tony Luck <tony.luck@intel.com> wrote:
>>> Looks good to me.
>>
>> Though now I've been confused by an offline question about affinity.
>
> Heh :)  I'm pursuing it now.  Rui has asked a pretty good question that I
> don't
> know the answer to off the top of my head.  I'm still looking at the code.
>
>>
>> Suppose we have some interrupt that has affinity to multiple cpus. E.g.
>> (real example from one of my machines):
>>
>> # cat /proc/irq/94/smp_affinity_list
>> 26,54
>>
>> Now if I want to take either cpu26 or cpu54 offline - I'm guessing that I
>> don't
>> really need to find a new home for vector 94 - because the other one of
>> that
>> pair already has that set up.  But your check_vectors code doesn't look
>> like
>> it accounts for that - if we take cpu26 offline - it would see that
>> cpu54 doesn't
>> have 94 free - but doesn't check that it is for the same interrupt.
>>
>> But I may be mixing "vectors" and "irqs" here.
>
> Yep.  The question really is this: is the irq mapped to a single vector or
> multiple vectors? (I think)
>

The vector number for an irq is programmed in the LSB of the IOAPIC
IRTE (or MSI data register in the case of MSI/MSIx). So there can be
only one vector number (although multiple CPUs can be specified
through DM). An MSI-capable device can dynamically change the lower
few bits in the LSB to signal multiple interrupts with a contiguous
range of vectors in powers of 2, but each of these vectors is treated
as a separate IRQ. i.e. each of them has a separate irq desc, or a
separate line in the /proc/interrupts file. This patch shows the MSI
irq allocation in detail:
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=51906e779f2b13b38f8153774c4c7163d412ffd9
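
To make the power-of-2 constraint concrete, here is a minimal sketch (my
own illustration with hypothetical names, not code from that patch): the
kernel programs only the base vector of a multi-MSI block, and the device
ORs the sub-interrupt index into the low bits, so the block size must be
a power of 2 and the base vector aligned to it:

/*
 * Hypothetical helper: the vector that sub-interrupt "index" of a
 * multi-MSI block lands on.  "nvec" must be a power of 2 and "base"
 * aligned to it, so ORing in the index never carries into the upper
 * bits of the vector number.
 */
static unsigned int msi_block_vector(unsigned int base,
                                     unsigned int nvec,
                                     unsigned int index)
{
        BUG_ON(nvec & (nvec - 1));      /* power of 2 */
        BUG_ON(base & (nvec - 1));      /* aligned to block size */
        BUG_ON(index >= nvec);
        /* each distinct result is a separate IRQ with its own desc */
        return base | index;
}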

Thanks
Rui


> P.
>
>>
>> -Tony
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-20  9:41       ` rui wang
@ 2013-12-20 10:49         ` Prarit Bhargava
  2013-12-28 17:10         ` Prarit Bhargava
  1 sibling, 0 replies; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-20 10:49 UTC (permalink / raw)
  To: rui wang
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua



On 12/20/2013 04:41 AM, rui wang wrote:
> On 12/20/13, Prarit Bhargava <prarit@redhat.com> wrote:
>>
>>
>> On 12/19/2013 01:05 PM, Tony Luck wrote:
>>> On Wed, Dec 18, 2013 at 11:50 AM, Tony Luck <tony.luck@intel.com> wrote:
>>>> Looks good to me.
>>>
>>> Though now I've been confused by an offline question about affinity.
>>
>> Heh :)  I'm pursuing it now.  Rui has asked a pretty good question that I
>> don't
>> know the answer to off the top of my head.  I'm still looking at the code.
>>
>>>
>>> Suppose we have some interrupt that has affinity to multiple cpus. E.g.
>>> (real example from one of my machines):
>>>
>>> # cat /proc/irq/94/smp_affinity_list
>>> 26,54
>>>
>>> Now if I want to take either cpu26 or cpu54 offline - I'm guessing that I
>>> don't
>>> really need to find a new home for vector 94 - because the other one of
>>> that
>>> pair already has that set up.  But your check_vectors code doesn't look
>>> like
>>> it accounts for that - if we take cpu26 offline - it would see that
>>> cpu54 doesn't
>>> have 94 free - but doesn't check that it is for the same interrupt.
>>>
>>> But I may be mixing "vectors" and "irqs" here.
>>
>> Yep.  The question really is this: is the irq mapped to a single vector or
>> multiple vectors? (I think)
>>
> 
> The vector number for an irq is programmed in the LSB of the IOAPIC
> IRTE (or MSI data register in the case of MSI/MSIx). So there can be
> only one vector number (although multiple CPUs can be specified
> through DM). An MSI-capable device can dynamically change the lower
> few bits in the LSB to signal multiple interrupts with a contiguous
> range of vectors in powers of 2, but each of these vectors is treated
> as a separate IRQ. i.e. each of them has a separate irq desc, or a
> separate line in the /proc/interrupts file. This patch shows the MSI
> irq allocation in detail:
> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=51906e779f2b13b38f8153774c4c7163d412ffd9
> 

Yep, that's where I got the idea for the affinity check in my previous post.  So
far linux.git top-of-tree + my patch looks good.  I'm pulling down
linux-next.git right now ...

P.

> Thanks
> Rui
> 
> 
>> P.
>>
>>>
>>> -Tony
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-20  9:41       ` rui wang
  2013-12-20 10:49         ` Prarit Bhargava
@ 2013-12-28 17:10         ` Prarit Bhargava
  2013-12-30  7:44           ` Chen, Gong
  2013-12-30 12:56           ` rui wang
  1 sibling, 2 replies; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-28 17:10 UTC (permalink / raw)
  To: rui wang
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 12/20/2013 04:41 AM, rui wang wrote:
> On 12/20/13, Prarit Bhargava <prarit@redhat.com> wrote:
>>
>>
>> On 12/19/2013 01:05 PM, Tony Luck wrote:
>>> On Wed, Dec 18, 2013 at 11:50 AM, Tony Luck <tony.luck@intel.com> wrote:
>>>> Looks good to me.
>>>
>>> Though now I've been confused by an offline question about affinity.
>>
>> Heh :)  I'm pursuing it now.  Rui has asked a pretty good question that I
>> don't
>> know the answer to off the top of my head.  I'm still looking at the code.
>>
>>>
>>> Suppose we have some interrupt that has affinity to multiple cpus. E.g.
>>> (real example from one of my machines):
>>>
>>> # cat /proc/irq/94/smp_affinity_list
>>> 26,54
>>>
>>> Now if I want to take either cpu26 or cpu54 offline - I'm guessing that I
>>> don't
>>> really need to find a new home for vector 94 - because the other one of
>>> that
>>> pair already has that set up.  But your check_vectors code doesn't look
>>> like
>>> it accounts for that - if we take cpu26 offline - it would see that
>>> cpu54 doesn't
>>> have 94 free - but doesn't check that it is for the same interrupt.
>>>
>>> But I may be mixing "vectors" and "irqs" here.
>>
>> Yep.  The question really is this: is the irq mapped to a single vector or
>> multiple vectors? (I think)
>>
> 
> The vector number for an irq is programmed in the LSB of the IOAPIC
> IRTE (or MSI data register in the case of MSI/MSIx). So there can be
> only one vector number (although multiple CPUs can be specified
> through DM). An MSI-capable device can dynamically change the lower
> few bits in the LSB to signal multiple interrupts with a contiguous
> range of vectors in powers of 2, but each of these vectors is treated
> as a separate IRQ. i.e. each of them has a separate irq desc, or a
> separate line in the /proc/interrupts file. This patch shows the MSI
> irq allocation in detail:
> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=51906e779f2b13b38f8153774c4c7163d412ffd9
> 
> Thanks
> Rui
> 

Gong and Rui,

After looking at this in detail I realized I made a mistake in my patch by
including the check for the smp_affinity.  Simply put, it shouldn't be there
given Rui's explanation above.

So I think the patch simply needs to do:

        this_count = 0;
        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
                irq = __this_cpu_read(vector_irq[vector]);
                if (irq >= 0) {
                        desc = irq_to_desc(irq);
                        data = irq_desc_get_irq_data(desc);
                        if (irq_has_action(irq) && !irqd_is_per_cpu(data))
                                this_count++;
                }
        }

Can the two of you confirm the above is correct?  It would be greatly appreciated.

Tony, I apologize -- your comments made me think you were stating a fact and not
asking a question about the behavior of affinity.  I completely misunderstood what
you were suggesting.  I thought you were implying that the affinity "tied"
IRQ behavior together; it does not.  It is simply a suggestion of which IRQs
should be assigned to a particular CPU.  There is an expectation that the system
will attempt to honour the affinity; however, it is not as if each CPU is
assigned a separate IRQ.

P.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-28 17:10         ` Prarit Bhargava
@ 2013-12-30  7:44           ` Chen, Gong
  2013-12-30 15:09             ` Prarit Bhargava
  2013-12-30 12:56           ` rui wang
  1 sibling, 1 reply; 18+ messages in thread
From: Chen, Gong @ 2013-12-30  7:44 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: rui wang, Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua

[-- Attachment #1: Type: text/plain, Size: 2002 bytes --]

On Sat, Dec 28, 2013 at 12:10:38PM -0500, Prarit Bhargava wrote:
> Gong and Rui,
> 
> After looking at this in detail I realized I made a mistake in my patch by
> including the check for the smp_affinity.  Simply put, it shouldn't be there
> given Rui's explanation above.
> 
> So I think the patch simply needs to do:
> 
>         this_count = 0;
>         for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
>                 irq = __this_cpu_read(vector_irq[vector]);
>                 if (irq >= 0) {
>                         desc = irq_to_desc(irq);
>                         data = irq_desc_get_irq_data(desc);
>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data))
>                                 this_count++;
>                 }
>         }
> 
> Can the two of you confirm the above is correct?  It would be greatly appreciated.
> 

No, I don't think it is correct. We still need to consider smp_affinity.

fixup_irqs
        irq_set_affinity(native_ioapic_set_affinity)
                __ioapic_set_affinity
                        assign_irq_vector
                                __assign_irq_vector
                        cpu_mask_to_apicid_and
                /* now begin to set the ioapic RTE */

__assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
{
...
        apic->vector_allocation_domain(cpu, tmp_mask, mask);
...
        for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
                per_cpu(vector_irq, new_cpu)[vector] = irq;
        cfg->vector = vector;
        cpumask_copy(cfg->domain, tmp_mask);
...
}

The irq is set at the same vector in every related vector_irq, so such
an irq appears in multiple vector_irq arrays. In cpu_mask_to_apicid_and()
(e.g. x2apic_cpu_mask_to_apicid_and() for cluster mode), the apic is
updated according to the new mask. That's why I think this kind of
interrupt should be bypassed.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-28 17:10         ` Prarit Bhargava
  2013-12-30  7:44           ` Chen, Gong
@ 2013-12-30 12:56           ` rui wang
  2013-12-30 15:08             ` Prarit Bhargava
  2013-12-30 17:22             ` Prarit Bhargava
  1 sibling, 2 replies; 18+ messages in thread
From: rui wang @ 2013-12-30 12:56 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong

On 12/29/13, Prarit Bhargava <prarit@redhat.com> wrote:
>
>
> On 12/20/2013 04:41 AM, rui wang wrote:
<<snip>>
>> The vector number for an irq is programmed in the LSB of the IOAPIC
>> IRTE (or MSI data register in the case of MSI/MSIx). So there can be
>> only one vector number (although multiple CPUs can be specified
>> through DM). An MSI-capable device can dynamically change the lower
>> few bits in the LSB to signal multiple interrupts with a contiguous
>> range of vectors in powers of 2, but each of these vectors is treated
>> as a separate IRQ. i.e. each of them has a separate irq desc, or a
>> separate line in the /proc/interrupts file. This patch shows the MSI
>> irq allocation in detail:
>> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=51906e779f2b13b38f8153774c4c7163d412ffd9
>>
>> Thanks
>> Rui
>>
>
> Gong and Rui,
>
> After looking at this in detail I realized I made a mistake in my patch by
> including the check for the smp_affinity.  Simply put, it shouldn't be
> there
> given Rui's explanation above.
>
> So I think the patch simply needs to do:
>
>         this_count = 0;
>         for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
> {
>                 irq = __this_cpu_read(vector_irq[vector]);
>                 if (irq >= 0) {
>                         desc = irq_to_desc(irq);
>                         data = irq_desc_get_irq_data(desc);
>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data))
>                                 this_count++;
>                 }
>         }
>
> Can the two of you confirm the above is correct?  It would be greatly
> appreciated.

An irq can be mapped to only one vector number, but can have multiple
destination CPUs. i.e. the same irq/vector can appear on multiple
CPUs' vector_irq[]. So checking data->affinity is necessary I think.
But notice that data->affinity is updated in chip->irq_set_affinity()
inside fixup_irqs(), while cpu_online_mask is updated in
remove_cpu_from_maps() inside cpu_disable_common(). They are updated
in different places. So the algorithm to check them against each other
should be different, depending on where you put the check_vectors().
That's my understanding.
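
A toy user-space model of what I mean (plain arrays standing in for the
percpu vector_irq machinery, made-up irq/vector numbers) shows how a walk
over only the dying cpu's table overcounts:

#include <stdio.h>

#define NR_CPUS    4
#define NR_VECTORS 256

/* vector_irq[cpu][vector] == irq, or -1 if the slot is free */
static int vector_irq[NR_CPUS][NR_VECTORS];

int main(void)
{
        int cpu, vec, dying = 3, this_count = 0;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                for (vec = 0; vec < NR_VECTORS; vec++)
                        vector_irq[cpu][vec] = -1;

        /* irq 200 has affinity {2,3}: one vector number (50),
         * present in both cpus' vector_irq[] */
        vector_irq[2][50] = 200;
        vector_irq[3][50] = 200;

        /* counting every used slot on the dying cpu treats irq 200 as
         * needing a new home, although cpu 2 already carries it */
        for (vec = 0; vec < NR_VECTORS; vec++)
                if (vector_irq[dying][vec] >= 0)
                        this_count++;

        printf("naive this_count = %d, vectors actually needed = 0\n",
               this_count);
        return 0;
}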

Thanks
Rui

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-30 12:56           ` rui wang
@ 2013-12-30 15:08             ` Prarit Bhargava
  2013-12-31  2:58               ` rui wang
  2013-12-30 17:22             ` Prarit Bhargava
  1 sibling, 1 reply; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-30 15:08 UTC (permalink / raw)
  To: rui wang
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 12/30/2013 07:56 AM, rui wang wrote:

> An irq can be mapped to only one vector number, but can have multiple
> destination CPUs. i.e. the same irq/vector can appear on multiple
> CPUs' vector_irq[]. So checking data->affinity is necessary I think.

That's true Rui -- but here's what I think the scenario actually is.

Suppose we have a 4-cpu system, and we have an IRQ that is mapped to multiple
cpu's vector_irq[].  For example, we have IRQ 200 that is mapped to CPU 2
vector_irq[50], and CPU 3 vector_irq[60].

Now I 'echo 0 > /sys/devices/system/cpu/cpu3/online'.

cpu_disable is called and the kernel migrates IRQs off to other cpus.
Regardless of whether IRQ 200 is already mapped to CPU2 vector_irq[50], the mapping for
CPU 3 vector_irq[60] *must be migrated* to another CPU.  It has a valid irq
handler and the IRQ is active.  It doesn't just disappear because the CPU went down.

ie) AFAICT we should not differentiate between a multiply-mapped IRQ and a
singly-mapped IRQ when traversing the vector_irq[] for CPU 3.

I'm probably being dense on this but I'm not seeing a problem with migrating the
IRQ.

> But notice that data->affinity is updated in chip->irq_set_affinity()
> inside fixup_irqs(), while cpu_online_mask is updated in
> remove_cpu_from_maps() inside cpu_disable_common(). 

It shouldn't matter that the maps are updated in different areas during the
execution as we're in stop_machine().

> They are updated
> in different places. So the algorithm to check them against each other
> should be different, depending on where you put the check_vectors().
> That's my understanding.
> 

P.

> Thanks
> Rui

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-30  7:44           ` Chen, Gong
@ 2013-12-30 15:09             ` Prarit Bhargava
  0 siblings, 0 replies; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-30 15:09 UTC (permalink / raw)
  To: rui wang, Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua



On 12/30/2013 02:44 AM, Chen, Gong wrote:
> On Sat, Dec 28, 2013 at 12:10:38PM -0500, Prarit Bhargava wrote:
>> Gong and Rui,
>>
>> After looking at this in detail I realized I made a mistake in my patch by
>> including the check for the smp_affinity.  Simply put, it shouldn't be there
>> given Rui's explanation above.
>>
>> So I think the patch simply needs to do:
>>
>>         this_count = 0;
>>         for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
>>                 irq = __this_cpu_read(vector_irq[vector]);
>>                 if (irq >= 0) {
>>                         desc = irq_to_desc(irq);
>>                         data = irq_desc_get_irq_data(desc);
>>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data))
>>                                 this_count++;
>>                 }
>>         }
>>
>> Can the two of you confirm the above is correct?  It would be greatly appreciated.
>>
> 
> No, I don't think it is correct. We still need to consider smp_affinity.
> 
> fixup_irqs
>         irq_set_affinity(native_ioapic_set_affinity)
>                 __ioapic_set_affinity
>                         assign_irq_vector
>                                 __assign_irq_vector
>                         cpu_mask_to_apicid_and
>         /* now begin to set the ioapic RTE */
> 
> __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
> {
> ...
>         apic->vector_allocation_domain(cpu, tmp_mask, mask);
> ...
>         for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
>                 per_cpu(vector_irq, new_cpu)[vector] = irq;
>         cfg->vector = vector;
>         cpumask_copy(cfg->domain, tmp_mask);
> ...
> }
> 
> The irq is set at the same vector in every related vector_irq, so such
> an irq appears in multiple vector_irq arrays. In cpu_mask_to_apicid_and()
> (e.g. x2apic_cpu_mask_to_apicid_and() for cluster mode), the apic is
> updated according to the new mask. That's why I think this kind of
> interrupt should be bypassed.

Hmm ... okay.  I'll take a closer look at this.

Thanks for the additional information.

P.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-30 12:56           ` rui wang
  2013-12-30 15:08             ` Prarit Bhargava
@ 2013-12-30 17:22             ` Prarit Bhargava
  1 sibling, 0 replies; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-30 17:22 UTC (permalink / raw)
  To: rui wang
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 12/30/2013 07:56 AM, rui wang wrote:
> On 12/29/13, Prarit Bhargava <prarit@redhat.com> wrote:
>>
>>
>> On 12/20/2013 04:41 AM, rui wang wrote:
> <<snip>>
>>> The vector number for an irq is programmed in the LSB of the IOAPIC
>>> IRTE (or MSI data register in the case of MSI/MSIx). So there can be
>>> only one vector number (although multiple CPUs can be specified
>>> through DM). An MSI-capable device can dynamically change the lower
>>> few bits in the LSB to signal multiple interrupts with a contiguous
>>> range of vectors in powers of 2, but each of these vectors is treated
>>> as a separate IRQ. i.e. each of them has a separate irq desc, or a
>>> separate line in the /proc/interrupts file. This patch shows the MSI
>>> irq allocation in detail:
>>> http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=51906e779f2b13b38f8153774c4c7163d412ffd9
>>>
>>> Thanks
>>> Rui
>>>
>>
>> Gong and Rui,
>>
>> After looking at this in detail I realized I made a mistake in my patch by
>> including the check for the smp_affinity.  Simply put, it shouldn't be
>> there
>> given Rui's explanation above.
>>
>> So I think the patch simply needs to do:
>>
>>         this_count = 0;
>>         for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
>> {
>>                 irq = __this_cpu_read(vector_irq[vector]);
>>                 if (irq >= 0) {
>>                         desc = irq_to_desc(irq);
>>                         data = irq_desc_get_irq_data(desc);
>>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data))
>>                                 this_count++;
>>                 }
>>         }
>>
>> Can the two of you confirm the above is correct?  It would be greatly
>> appreciated.
> 
> An irq can be mapped to only one vector number, but can have multiple
> destination CPUs. i.e. the same irq/vector can appear on multiple
> CPUs' vector_irq[]. So checking data->affinity is necessary I think.
> But notice that data->affinity is updated in chip->irq_set_affinity()
> inside fixup_irqs(), while cpu_online_mask is updated in
> remove_cpu_from_maps() inside cpu_disable_common(). They are updated
> in different places. So the algorithm to check them against each other
> should be different, depending on where you put the check_vectors().
> That's my understanding.


Okay, so the big issue is that we need to do the calculation without this cpu,
so I think this works (sorry for the cut-and-paste)

int check_irq_vectors_for_cpu_disable(void)
{
        int irq, cpu;
        unsigned int vector, this_count, count;
        struct irq_desc *desc;
        struct irq_data *data;
        struct cpumask online_new; /* cpu_online_mask - this_cpu */
        struct cpumask affinity_new; /* affinity - this_cpu */

        cpumask_copy(&online_new, cpu_online_mask);
        cpu_clear(smp_processor_id(), online_new);

        this_count = 0;
        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
                irq = __this_cpu_read(vector_irq[vector]);
                if (irq >= 0) {
                        desc = irq_to_desc(irq);
                        data = irq_desc_get_irq_data(desc);
                        cpumask_copy(&affinity_new, data->affinity);
                        cpu_clear(smp_processor_id(), affinity_new);
                        if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
                            !cpumask_subset(&affinity_new, &online_new) &&
                            !cpumask_empty(&affinity_new))
                                this_count++;
                }
        }

...

If I go back to the various examples this appears to work.  For example, your
previous case was: all cpus are online, CPU 1 goes down, and we have an IRQ with
affinity for CPUs (1,2).  We skip this IRQ, which is correct.

And if we have another IRQ with affinity of only CPU 1, we will not skip this
IRQ, which is also correct.

I've tried other examples and they appear to work AFAICT.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-30 15:08             ` Prarit Bhargava
@ 2013-12-31  2:58               ` rui wang
  2013-12-31 21:22                 ` Prarit Bhargava
  0 siblings, 1 reply; 18+ messages in thread
From: rui wang @ 2013-12-31  2:58 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong

On 12/30/13, Prarit Bhargava <prarit@redhat.com> wrote:
>
>
> On 12/30/2013 07:56 AM, rui wang wrote:
>
>> An irq can be mapped to only one vector number, but can have multiple
>> destination CPUs. i.e. the same irq/vector can appear on multiple
>> CPUs' vector_irq[]. So checking data->affinity is necessary I think.
>
> That's true Rui -- but here's what I think the scenario actually is.
>
> Suppose we have a 4-cpu system, and we have an IRQ that is mapped to
> multiple
> cpu's vector_irq[].  For example, we have IRQ 200 that is mapped to CPU 2
> vector_irq[50], and CPU 3 vector_irq[60].
>

This should not happen. There's only one LSB to fill the vector number
in the IRTE. So it is either 50 or 60, but not both, for an irq.

...
> Okay, so the big issue is that we need to do the calculation without this cpu,

Yes, if it's done before calling cpu_disable_common().

> so I think this works (sorry for the cut-and-paste)
>
> int check_irq_vectors_for_cpu_disable(void)
> {
>        int irq, cpu;
>        unsigned int vector, this_count, count;
>        struct irq_desc *desc;
>        struct irq_data *data;
>        struct cpumask online_new; /* cpu_online_mask - this_cpu */
>        struct cpumask affinity_new; /* affinity - this_cpu */
>
>        cpumask_copy(&online_new, cpu_online_mask);
>        cpu_clear(smp_processor_id(), online_new);
>
>        this_count = 0;
>        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
>                irq = __this_cpu_read(vector_irq[vector]);
>                if (irq >= 0) {
>                        desc = irq_to_desc(irq);
>                        data = irq_desc_get_irq_data(desc);
>                        cpumask_copy(&affinity_new, data->affinity);
>                        cpu_clear(smp_processor_id(), affinity_new);
>                        if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>                            !cpumask_subset(&affinity_new, &online_new) &&
>                            !cpumask_empty(&affinity_new))

If this cpu is the only target, then affinity_new becomes empty.
Should we count it for migration?

Thanks
Rui

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-31  2:58               ` rui wang
@ 2013-12-31 21:22                 ` Prarit Bhargava
  2014-01-02  2:41                   ` Chen, Gong
  0 siblings, 1 reply; 18+ messages in thread
From: Prarit Bhargava @ 2013-12-31 21:22 UTC (permalink / raw)
  To: rui wang
  Cc: Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 12/30/2013 09:58 PM, rui wang wrote:
> On 12/30/13, Prarit Bhargava <prarit@redhat.com> wrote:
>>
>>
>> On 12/30/2013 07:56 AM, rui wang wrote:
>>
> ...
>> Okay, so the big issue is that we need to do the calculation without this cpu,
> 
>>
>> int check_irq_vectors_for_cpu_disable(void)
>> {
>>        int irq, cpu;
>>        unsigned int vector, this_count, count;
>>        struct irq_desc *desc;
>>        struct irq_data *data;
>>        struct cpumask online_new; /* cpu_online_mask - this_cpu */
>>        struct cpumask affinity_new; /* affinity - this_cpu */
>>
>>        cpumask_copy(&online_new, cpu_online_mask);
>>        cpu_clear(smp_processor_id(), online_new);
>>
>>        this_count = 0;
>>        for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
>>                irq = __this_cpu_read(vector_irq[vector]);
>>                if (irq >= 0) {
>>                        desc = irq_to_desc(irq);
>>                        data = irq_desc_get_irq_data(desc);
>>                        cpumask_copy(&affinity_new, data->affinity);
>>                        cpu_clear(smp_processor_id(), affinity_new);
>>                        if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>>                            !cpumask_subset(&affinity_new, &online_new) &&
>>                            !cpumask_empty(&affinity_new))
> 
> If this cpu is the only target, then affinity_new becomes empty.
> Should we count it for migration?

Okay, how about,
                        if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
                            ((!cpumask_empty(&affinity_new)) &&
                              !cpumask_subset(&affinity_new, &online_new)) ||
                             cpumask_empty(&affinity_new))
                                this_count++;

I tried this with the following examples and AFAICT I get the correct result:

1) affinity mask = online mask = 0xf.  CPU 3 (1000b) is down'd.

this_count is not incremented.

2) affinity mask is a non-zero subset of the online mask (which IMO is
the "typical" case).  For example, affinity_mask = 0x9, online mask = 0xf.  CPU
3 is again down'd.

this_count is not incremented.

3) affinity_mask = 0x1, online mask = 0x3. (this is your example).  CPU
0 is going down.

this_count is incremented, as the resulting affinity mask will be 0.

4) affinity_mask = 0x0, online mask = 0x7.  CPU 1 is going down.

this_count is incremented, as the affinity mask is 0.
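
The same four scenarios can be checked with a quick user-space harness
(bitmasks standing in for struct cpumask, and assuming the
irq_has_action()/per-cpu tests have already passed):

#include <stdio.h>

/* 1 if the irq must be counted for migration: affinity minus the
 * dying cpu is empty, or not a subset of the remaining online set */
static int count_for_migration(unsigned int affinity,
                               unsigned int online, int dying)
{
        unsigned int online_new = online & ~(1u << dying);
        unsigned int affinity_new = affinity & ~(1u << dying);

        return affinity_new == 0 ||
               (affinity_new & ~online_new) != 0;
}

int main(void)
{
        printf("%d\n", count_for_migration(0xf, 0xf, 3)); /* 1) 0: skipped */
        printf("%d\n", count_for_migration(0x9, 0xf, 3)); /* 2) 0: skipped */
        printf("%d\n", count_for_migration(0x1, 0x3, 0)); /* 3) 1: counted */
        printf("%d\n", count_for_migration(0x0, 0x7, 1)); /* 4) 1: counted */
        return 0;
}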

P.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2013-12-31 21:22                 ` Prarit Bhargava
@ 2014-01-02  2:41                   ` Chen, Gong
  2014-01-02 12:57                     ` Prarit Bhargava
  0 siblings, 1 reply; 18+ messages in thread
From: Chen, Gong @ 2014-01-02  2:41 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: rui wang, Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua

[-- Attachment #1: Type: text/plain, Size: 2372 bytes --]

On Tue, Dec 31, 2013 at 04:22:09PM -0500, Prarit Bhargava wrote:
> Okay, how about,
>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>                             ((!cpumask_empty(&affinity_new)) &&
>                               !cpumask_subset(&affinity_new, &online_new)) ||
>                              cpumask_empty(&affinity_new))
>                                 this_count++;
> 
I think it is good but a little bit complicated. How about this:

        if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
            /* add some comments to emphasize the importance of the ordering */
            (cpumask_empty(&affinity_new) ||
            !cpumask_subset(&affinity_new, &online_new)))

> I tried this with the following examples and AFAICT I get the correct result:
> 
> 1) affinity mask = online mask = 0xf.  CPU 3 (1000b) is down'd.
> 
> this_count is not incremented.
> 
> 2) affinity mask is a non-zero subset of the online mask (which IMO is
> the "typical" case).  For example, affinity_mask = 0x9, online mask = 0xf.  CPU
> 3 is again down'd.
> 
> this_count is not incremented.
> 
> 3) affinity_mask = 0x1, online mask = 0x3. (this is your example).  CPU
> 0 is going down.
> 
> this_count is incremented, as the resulting affinity mask will be 0.
> 
> 4) affinity_mask = 0x0, online mask = 0x7.  CPU 1 is going down.
> 
> this_count is incremented, as the affinity mask is 0.
> 
The 4th scenario is very tricky. If you try to set affinity from user space,
it will return failure, because before the kernel tries to change the
affinity it verifies it:
int __ioapic_set_affinity(...)
{
...
        if (!cpumask_intersects(mask, cpu_online_mask))
                return -EINVAL;
...
}

So from this point of view, affinity can't be 0. But your patch is very
special because you change it by hand:
        cpu_clear(smp_processor_id(), affinity_new);

so it is reasonable. It makes me think a little bit more. In fixup_irqs
we have similar logic but we don't protect it. Maybe it is because currently
scenario 4 can't happen, because we stop it in advance. But who knows: if
one day we use it in another situation, we will probably hit this subtle
issue.

So, Prarit, I suggest writing another patch to fix this potential issue
in fixup_irqs. What do you think?

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2014-01-02  2:41                   ` Chen, Gong
@ 2014-01-02 12:57                     ` Prarit Bhargava
  2014-01-02 16:04                       ` Prarit Bhargava
  0 siblings, 1 reply; 18+ messages in thread
From: Prarit Bhargava @ 2014-01-02 12:57 UTC (permalink / raw)
  To: rui wang, Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 01/01/2014 09:41 PM, Chen, Gong wrote:
> On Tue, Dec 31, 2013 at 04:22:09PM -0500, Prarit Bhargava wrote:
>> Okay, how about,
>>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>>                             ((!cpumask_empty(&affinity_new)) &&
>>                               !cpumask_subset(&affinity_new, &online_new)) ||
>>                              cpumask_empty(&affinity_new))
>>                                 this_count++;
>>
> I think it is good but a little bit complicated. How about this:
> 
>         if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>             /* add some comments to emphasize the importance of the ordering */
>             (cpumask_empty(&affinity_new) ||
>             !cpumask_subset(&affinity_new, &online_new)))

Yeah :)  I thought of that after I sent it. :)

> 
>> I tried this with the following examples and AFAICT I get the correct result:
>>
>> 1) affinity mask = online mask = 0xf.  CPU 3 (1000b) is down'd.
>>
>> this_count is not incremented.
>>
>> 2) affinity mask is a non-zero subset of the online mask (which IMO is
>> the "typical" case).  For example, affinity_mask = 0x9, online mask = 0xf.  CPU
>> 3 is again down'd.
>>
>> this_count is not incremented.
>>
>> 3) affinity_mask = 0x1, online mask = 0x3. (this is your example).  CPU
>> 0 is going down.
>>
>> this_count is incremented, as the resulting affinity mask will be 0.
>>
>> 4) affinity_mask = 0x0, online mask = 0x7.  CPU 1 is going down.
>>
>> this_count is incremented, as the affinity mask is 0.
>>
> The 4th scenario is very tricky. If you try to set affinity from user space,
> it will return failure, because before the kernel tries to change the
> affinity it verifies it:
> int __ioapic_set_affinity(...)
> {
> ...
>         if (!cpumask_intersects(mask, cpu_online_mask))
>                 return -EINVAL;
> ...
> }
> 
> So from this point of view, affinity can't be 0. But your patch is very
> special because you change it by hand:
>         cpu_clear(smp_processor_id(), affinity_new);
> 
> so it is reasonable. It makes me think a little bit more. In fixup_irqs
> we have similar logic but we don't protect it. Maybe it is because currently
> scenario 4 can't happen, because we stop it in advance. But who knows: if
> one day we use it in another situation, we will probably hit this subtle
> issue.
> 
> So, Prarit, I suggest writing another patch to fix this potential issue
> in fixup_irqs. What do you think?

As you know Rui, I've been staring at that code wondering if it needs a fix.
I'd like to hear Gong Chen's thoughts about it too...

P.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] x86: Add check for number of available vectors before CPU down [v2]
  2014-01-02 12:57                     ` Prarit Bhargava
@ 2014-01-02 16:04                       ` Prarit Bhargava
  0 siblings, 0 replies; 18+ messages in thread
From: Prarit Bhargava @ 2014-01-02 16:04 UTC (permalink / raw)
  To: rui wang, Tony Luck, Linux Kernel Mailing List, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, X86-ML, Michel Lespinasse,
	Andi Kleen, Seiji Aguchi, Yang Zhang, Paul Gortmaker,
	janet.morgan, Yu, Fenghua, chen gong



On 01/02/2014 07:57 AM, Prarit Bhargava wrote:
> 
> 
> On 01/01/2014 09:41 PM, Chen, Gong wrote:
>> On Tue, Dec 31, 2013 at 04:22:09PM -0500, Prarit Bhargava wrote:
>>> Okay, how about,
>>>                         if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>>>                             ((!cpumask_empty(&affinity_new)) &&
>>>                               !cpumask_subset(&affinity_new, &online_new)) ||
>>>                              cpumask_empty(&affinity_new))
>>>                                 this_count++;
>>>
>> I think it is good but a little bit complicated. How about this:
>>
>>         if (irq_has_action(irq) && !irqd_is_per_cpu(data) &&
>>             /* add some comments to emphasize the importance of the ordering */
>>             (cpumask_empty(&affinity_new) ||
>>             !cpumask_subset(&affinity_new, &online_new)))
> 
> Yeah :)  I thought of that after I sent it. :)
> 
>>
>>> I tried this with the following examples and AFAICT I get the correct result:
>>>
>>> 1) affinity mask = online mask = 0xf.  CPU 3 (1000b) is down'd.
>>>
>>> this_count is not incremented.
>>>
>>> 2) affinity mask is a non-zero subset of the online mask (which IMO is
>>> the "typical" case).  For example, affinity_mask = 0x9, online mask = 0xf.  CPU
>>> 3 is again down'd.
>>>
>>> this_count is not incremented.
>>>
>>> 3) affinity_mask = 0x1, online mask = 0x3. (this is your example).  CPU
>>> 0 is going down.
>>>
>>> this_count is incremented, as the resulting affinity mask will be 0.
>>>
>>> 4) affinity_mask = 0x0, online mask = 0x7.  CPU 1 is going down.
>>>
>>> this_count is incremented, as the affinity mask is 0.
>>>
>> The 4th scenario is very tricky. If you try to set affinity from user space,
>> it will return failure, because before the kernel tries to change the
>> affinity it verifies it:
>> int __ioapic_set_affinity(...)
>> {
>> ...
>>         if (!cpumask_intersects(mask, cpu_online_mask))
>>                 return -EINVAL;
>> ...
>> }
>>
>> So from this point of view, affinity can't be 0. But your patch is very
>> special because you change it by hand:
>>         cpu_clear(smp_processor_id(), affinity_new);
>>
>> so it is reasonable. It makes me think a little bit more. In fixup_irqs
>> we have similar logic but we don't protect it. Maybe it is because currently
>> scenario 4 can't happen, because we stop it in advance. But who knows: if
>> one day we use it in another situation, we will probably hit this subtle
>> issue.
>>
>> So, Prarit, I suggest writing another patch to fix this potential issue
>> in fixup_irqs. What do you think?
> 
> As you know Rui, I've been staring at that code wondering if it needs a fix.
> I'd like to hear Gong Chen's thoughts about it too...
> 

Oops -- got my sender header mixed up somehow.  Sorry 'bout that Gong :(

Gong, I think I'd like to go with the above patch for now and then take a closer
look at fixing fixup_irqs() in another patch.  The reason is that I think
the above patch is critical for getting cpu hotplug working on large systems
with many devices.

I'm going to do more testing and will resubmit the patch later in the day.

P.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-01-02 16:04 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-18 19:29 [PATCH] x86: Add check for number of available vectors before CPU down [v2] Prarit Bhargava
2013-12-18 19:50 ` Tony Luck
2013-12-19 18:05   ` Tony Luck
2013-12-19 18:11     ` Prarit Bhargava
2013-12-20  7:18       ` Chen, Gong
2013-12-20  9:41       ` rui wang
2013-12-20 10:49         ` Prarit Bhargava
2013-12-28 17:10         ` Prarit Bhargava
2013-12-30  7:44           ` Chen, Gong
2013-12-30 15:09             ` Prarit Bhargava
2013-12-30 12:56           ` rui wang
2013-12-30 15:08             ` Prarit Bhargava
2013-12-31  2:58               ` rui wang
2013-12-31 21:22                 ` Prarit Bhargava
2014-01-02  2:41                   ` Chen, Gong
2014-01-02 12:57                     ` Prarit Bhargava
2014-01-02 16:04                       ` Prarit Bhargava
2013-12-30 17:22             ` Prarit Bhargava
