* [PATCH] stack and rcu interaction bug in smp_call_function_mask()
@ 2008-08-08 19:37 Venki Pallipadi
2008-08-09 0:14 ` Andi Kleen
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Venki Pallipadi @ 2008-08-08 19:37 UTC
To: Jens Axboe; +Cc: Ingo Molnar, npiggin, linux-kernel, suresh.b.siddha
Found an OOPS on a big SMP box during an overnight reboot test with upstream git.
Suresh and I looked at the oops, and it looks like the root cause is in
generic_smp_call_function_interrupt() and smp_call_function_mask() with
the wait parameter.
The actual oops looked like
[ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
[ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
[ 11.278155] PGD 202063 PUD 0
[ 11.278576] Oops: 0010 [1] SMP
[ 11.279006] CPU 5
[ 11.279336] Modules linked in:
[ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
[ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
[ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
[ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
[ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
[ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
[ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
[ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
[ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
[ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
[ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
[ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
[ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
[ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
[ 11.287120] Call Trace:
[ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
[ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
[ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
[ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
[ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
[ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
[ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
[ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
[ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
[ 11.291010]
[ 11.291287]
[ 11.291654] Code: Bad RIP value.
[ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
[ 11.292380] RSP <ffff88027f1f7f70>
[ 11.292741] CR2: ffff8802ffffffff
[ 11.310951] ---[ end trace 137c54d525305f1c ]---
The problem is with the following sequence of events:
- CPU A calls smp_call_function_mask() for CPU B with wait parameter
- CPU A sets up the call_function_data on the stack and does an rcu add to
call_function_queue
- CPU A waits until the WAIT flag is cleared
- CPU B gets the call function interrupt and starts going through the
call_function_queue
- CPU C also gets some other call function interrupt and starts going through
the call_function_queue
- CPU C, which is also going through the call_function_queue, starts referencing
CPU A's stack, as that element is still in call_function_queue
- CPU B finishes the function call that CPU A set up and as there are no other
references to it, rcu deletes the call_function_data (which was from CPU A
stack)
- CPU B sees the wait flag and just clears the flag (no call_rcu to free)
- CPU A which was waiting on the flag continues executing and the stack
contents change
- CPU C is still in its rcu_read section, accessing CPU A's stack; it sees
inconsistent call_function_data and can try to execute the
function with some random pointer, causing stack corruption for A
(by clearing bits in the mask field) and the oops.
One way to solve the problem is to have CPU A wait as long as there is an RCU
read happening. But, we cannot use synchronize_rcu() here as we cannot block.
Another way around is to always allocate call_function_data, instead
of using CPU A's stack. The patch below does this. But, that will still have to
handle the kmalloc failure case somehow.
Any other ideas on how to solve this problem?
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
kernel/smp.c | 40 +++++++++++++++++++---------------------
1 file changed, 19 insertions(+), 21 deletions(-)
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c 2008-08-08 11:45:55.000000000 -0700
+++ linux-2.6/kernel/smp.c 2008-08-08 12:25:07.000000000 -0700
@@ -128,6 +128,8 @@ void generic_smp_call_function_interrupt
list_del_rcu(&data->csd.list);
spin_unlock(&call_function_lock);
+ if (data->csd.flags & CSD_FLAG_ALLOC)
+ call_rcu(&data->rcu_head, rcu_free_call_data);
if (data->csd.flags & CSD_FLAG_WAIT) {
/*
* serialize stores to data with the flag clear
@@ -135,8 +137,7 @@ void generic_smp_call_function_interrupt
*/
smp_wmb();
data->csd.flags &= ~CSD_FLAG_WAIT;
- } else
- call_rcu(&data->rcu_head, rcu_free_call_data);
+ }
}
rcu_read_unlock();
@@ -181,11 +182,12 @@ void generic_smp_call_function_single_in
data->func(data->info);
+ if (data_flags & CSD_FLAG_ALLOC)
+ kfree(data);
if (data_flags & CSD_FLAG_WAIT) {
smp_wmb();
data->flags &= ~CSD_FLAG_WAIT;
- } else if (data_flags & CSD_FLAG_ALLOC)
- kfree(data);
+ }
}
/*
* See comment on outer loop
@@ -207,7 +209,6 @@ void generic_smp_call_function_single_in
int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
int wait)
{
- struct call_single_data d;
unsigned long flags;
/* prevent preemption and reschedule on another processor */
int me = get_cpu();
@@ -222,13 +223,12 @@ int smp_call_function_single(int cpu, vo
} else {
struct call_single_data *data = NULL;
- if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
- data = &d;
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (!data)
+ return -1;
+
+ data->flags = CSD_FLAG_ALLOC;
+ if (wait) {
data->flags = CSD_FLAG_WAIT;
}
@@ -280,7 +280,6 @@ void __smp_call_function_single(int cpu,
int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
int wait)
{
- struct call_function_data d;
struct call_function_data *data = NULL;
cpumask_t allbutself;
unsigned long flags;
@@ -306,15 +305,14 @@ int smp_call_function_mask(cpumask_t mas
return smp_call_function_single(cpu, func, info, wait);
}
- if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->csd.flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
- data = &d;
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (!data)
+ return -1;
+
+ data->csd.flags = CSD_FLAG_ALLOC;
+
+ if (wait) {
data->csd.flags = CSD_FLAG_WAIT;
- wait = 1;
}
spin_lock_init(&data->lock);
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-08 19:37 [PATCH] stack and rcu interaction bug in smp_call_function_mask() Venki Pallipadi
@ 2008-08-09 0:14 ` Andi Kleen
2008-08-09 2:17 ` Jeremy Fitzhardinge
2008-08-10 6:24 ` Nick Piggin
2 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2008-08-09 0:14 UTC
To: Venki Pallipadi
Cc: Jens Axboe, Ingo Molnar, npiggin, linux-kernel, suresh.b.siddha
Venki Pallipadi <venkatesh.pallipadi@intel.com> writes:
>
> One way to solve the problem is to have CPU A wait as long as there is an
> RCU read happening. But, we cannot use synchronize_rcu() here as we cannot block.
> Another way around is to always allocate call_function_data, instead
> of using CPU A's stack. Below patch does this. But, that will still have to
> handle the kmalloc failure case somehow.
>
> Any other ideas on how to solve this problem?
Perhaps it needs a custom RCU cycle? That is possible to define
with some effort.
I would feel uneasy about always allocating. GFP_ATOMIC can fail, and I
expect most callers are not prepared to handle that; in many
cases there is not even a good way to. Now that there are patches
floating around that even use this for the TLB flush, that would be
deadly.
-Andi
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-08 19:37 [PATCH] stack and rcu interaction bug in smp_call_function_mask() Venki Pallipadi
2008-08-09 0:14 ` Andi Kleen
@ 2008-08-09 2:17 ` Jeremy Fitzhardinge
2008-08-10 6:24 ` Nick Piggin
2 siblings, 0 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-09 2:17 UTC
To: Venki Pallipadi
Cc: Jens Axboe, Ingo Molnar, npiggin, linux-kernel, suresh.b.siddha
Venki Pallipadi wrote:
> Found an OOPS on a big SMP box during an overnight reboot test with upstream git.
>
> Suresh and I looked at the oops, and it looks like the root cause is in
> generic_smp_call_function_interrupt() and smp_call_function_mask() with
> the wait parameter.
>
> The actual oops looked like
>
> [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
> [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.278155] PGD 202063 PUD 0
> [ 11.278576] Oops: 0010 [1] SMP
> [ 11.279006] CPU 5
> [ 11.279336] Modules linked in:
> [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
> [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
> [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
> [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
> [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
> [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
> [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
> [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
> [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
> [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
> [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
> [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
> [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
> [ 11.287120] Call Trace:
> [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
> [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
> [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
> [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
> [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
> [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
> [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
> [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
> [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
> [ 11.291010]
> [ 11.291287]
> [ 11.291654] Code: Bad RIP value.
> [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.292380] RSP <ffff88027f1f7f70>
> [ 11.292741] CR2: ffff8802ffffffff
> [ 11.310951] ---[ end trace 137c54d525305f1c ]---
>
> The problem is with the following sequence of events:
>
> - CPU A calls smp_call_function_mask() for CPU B with wait parameter
> - CPU A sets up the call_function_data on the stack and does an rcu add to
> call_function_queue
> - CPU A waits until the WAIT flag is cleared
> - CPU B gets the call function interrupt and starts going through the
> call_function_queue
> - CPU C also gets some other call function interrupt and starts going through
> the call_function_queue
> - CPU C, which is also going through the call_function_queue, starts referencing
> CPU A's stack, as that element is still in call_function_queue
> - CPU B finishes the function call that CPU A set up and as there are no other
> references to it, rcu deletes the call_function_data (which was from CPU A
> stack)
> - CPU B sees the wait flag and just clears the flag (no call_rcu to free)
> - CPU A which was waiting on the flag continues executing and the stack
> contents change
>
> - CPU C is still in its rcu_read section, accessing CPU A's stack; it sees
> inconsistent call_function_data and can try to execute the
> function with some random pointer, causing stack corruption for A
> (by clearing bits in the mask field) and the oops.
>
Ah! Thank you! I had seen this crash a couple of times, but I couldn't
work out how to reproduce it reliably enough to root-cause it. I'd also only
ever seen it while running under Xen, so I wasn't sure whether it was
Xen-specific or not.
See, Ingo, I wasn't imagining it ;)
> One way to solve the problem is to have CPU A wait as long as there is an
> RCU read happening.
OK, looks like I could get that if a vcpu was preempted while the rcu
read is happening.
> But, we cannot use synchronize_rcu() here as we cannot block.
> Another way around is to always allocate call_function_data, instead
> of using CPU A's stack. Below patch does this. But, that will still have to
> handle the kmalloc failure case somehow.
>
> Any other ideas on how to solve this problem?
>
Nothing leaps to mind at the moment. Always allocating would not be
good, as we'd like to use this for tlb flushes.
J
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-08 19:37 [PATCH] stack and rcu interaction bug in smp_call_function_mask() Venki Pallipadi
2008-08-09 0:14 ` Andi Kleen
2008-08-09 2:17 ` Jeremy Fitzhardinge
@ 2008-08-10 6:24 ` Nick Piggin
2008-08-11 3:49 ` Nick Piggin
` (2 more replies)
2 siblings, 3 replies; 12+ messages in thread
From: Nick Piggin @ 2008-08-10 6:24 UTC
To: Venki Pallipadi
Cc: Jens Axboe, Ingo Molnar, npiggin, linux-kernel, suresh.b.siddha
[-- Attachment #1: Type: text/plain, Size: 5024 bytes --]
On Saturday 09 August 2008 05:37, Venki Pallipadi wrote:
> Found an OOPS on a big SMP box during an overnight reboot test with upstream
> git.
>
> Suresh and I looked at the oops, and it looks like the root cause is in
> generic_smp_call_function_interrupt() and smp_call_function_mask() with
> the wait parameter.
>
> The actual oops looked like
>
> [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
> [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.278155] PGD 202063 PUD 0
> [ 11.278576] Oops: 0010 [1] SMP
> [ 11.279006] CPU 5
> [ 11.279336] Modules linked in:
> [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
> [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
> [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
> [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
> [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
> [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
> [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
> [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
> [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
> [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
> [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
> [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
> [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
> [ 11.287120] Call Trace:
> [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
> [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
> [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
> [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
> [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
> [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
> [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
> [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
> [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
> [ 11.291010]
> [ 11.291287]
> [ 11.291654] Code: Bad RIP value.
> [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.292380] RSP <ffff88027f1f7f70>
> [ 11.292741] CR2: ffff8802ffffffff
> [ 11.310951] ---[ end trace 137c54d525305f1c ]---
>
> The problem is with the following sequence of events:
>
> - CPU A calls smp_call_function_mask() for CPU B with wait parameter
> - CPU A sets up the call_function_data on the stack and does an rcu add to
> call_function_queue
> - CPU A waits until the WAIT flag is cleared
> - CPU B gets the call function interrupt and starts going through the
> call_function_queue
> - CPU C also gets some other call function interrupt and starts going
> through the call_function_queue
> - CPU C, which is also going through the call_function_queue, starts
> referencing CPU A's stack, as that element is still in call_function_queue
> - CPU B finishes the function call that CPU A set up and as there are no
> other references to it, rcu deletes the call_function_data (which was from
> CPU A stack)
> - CPU B sees the wait flag and just clears the flag (no call_rcu to free)
> - CPU A which was waiting on the flag continues executing and the stack
> contents change
>
> - CPU C is still in its rcu_read section, accessing CPU A's stack; it sees
> inconsistent call_function_data and can try to execute the
> function with some random pointer, causing stack corruption for A
> (by clearing bits in the mask field) and the oops.
>
>
> One way to solve the problem is to have CPU A wait as long as there is an
> RCU read happening. But, we cannot use synchronize_rcu() here as we cannot
> block. Another way around is to always allocate call_function_data, instead
> of using CPU A's stack. The patch below does this. But, that will still have to
> handle the kmalloc failure case somehow.
>
> Any other ideas on how to solve this problem?
Nice debugging work.
I'd suggest something like the attached (untested) patch as the simple
fix for now.
I expect the benefits from the less synchronized, multiple-in-flight-data
global queue will still outweigh the costs of dynamic allocations. But
if worst comes to worst then we just go back to a globally synchronous
one-at-a-time implementation, but that would be pretty sad!
[-- Attachment #2: kernel-smp-call-function-fix.patch --]
[-- Type: text/x-diff, Size: 2596 bytes --]
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -260,6 +260,41 @@ void __smp_call_function_single(int cpu,
generic_exec_single(cpu, data);
}
+/* Dummy function */
+static void quiesce_dummy(void *unused)
+{
+}
+
+/*
+ * Ensure stack based data used in call function mask is safe to free.
+ *
+ * This is needed by smp_call_function_mask when using on-stack data, because
+ * a single call function queue is shared by all CPUs, and any CPU may pick up
+ * the data item on the queue at any time before it is deleted. So we need to
+ * ensure that all CPUs have transitioned through a quiescent state after
+ * this call.
+ *
+ * This is a very slow function, implemented by sending synchronous IPIs to
+ * all possible CPUs. For this reason, we have to alloc data rather than use
+ * stack based data even in the case of synchronous calls. The stack based
+ * data is then just used for deadlock/oom fallback which will be very rare.
+ *
+ * If a faster scheme can be made, we could go back to preferring stack based
+ * data -- the data allocation/free is non-zero cost.
+ */
+static void smp_call_function_mask_quiesce_stack(cpumask_t mask)
+{
+ struct call_single_data data;
+ int cpu;
+
+ data.func = quiesce_dummy;
+ data.info = NULL;
+ data.flags = CSD_FLAG_WAIT;
+
+ for_each_cpu_mask(cpu, mask)
+ generic_exec_single(cpu, &data);
+}
+
/**
* smp_call_function_mask(): Run a function on a set of other CPUs.
* @mask: The set of cpus to run on.
@@ -285,6 +320,7 @@ int smp_call_function_mask(cpumask_t mas
cpumask_t allbutself;
unsigned long flags;
int cpu, num_cpus;
+ int slowpath = 0;
/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
@@ -306,15 +342,14 @@ int smp_call_function_mask(cpumask_t mas
return smp_call_function_single(cpu, func, info, wait);
}
- if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->csd.flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (data)
+ data->csd.flags = CSD_FLAG_ALLOC;
+ else {
data = &d;
data->csd.flags = CSD_FLAG_WAIT;
wait = 1;
+ slowpath = 1;
}
spin_lock_init(&data->lock);
@@ -331,8 +366,11 @@ int smp_call_function_mask(cpumask_t mas
arch_send_call_function_ipi(mask);
/* optionally wait for the CPUs to complete */
- if (wait)
+ if (wait) {
csd_flag_wait(&data->csd);
+ if (unlikely(slowpath))
+ smp_call_function_mask_quiesce_stack(allbutself);
+ }
return 0;
}
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-10 6:24 ` Nick Piggin
@ 2008-08-11 3:49 ` Nick Piggin
2008-08-11 13:22 ` Ingo Molnar
2008-08-11 4:26 ` Jeremy Fitzhardinge
2008-08-21 20:50 ` Jeremy Fitzhardinge
2 siblings, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2008-08-11 3:49 UTC
To: Venki Pallipadi
Cc: Jens Axboe, Ingo Molnar, npiggin, linux-kernel, suresh.b.siddha
[-- Attachment #1: Type: text/plain, Size: 515 bytes --]
On Sunday 10 August 2008 16:24, Nick Piggin wrote:
> I'd suggest something like the attached (untested) patch as the simple
> fix for now.
>
> I expect the benefits from the less synchronized, multiple-in-flight-data
> global queue will still outweigh the costs of dynamic allocations. But
> if worst comes to worst then we just go back to a globally synchronous
> one-at-a-time implementation, but that would be pretty sad!
Just needed a little fix and it appears to boot now. I think it does
the right thing...
[-- Attachment #2: kernel-smp-call-function-fix.patch --]
[-- Type: text/x-diff, Size: 2651 bytes --]
Index: linux-2.6/kernel/smp.c
===================================================================
--- linux-2.6.orig/kernel/smp.c
+++ linux-2.6/kernel/smp.c
@@ -260,6 +260,41 @@ void __smp_call_function_single(int cpu,
generic_exec_single(cpu, data);
}
+/* Dummy function */
+static void quiesce_dummy(void *unused)
+{
+}
+
+/*
+ * Ensure stack based data used in call function mask is safe to free.
+ *
+ * This is needed by smp_call_function_mask when using on-stack data, because
+ * a single call function queue is shared by all CPUs, and any CPU may pick up
+ * the data item on the queue at any time before it is deleted. So we need to
+ * ensure that all CPUs have transitioned through a quiescent state after
+ * this call.
+ *
+ * This is a very slow function, implemented by sending synchronous IPIs to
+ * all possible CPUs. For this reason, we have to alloc data rather than use
+ * stack based data even in the case of synchronous calls. The stack based
+ * data is then just used for deadlock/oom fallback which will be very rare.
+ *
+ * If a faster scheme can be made, we could go back to preferring stack based
+ * data -- the data allocation/free is non-zero cost.
+ */
+static void smp_call_function_mask_quiesce_stack(cpumask_t mask)
+{
+ struct call_single_data data;
+ int cpu;
+
+ data.func = quiesce_dummy;
+ data.info = NULL;
+ data.flags = CSD_FLAG_WAIT;
+
+ for_each_cpu_mask(cpu, mask)
+ generic_exec_single(cpu, &data);
+}
+
/**
* smp_call_function_mask(): Run a function on a set of other CPUs.
* @mask: The set of cpus to run on.
@@ -285,6 +320,7 @@ int smp_call_function_mask(cpumask_t mas
cpumask_t allbutself;
unsigned long flags;
int cpu, num_cpus;
+ int slowpath = 0;
/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
@@ -306,15 +342,16 @@ int smp_call_function_mask(cpumask_t mas
return smp_call_function_single(cpu, func, info, wait);
}
- if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->csd.flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (data) {
+ data->csd.flags = CSD_FLAG_ALLOC;
+ if (wait)
+ data->csd.flags |= CSD_FLAG_WAIT;
+ } else {
data = &d;
data->csd.flags = CSD_FLAG_WAIT;
wait = 1;
+ slowpath = 1;
}
spin_lock_init(&data->lock);
@@ -331,8 +368,11 @@ int smp_call_function_mask(cpumask_t mas
arch_send_call_function_ipi(mask);
/* optionally wait for the CPUs to complete */
- if (wait)
+ if (wait) {
csd_flag_wait(&data->csd);
+ if (unlikely(slowpath))
+ smp_call_function_mask_quiesce_stack(allbutself);
+ }
return 0;
}
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-11 3:49 ` Nick Piggin
@ 2008-08-11 13:22 ` Ingo Molnar
0 siblings, 0 replies; 12+ messages in thread
From: Ingo Molnar @ 2008-08-11 13:22 UTC
To: Nick Piggin
Cc: Venki Pallipadi, Jens Axboe, npiggin, linux-kernel,
suresh.b.siddha, Jeremy Fitzhardinge, H. Peter Anvin,
Thomas Gleixner
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Sunday 10 August 2008 16:24, Nick Piggin wrote:
>
> > I'd suggest something like the attached (untested) patch as the simple
> > fix for now.
> >
> > I expect the benefits from the less synchronized,
> > multiple-in-flight-data global queue will still outweigh the costs
> > of dynamic allocations. But if worst comes to worst then we just go
> > back to a globally synchronous one-at-a-time implementation, but
> > that would be pretty sad!
>
> Just needed a little fix and it appears to boot now. I think it does
> the right thing...
nice find! I've queued it up in tip/core/urgent and started testing it.
Full commit is quoted below.
Ingo
------------------------>
From cc7a486cac78f6fc1a24e8cd63036bae8d2ab431 Mon Sep 17 00:00:00 2001
From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Mon, 11 Aug 2008 13:49:30 +1000
Subject: [PATCH] generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()
* Venki Pallipadi <venkatesh.pallipadi@intel.com> wrote:
> Found an OOPS on a big SMP box during an overnight reboot test with
> upstream git.
>
> Suresh and I looked at the oops, and it looks like the root cause is in
> generic_smp_call_function_interrupt() and smp_call_function_mask() with
> the wait parameter.
>
> The actual oops looked like
>
> [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
> [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.278155] PGD 202063 PUD 0
> [ 11.278576] Oops: 0010 [1] SMP
> [ 11.279006] CPU 5
> [ 11.279336] Modules linked in:
> [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
> [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
> [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
> [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
> [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
> [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
> [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
> [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
> [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
> [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
> [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
> [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
> [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
> [ 11.287120] Call Trace:
> [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
> [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
> [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
> [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
> [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
> [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
> [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
> [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
> [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
> [ 11.291010]
> [ 11.291287]
> [ 11.291654] Code: Bad RIP value.
> [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
> [ 11.292380] RSP <ffff88027f1f7f70>
> [ 11.292741] CR2: ffff8802ffffffff
> [ 11.310951] ---[ end trace 137c54d525305f1c ]---
>
> The problem is with the following sequence of events:
>
> - CPU A calls smp_call_function_mask() for CPU B with wait parameter
> - CPU A sets up the call_function_data on the stack and does an rcu add to
> call_function_queue
> - CPU A waits until the WAIT flag is cleared
> - CPU B gets the call function interrupt and starts going through the
> call_function_queue
> - CPU C also gets some other call function interrupt and starts going through
> the call_function_queue
> - CPU C, which is also going through the call_function_queue, starts referencing
> CPU A's stack, as that element is still in call_function_queue
> - CPU B finishes the function call that CPU A set up and as there are no other
> references to it, rcu deletes the call_function_data (which was from CPU A
> stack)
> - CPU B sees the wait flag and just clears the flag (no call_rcu to free)
> - CPU A which was waiting on the flag continues executing and the stack
> contents change
>
> - CPU C is still in its rcu_read section, accessing CPU A's stack; it sees
> inconsistent call_function_data and can try to execute the
> function with some random pointer, causing stack corruption for A
> (by clearing bits in the mask field) and the oops.
Nice debugging work.
I'd suggest something like the attached (boot tested) patch as the simple
fix for now.
I expect the benefits from the less synchronized, multiple-in-flight-data
global queue will still outweigh the costs of dynamic allocations. But
if worst comes to worst then we just go back to a globally synchronous
one-at-a-time implementation, but that would be pretty sad!
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/smp.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 47 insertions(+), 7 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 96fc7c0..e6084f6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -260,6 +260,41 @@ void __smp_call_function_single(int cpu, struct call_single_data *data)
generic_exec_single(cpu, data);
}
+/* Dummy function */
+static void quiesce_dummy(void *unused)
+{
+}
+
+/*
+ * Ensure stack based data used in call function mask is safe to free.
+ *
+ * This is needed by smp_call_function_mask when using on-stack data, because
+ * a single call function queue is shared by all CPUs, and any CPU may pick up
+ * the data item on the queue at any time before it is deleted. So we need to
+ * ensure that all CPUs have transitioned through a quiescent state after
+ * this call.
+ *
+ * This is a very slow function, implemented by sending synchronous IPIs to
+ * all possible CPUs. For this reason, we have to alloc data rather than use
+ * stack based data even in the case of synchronous calls. The stack based
+ * data is then just used for deadlock/oom fallback which will be very rare.
+ *
+ * If a faster scheme can be made, we could go back to preferring stack based
+ * data -- the data allocation/free is non-zero cost.
+ */
+static void smp_call_function_mask_quiesce_stack(cpumask_t mask)
+{
+ struct call_single_data data;
+ int cpu;
+
+ data.func = quiesce_dummy;
+ data.info = NULL;
+ data.flags = CSD_FLAG_WAIT;
+
+ for_each_cpu_mask(cpu, mask)
+ generic_exec_single(cpu, &data);
+}
+
/**
* smp_call_function_mask(): Run a function on a set of other CPUs.
* @mask: The set of cpus to run on.
@@ -285,6 +320,7 @@ int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
cpumask_t allbutself;
unsigned long flags;
int cpu, num_cpus;
+ int slowpath = 0;
/* Can deadlock when called with interrupts disabled */
WARN_ON(irqs_disabled());
@@ -306,15 +342,16 @@ int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
return smp_call_function_single(cpu, func, info, wait);
}
- if (!wait) {
- data = kmalloc(sizeof(*data), GFP_ATOMIC);
- if (data)
- data->csd.flags = CSD_FLAG_ALLOC;
- }
- if (!data) {
+ data = kmalloc(sizeof(*data), GFP_ATOMIC);
+ if (data) {
+ data->csd.flags = CSD_FLAG_ALLOC;
+ if (wait)
+ data->csd.flags |= CSD_FLAG_WAIT;
+ } else {
data = &d;
data->csd.flags = CSD_FLAG_WAIT;
wait = 1;
+ slowpath = 1;
}
spin_lock_init(&data->lock);
@@ -331,8 +368,11 @@ int smp_call_function_mask(cpumask_t mask, void (*func)(void *), void *info,
arch_send_call_function_ipi(mask);
/* optionally wait for the CPUs to complete */
- if (wait)
+ if (wait) {
csd_flag_wait(&data->csd);
+ if (unlikely(slowpath))
+ smp_call_function_mask_quiesce_stack(allbutself);
+ }
return 0;
}
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-10 6:24 ` Nick Piggin
2008-08-11 3:49 ` Nick Piggin
@ 2008-08-11 4:26 ` Jeremy Fitzhardinge
2008-08-11 4:34 ` Arjan van de Ven
2008-08-11 4:49 ` Nick Piggin
2008-08-21 20:50 ` Jeremy Fitzhardinge
2 siblings, 2 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-11 4:26 UTC (permalink / raw)
To: Nick Piggin
Cc: Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin, linux-kernel,
suresh.b.siddha
Nick Piggin wrote:
> Nice debugging work.
>
> I'd suggest something like the attached (untested) patch as the simple
> fix for now.
>
> I expect the benefits from the less synchronized, multiple-in-flight-data
> global queue will still outweigh the costs of dynamic allocations. But
> if worst comes to worst then we just go back to a globally synchronous
> one-at-a-time implementation, but that would be pretty sad!
>
What if we went the other way and strictly used queue-per-cpu? It means
multicast would require multiple enqueueing operations, which is a bit
heavy, but it does make dequeuing and lifetime management very simple...
J
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-11 4:26 ` Jeremy Fitzhardinge
@ 2008-08-11 4:34 ` Arjan van de Ven
2008-08-11 18:18 ` Jeremy Fitzhardinge
2008-08-11 4:49 ` Nick Piggin
1 sibling, 1 reply; 12+ messages in thread
From: Arjan van de Ven @ 2008-08-11 4:34 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Nick Piggin, Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin,
linux-kernel, suresh.b.siddha
On Sun, 10 Aug 2008 21:26:18 -0700
Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Nick Piggin wrote:
> > Nice debugging work.
> >
> > I'd suggest something like the attached (untested) patch as the
> > simple fix for now.
> >
> > I expect the benefits from the less synchronized,
> > multiple-in-flight-data global queue will still outweigh the costs
> > of dynamic allocations. But if worst comes to worst then we just go
> > back to a globally synchronous one-at-a-time implementation, but
> > that would be pretty sad!
>
> What if we went the other way and strictly used queue-per-cpu? It
> means multicast would require multiple enqueueing operations, which
> is a bit heavy, but it does make dequeuing and lifetime management
> very simple...
as long as send-to-all is still one apic operation.. otherwise it gets
*really* expensive....
(just think about waking all cpus up out of their C-states.. one by one
getting the full exit latency sequentially)
--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-11 4:34 ` Arjan van de Ven
@ 2008-08-11 18:18 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-11 18:18 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Nick Piggin, Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin,
linux-kernel, suresh.b.siddha
Arjan van de Ven wrote:
> On Sun, 10 Aug 2008 21:26:18 -0700
> Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>
>> Nick Piggin wrote:
>>
>>> Nice debugging work.
>>>
>>> I'd suggest something like the attached (untested) patch as the
>>> simple fix for now.
>>>
>>> I expect the benefits from the less synchronized,
>>> multiple-in-flight-data global queue will still outweigh the costs
>>> of dynamic allocations. But if worst comes to worst then we just go
>>> back to a globally synchronous one-at-a-time implementation, but
>>> that would be pretty sad!
>>>
>> What if we went the other way and strictly used queue-per-cpu? It
>> means multicast would require multiple enqueueing operations, which
>> is a bit heavy, but it does make dequeuing and lifetime management
>> very simple...
>>
>
> as long as send-to-all is still one apic operation.. otherwise it gets
> *really* expensive....
> (just think about waking all cpus up out of their C-states.. one by one
> getting the full exit latency sequentially)
>
Well, send-to-all can be special cased (is already at the apic IPI
level, but we could have a broadcast queue as well).
But I wonder how common an operation that is? Most calls to
smp_call_function_mask are sending to mm->cpu_vm_mask. For a small
number of cores, that could well be broadcast, but as the core count
goes up, the likelihood that all cpus have been involved with a given mm
will go down (very workload dependent, of course).
It could be that if we're sending to more than some proportion of the
cpus, it would be more efficient to just broadcast, and let the cpus
work out whether they need to do anything or not. But that's more or
less the scheme we have now...
J
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-11 4:26 ` Jeremy Fitzhardinge
2008-08-11 4:34 ` Arjan van de Ven
@ 2008-08-11 4:49 ` Nick Piggin
2008-08-11 18:27 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 12+ messages in thread
From: Nick Piggin @ 2008-08-11 4:49 UTC (permalink / raw)
To: Jeremy Fitzhardinge
Cc: Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin, linux-kernel,
suresh.b.siddha
On Monday 11 August 2008 14:26, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Nice debugging work.
> >
> > I'd suggest something like the attached (untested) patch as the simple
> > fix for now.
> >
> > I expect the benefits from the less synchronized, multiple-in-flight-data
> > global queue will still outweigh the costs of dynamic allocations. But
> > if worst comes to worst then we just go back to a globally synchronous
> > one-at-a-time implementation, but that would be pretty sad!
>
> What if we went the other way and strictly used queue-per-cpu? It means
> multicast would require multiple enqueueing operations, which is a bit
> heavy, but it does make dequeuing and lifetime management very simple...
Well that's implemented with the optimized call-single code of course,
so it could be used to implement the masked calls...
I had wanted to look into finding a good cutoff point and use the
percpu queues for light weight masks, and the single global queue for
larger ones.
Queue per cpu is not going to be perfect, though. In the current
implementation, you would need a lot of data structures. You could
alleviate this problem by using per CPU vectors rather than lists,
but then you get the added problem of resource starvation at the
remote end too.
For heavy weight masks on large systems, the single queue I'd say
will be a win. But I never did detailed measurements, so I'm open
to be proven wrong.
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-11 4:49 ` Nick Piggin
@ 2008-08-11 18:27 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-11 18:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin, linux-kernel,
suresh.b.siddha
Nick Piggin wrote:
> Well that's implemented with the optimized call-single code of course,
> so it could be used to implement the masked calls...
>
> I had wanted to look into finding a good cutoff point and use the
> percpu queues for light weight masks, and the single global queue for
> larger ones.
>
> Queue per cpu is not going to be perfect, though. In the current
> implementation, you would need a lot of data structures. You could
> alleviate this problem by using per CPU vectors rather than lists,
> but then you get the added problem of resource starvation at the
> remote end too.
>
> For heavy weight masks on large systems, the single queue I'd say
> will be a win. But I never did detailed measurements, so I'm open
> to be proven wrong.
>
Yeah, there's a lot of parameters there. And as I've mentioned before,
I wonder whether we should take NUMA topology into account when deciding
where and when to use queues. My intuition is that most cross-cpu calls
are going to be within cpus on a node, on the grounds that most are
mm->cpu_vm_mask calls, and the rest of the system tries hard to
co-locate processes sharing memory on one node.
Waffle, handwave.
J
* Re: [PATCH] stack and rcu interaction bug in smp_call_function_mask()
2008-08-10 6:24 ` Nick Piggin
2008-08-11 3:49 ` Nick Piggin
2008-08-11 4:26 ` Jeremy Fitzhardinge
@ 2008-08-21 20:50 ` Jeremy Fitzhardinge
2 siblings, 0 replies; 12+ messages in thread
From: Jeremy Fitzhardinge @ 2008-08-21 20:50 UTC (permalink / raw)
To: Nick Piggin
Cc: Venki Pallipadi, Jens Axboe, Ingo Molnar, npiggin, linux-kernel,
suresh.b.siddha
Nick Piggin wrote:
> On Saturday 09 August 2008 05:37, Venki Pallipadi wrote:
>
>> Found a OOPS on a big SMP box during an overnight reboot test with upstream
>> git.
>>
>> Suresh and I looked at the oops and looks like the root cause is in
>> generic_smp_call_function_interrupt() and smp_call_function_mask() with
>> wait parameter.
>>
>> The actual oops looked like
>>
>> [ 11.277260] BUG: unable to handle kernel paging request at ffff8802ffffffff
>> [ 11.277815] IP: [<ffff8802ffffffff>] 0xffff8802ffffffff
>> [ 11.278155] PGD 202063 PUD 0
>> [ 11.278576] Oops: 0010 [1] SMP
>> [ 11.279006] CPU 5
>> [ 11.279336] Modules linked in:
>> [ 11.279752] Pid: 0, comm: swapper Not tainted 2.6.27-rc2-00020-g685d87f #290
>> [ 11.280039] RIP: 0010:[<ffff8802ffffffff>] [<ffff8802ffffffff>] 0xffff8802ffffffff
>> [ 11.280692] RSP: 0018:ffff88027f1f7f70 EFLAGS: 00010086
>> [ 11.280976] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000000000
>> [ 11.281264] RDX: 0000000000004f4e RSI: 0000000000000001 RDI: 0000000000000000
>> [ 11.281624] RBP: ffff88027f1f7f98 R08: 0000000000000001 R09: ffffffff802509af
>> [ 11.281925] R10: ffff8800280c2780 R11: 0000000000000000 R12: ffff88027f097d48
>> [ 11.282214] R13: ffff88027f097d70 R14: 0000000000000005 R15: ffff88027e571000
>> [ 11.282502] FS: 0000000000000000(0000) GS:ffff88027f1c3340(0000) knlGS:0000000000000000
>> [ 11.283096] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> [ 11.283382] CR2: ffff8802ffffffff CR3: 0000000000201000 CR4: 00000000000006e0
>> [ 11.283760] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 11.284048] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [ 11.284337] Process swapper (pid: 0, threadinfo ffff88027f1f2000, task ffff88027f1f0640)
>> [ 11.284936] Stack: ffffffff80250963 0000000000000212 0000000000ee8c78 0000000000ee8a66
>> [ 11.285802] ffff88027e571550 ffff88027f1f7fa8 ffffffff8021adb5 ffff88027f1f3e40
>> [ 11.286599] ffffffff8020bdd6 ffff88027f1f3e40 <EOI> ffff88027f1f3ef8 0000000000000000
>> [ 11.287120] Call Trace:
>> [ 11.287768] <IRQ> [<ffffffff80250963>] ? generic_smp_call_function_interrupt+0x61/0x12c
>> [ 11.288354] [<ffffffff8021adb5>] smp_call_function_interrupt+0x17/0x27
>> [ 11.288744] [<ffffffff8020bdd6>] call_function_interrupt+0x66/0x70
>> [ 11.289030] <EOI> [<ffffffff8024ab3b>] ? clockevents_notify+0x19/0x73
>> [ 11.289380] [<ffffffff803b9b75>] ? acpi_idle_enter_simple+0x18b/0x1fa
>> [ 11.289760] [<ffffffff803b9b6b>] ? acpi_idle_enter_simple+0x181/0x1fa
>> [ 11.290051] [<ffffffff8053aeca>] ? cpuidle_idle_call+0x70/0xa2
>> [ 11.290338] [<ffffffff80209f61>] ? cpu_idle+0x5f/0x7d
>> [ 11.290723] [<ffffffff8060224a>] ? start_secondary+0x14d/0x152
>> [ 11.291010]
>> [ 11.291287]
>> [ 11.291654] Code: Bad RIP value.
>> [ 11.292041] RIP [<ffff8802ffffffff>] 0xffff8802ffffffff
>> [ 11.292380] RSP <ffff88027f1f7f70>
>> [ 11.292741] CR2: ffff8802ffffffff
>> [ 11.310951] ---[ end trace 137c54d525305f1c ]---
>>
>> The problem is with the following sequence of events:
>>
>> - CPU A calls smp_call_function_mask() for CPU B with wait parameter
>> - CPU A sets up the call_function_data on the stack and does an rcu add to
>> call_function_queue
>> - CPU A waits until the WAIT flag is cleared
>> - CPU B gets the call function interrupt and starts going through the
>> call_function_queue
>> - CPU C also gets some other call function interrupt and starts going
>> through the call_function_queue
>> - CPU C, which is also going through the call_function_queue, starts
>> referencing CPU A's stack, as that element is still in call_function_queue
>> - CPU B finishes the function call that CPU A set up and as there are no
>> other references to it, rcu deletes the call_function_data (which was from
>> CPU A stack)
>> - CPU B sees the wait flag and just clears the flag (no call_rcu to free)
>> - CPU A which was waiting on the flag continues executing and the stack
>> contents change
>>
>> - CPU C, still inside an rcu_read section accessing CPU A's stack, sees
>> inconsistent call_function_data and can try to execute the
>> function with some random pointer, causing stack corruption for A
>> (by clearing the bits in the mask field) and an oops.
>>
>>
>> One way to solve the problem is to have CPU A wait as long as there is a
>> rcu read happening. But, we cannot use synchronize_rcu() here as we cannot
>> block. Another way around is to always allocate call_function_data, instead
>> of using CPU A's stack. Below patch does this. But, that will still have to
>> handle the kmalloc failure case somehow.
>>
>> Any other ideas on how to solve this problem?
>>
>
> Nice debugging work.
>
> I'd suggest something like the attached (untested) patch as the simple
> fix for now.
>
> I expect the benefits from the less synchronized, multiple-in-flight-data
> global queue will still outweigh the costs of dynamic allocations. But
> if worst comes to worst then we just go back to a globally synchronous
> one-at-a-time implementation, but that would be pretty sad!
>
So the upshot of this is that when using stack allocation, the rcu stuff
is more or less useless, because rcu has no influence over the lifetime
of the stack-allocated structures. We need to move to allocating all the
data, so that rcu can control the memory lifetime.
Doesn't that just mean we should avoid RCU for stack allocated data
altogether, and instead have the queue controlled by an rwlock? A cpu
which adds a call to the queue does something like:
1. write_lock(&queue_lock);
2. add element
3. write_unlock(&queue_lock);
4. csd_flag_wait(...);
5. write_lock(&queue_lock);
6. remove element
7. write_unlock(&queue_lock);
and a CPU running a function would do:
1. read_lock(&queue_lock);
2. do stuff
3. if (refs-- == 1) data->csd.flags &= ~CSD_FLAG_WAIT;
4. read_unlock(&queue_lock);
And of course using multiple queues would mitigate the locking cost.
Perhaps it would make sense to have separate queue(s) for heap vs stack
allocated data (=async vs sync calls, in effect)?
J
Thread overview: 12+ messages
2008-08-08 19:37 [PATCH] stack and rcu interaction bug in smp_call_function_mask() Venki Pallipadi
2008-08-09 0:14 ` Andi Kleen
2008-08-09 2:17 ` Jeremy Fitzhardinge
2008-08-10 6:24 ` Nick Piggin
2008-08-11 3:49 ` Nick Piggin
2008-08-11 13:22 ` Ingo Molnar
2008-08-11 4:26 ` Jeremy Fitzhardinge
2008-08-11 4:34 ` Arjan van de Ven
2008-08-11 18:18 ` Jeremy Fitzhardinge
2008-08-11 4:49 ` Nick Piggin
2008-08-11 18:27 ` Jeremy Fitzhardinge
2008-08-21 20:50 ` Jeremy Fitzhardinge