public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/1] sched: divide by 0 error
@ 2008-11-27  2:04 Steven Rostedt
  2008-11-27  2:04 ` [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task Steven Rostedt
  0 siblings, 1 reply; 5+ messages in thread
From: Steven Rostedt @ 2008-11-27  2:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins

There exists a race where we can hit a divide by zero in the scheduler
and crash the system.

The following patches are in:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git

    branch: urgent


Steven Rostedt (1):
      sched: prevent divide by zero error in cpu_avg_load_per_task

----
 kernel/sched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)
-- 


* [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
  2008-11-27  2:04 [PATCH 0/1] sched: divide by 0 error Steven Rostedt
@ 2008-11-27  2:04 ` Steven Rostedt
  2008-11-27  9:29   ` Ingo Molnar
  2008-11-29 19:19   ` Linus Torvalds
  0 siblings, 2 replies; 5+ messages in thread
From: Steven Rostedt @ 2008-11-27  2:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt

[-- Attachment #1: 0001-sched-prevent-divide-by-zero-error-in-cpu_avg_load_.patch --]
[-- Type: text/plain, Size: 1925 bytes --]

From: Steven Rostedt <rostedt@goodmis.org>

Impact: fix to divide by zero

While testing the branch profiler, I hit this crash:

divide error: 0000 [#1] PREEMPT SMP
[...]
RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
[...]
Call Trace:
 <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
 [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
 [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
 [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
 [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
 [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
 [<ffffffff8021047a>] do_softirq+0x94/0x1cd
 [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
 [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
 [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20

The code for cpu_avg_load_per_task has:

	if (rq->nr_running)
		rq->avg_load_per_task = rq->load.weight / rq->nr_running;

The runqueue lock is not held here, and nothing prevents rq->nr_running
from going to zero after it passes the if condition.

The branch profiler simply made the race window bigger.

This patch reads rq->nr_running once into a local variable and uses that
local for both the check and the division.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9b1e793..700aa9a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1453,9 +1453,10 @@ static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
+	unsigned long nr_running = rq->nr_running;
 
-	if (rq->nr_running)
-		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
+	if (nr_running)
+		rq->avg_load_per_task = rq->load.weight / nr_running;
 	else
 		rq->avg_load_per_task = 0;
 
-- 
1.5.6.5

-- 


* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
  2008-11-27  2:04 ` [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task Steven Rostedt
@ 2008-11-27  9:29   ` Ingo Molnar
  2008-11-29 19:19   ` Linus Torvalds
  1 sibling, 0 replies; 5+ messages in thread
From: Ingo Molnar @ 2008-11-27  9:29 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt


* Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
> 
> Impact: fix to divide by zero
> 
> While testing the branch profiler, I hit this crash:
> 
> divide error: 0000 [#1] PREEMPT SMP
> [...]
> Call Trace:
>  <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
>  [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21

> The code for cpu_avg_load_per_task has:
> 
> 	if (rq->nr_running)
> 		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
> 
> The runqueue lock is not held here, and nothing prevents 
> rq->nr_running from going to zero after it passes the if 
> condition.
> 
> The branch profiler simply made the race window bigger.
> 
> This patch reads rq->nr_running once into a local variable and uses 
> that local for both the check and the division.

good catch! Applied to tip/sched/urgent, thanks Steve!

the rebalancer scans remote runqueues without holding the runqueue 
lock for performance reasons, so nr_running indeed has to be loaded 
into a local variable here.

I think it could hit anywhere upstream as well, even without the 
branch tracer: depends on the register pressure and GCC's choices for 
reloading that register. If say FRAME_POINTER is enabled and we are 
running with some more aggressive optimization or just an older/suckier 
GCC that does a spurious reload, then this might happen too.

it's not a classic race; it's an SMP bug that will either trigger or 
not trigger at all, depending on compiler behavior.

	Ingo


* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
  2008-11-27  2:04 ` [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task Steven Rostedt
  2008-11-27  9:29   ` Ingo Molnar
@ 2008-11-29 19:19   ` Linus Torvalds
  2008-11-29 19:50     ` Ingo Molnar
  1 sibling, 1 reply; 5+ messages in thread
From: Linus Torvalds @ 2008-11-29 19:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt



On Wed, 26 Nov 2008, Steven Rostedt wrote:
>  {
>  	struct rq *rq = cpu_rq(cpu);
> +	unsigned long nr_running = rq->nr_running;
>  
> -	if (rq->nr_running)
> -		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
> +	if (nr_running)
> +		rq->avg_load_per_task = rq->load.weight / nr_running;
>  	else
>  		rq->avg_load_per_task = 0;

I don't think this necessarily fixes it.

There's nothing that keeps gcc from deciding not to reload rq->nr_running. 

Of course, in _practice_, I don't think gcc ever will (if it decides that 
it will spill, gcc is likely going to decide that it will literally spill 
the local variable to the stack rather than decide to reload off the 
pointer), but it's a valid compiler optimization, and it even has a name 
(rematerialization).

So I suspect that your patch does fix the bug, but it still leaves the 
fairly unlikely _potential_ for it to re-appear at some point.

We have ACCESS_ONCE() as a macro to guarantee that the compiler doesn't 
rematerialize a pointer access. That also would clarify the fact that we 
access something unsafe outside a lock.

			Linus


* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
  2008-11-29 19:19   ` Linus Torvalds
@ 2008-11-29 19:50     ` Ingo Molnar
  0 siblings, 0 replies; 5+ messages in thread
From: Ingo Molnar @ 2008-11-29 19:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, linux-kernel, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 26 Nov 2008, Steven Rostedt wrote:
> >  {
> >  	struct rq *rq = cpu_rq(cpu);
> > +	unsigned long nr_running = rq->nr_running;
> >  
> > -	if (rq->nr_running)
> > -		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
> > +	if (nr_running)
> > +		rq->avg_load_per_task = rq->load.weight / nr_running;
> >  	else
> >  		rq->avg_load_per_task = 0;
> 
> I don't think this necessarily fixes it.
> 
> There's nothing that keeps gcc from deciding not to reload 
> rq->nr_running.
> 
> Of course, in _practice_, I don't think gcc ever will (if it decides 
> that it will spill, gcc is likely going to decide that it will 
> literally spill the local variable to the stack rather than decide to 
> reload off the pointer), but it's a valid compiler optimization, and it 
> even has a name (rematerialization).
> 
> So I suspect that your patch does fix the bug, but it still leaves the 
> fairly unlikely _potential_ for it to re-appear at some point.
> 
> We have ACCESS_ONCE() as a macro to guarantee that the compiler doesn't 
> rematerialize a pointer access. That also would clarify the fact that 
> we access something unsafe outside a lock.

Okay - i've queued up the fix below, to be on the safe side.

	Ingo

---------------->
From af6d596fd603219b054c1c90fb16672a9fd441bd Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Sat, 29 Nov 2008 20:45:15 +0100
Subject: [PATCH] sched: prevent divide by zero error in cpu_avg_load_per_task, update

Regarding the bug addressed in:

  4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

Linus points out that the fix is not complete:

> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.

So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 700aa9a..b7480fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1453,7 +1453,7 @@ static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = rq->nr_running;
+	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
 		rq->avg_load_per_task = rq->load.weight / nr_running;

