* [PATCH 0/1] sched: divide by 0 error
@ 2008-11-27 2:04 Steven Rostedt
From: Steven Rostedt @ 2008-11-27  2:04 UTC
To: linux-kernel
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Peter Zijlstra,
Thomas Gleixner, Gregory Haskins
There exists a race where we can hit a divide by zero in the scheduler
and crash the system.
The following patches are in:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
branch: urgent
Steven Rostedt (1):
sched: prevent divide by zero error in cpu_avg_load_per_task
---
 kernel/sched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)
--
* [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
From: Steven Rostedt @ 2008-11-27  2:04 UTC
To: linux-kernel
Cc: Ingo Molnar, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt

From: Steven Rostedt <rostedt@goodmis.org>

Impact: fix to divide by zero

While testing the branch profiler, I hit this crash:

 divide error: 0000 [#1] PREEMPT SMP
 [...]
 RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
 [...]
 Call Trace:
  <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
  [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
  [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
  [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
  [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
  [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
  [<ffffffff8021047a>] do_softirq+0x94/0x1cd
  [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
  [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
  [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20

The code for cpu_avg_load_per_task has:

	if (rq->nr_running)
		rq->avg_load_per_task = rq->load.weight / rq->nr_running;

The runqueue lock is not held here, and there is nothing that prevents
the rq->nr_running from going to zero after it passes the if condition.

The branch profiler simply made the race window bigger.

This patch saves off the rq->nr_running to a local variable and uses
that for both the condition and the division.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/sched.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9b1e793..700aa9a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1453,9 +1453,10 @@ static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
+	unsigned long nr_running = rq->nr_running;
 
-	if (rq->nr_running)
-		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
+	if (nr_running)
+		rq->avg_load_per_task = rq->load.weight / nr_running;
 	else
 		rq->avg_load_per_task = 0;
-- 
1.5.6.5
* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
From: Ingo Molnar @ 2008-11-27  9:29 UTC
To: Steven Rostedt
Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt

* Steven Rostedt <rostedt@goodmis.org> wrote:

> From: Steven Rostedt <rostedt@goodmis.org>
>
> Impact: fix to divide by zero
>
> While testing the branch profiler, I hit this crash:
>
>  divide error: 0000 [#1] PREEMPT SMP
>  [...]
>  Call Trace:
>   <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
>   [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
>
> The code for cpu_avg_load_per_task has:
>
> 	if (rq->nr_running)
> 		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
>
> The runqueue lock is not held here, and there is nothing that
> prevents the rq->nr_running from going to zero after it passes the
> if condition.
>
> The branch profiler simply made the race window bigger.
>
> This patch saves off the rq->nr_running to a local variable and uses
> that for both the condition and the division.

good catch! Applied to tip/sched/urgent, thanks Steve!

the rebalancer scans remote runqueues without holding the runqueue
lock, for performance reasons, so nr_running indeed has to be loaded
into a local variable here.

I think it could hit anywhere upstream as well, even without the
branch tracer: it depends on the register pressure and GCC's choices
for reloading that register. If, say, FRAME_POINTER is enabled and we
are running with some more aggressive optimization, or just an
older/suckier GCC that does a spurious reload, then this might happen
too.

it's not a classic race: it's an SMP bug that will either trigger or
not trigger at all, depending on compiler behavior.
	Ingo
* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
From: Linus Torvalds @ 2008-11-29 19:19 UTC
To: Steven Rostedt
Cc: linux-kernel, Ingo Molnar, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt

On Wed, 26 Nov 2008, Steven Rostedt wrote:
>  {
>  	struct rq *rq = cpu_rq(cpu);
> +	unsigned long nr_running = rq->nr_running;
>
> -	if (rq->nr_running)
> -		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
> +	if (nr_running)
> +		rq->avg_load_per_task = rq->load.weight / nr_running;
>  	else
>  		rq->avg_load_per_task = 0;

I don't think this necessarily fixes it.

There's nothing that keeps gcc from deciding not to reload
rq->nr_running.

Of course, in _practice_, I don't think gcc ever will (if it decides
that it will spill, gcc is likely going to decide that it will
literally spill the local variable to the stack rather than decide to
reload off the pointer), but it's a valid compiler optimization, and
it even has a name (rematerialization).

So I suspect that your patch does fix the bug, but it still leaves the
fairly unlikely _potential_ for it to re-appear at some point.

We have ACCESS_ONCE() as a macro to guarantee that the compiler
doesn't rematerialize a pointer access. That also would clarify the
fact that we access something unsafe outside a lock.

		Linus
* Re: [PATCH 1/1] sched: prevent divide by zero error in cpu_avg_load_per_task
From: Ingo Molnar @ 2008-11-29 19:50 UTC
To: Linus Torvalds
Cc: Steven Rostedt, linux-kernel, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, Gregory Haskins, Steven Rostedt

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, 26 Nov 2008, Steven Rostedt wrote:
> >  {
> >  	struct rq *rq = cpu_rq(cpu);
> > +	unsigned long nr_running = rq->nr_running;
> >
> > -	if (rq->nr_running)
> > -		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
> > +	if (nr_running)
> > +		rq->avg_load_per_task = rq->load.weight / nr_running;
> >  	else
> >  		rq->avg_load_per_task = 0;
>
> I don't think this necessarily fixes it.
>
> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify the
> fact that we access something unsafe outside a lock.

Okay - i've queued up the fix below, to be on the safe side.
	Ingo

---------------->
From af6d596fd603219b054c1c90fb16672a9fd441bd Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Sat, 29 Nov 2008 20:45:15 +0100
Subject: [PATCH] sched: prevent divide by zero error in cpu_avg_load_per_task, update

Regarding the bug addressed in:

  4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

Linus points out that the fix is not complete:

> There's nothing that keeps gcc from deciding not to reload
> rq->nr_running.
>
> Of course, in _practice_, I don't think gcc ever will (if it decides
> that it will spill, gcc is likely going to decide that it will
> literally spill the local variable to the stack rather than decide to
> reload off the pointer), but it's a valid compiler optimization, and
> it even has a name (rematerialization).
>
> So I suspect that your patch does fix the bug, but it still leaves the
> fairly unlikely _potential_ for it to re-appear at some point.
>
> We have ACCESS_ONCE() as a macro to guarantee that the compiler
> doesn't rematerialize a pointer access. That also would clarify
> the fact that we access something unsafe outside a lock.

So make sure our nr_running value is immutable and cannot change
after we check it for nonzero.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 700aa9a..b7480fb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1453,7 +1453,7 @@ static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = rq->nr_running;
+	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
 
 	if (nr_running)
 		rq->avg_load_per_task = rq->load.weight / nr_running;