* cpu_clock() in NMIs
From: David Miller
Date: 2009-12-14 2:25 UTC
To: mingo; +Cc: tglx, peterz, linux-kernel

The background is that I was trying to resolve a sparc64 perf issue
when I discovered this problem.

On sparc64 I implement pseudo-NMIs by simply running the kernel at IRQ
level 14 when local_irq_disable() is called; this allows performance
counter events to still come in at IRQ level 15.

This doesn't work if any code in an NMI handler does local_irq_save()
or local_irq_disable(), since the "disable" will kick us back to cpu
IRQ level 14, letting NMIs back in, and we recurse.

The only path that does that in the perf event IRQ handling is the
code supporting frequency-based events. It uses cpu_clock().

cpu_clock() simply invokes sched_clock() with IRQs disabled.

And that's a fundamental bug all on its own, particularly for the
HAVE_UNSTABLE_SCHED_CLOCK case: NMIs can thus get into the
sched_clock() code and interrupt its local-IRQ-disabled sections.

Furthermore, for the non-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ
disabling done by cpu_clock() is pure overhead and completely
unnecessary.

So the core problem is that sched_clock() is not NMI safe, but we are
invoking it from NMI contexts in the perf events code (via
cpu_clock()). A less important issue is the overhead of IRQ disabling
when it isn't necessary in cpu_clock().

Maybe something simple like the patch below to handle that.
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 479ce56..5b49613 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -236,6 +236,18 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
+unsigned long long cpu_clock(int cpu)
+{
+	unsigned long long clock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	clock = sched_clock_cpu(cpu);
+	local_irq_restore(flags);
+
+	return clock;
+}
+
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
 void sched_clock_init(void)
@@ -251,17 +263,12 @@ u64 sched_clock_cpu(int cpu)
 	return sched_clock();
 }
 
-#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
-
 unsigned long long cpu_clock(int cpu)
 {
-	unsigned long long clock;
-	unsigned long flags;
+	return sched_clock_cpu(cpu);
+}
 
-	local_irq_save(flags);
-	clock = sched_clock_cpu(cpu);
-	local_irq_restore(flags);
+#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
-	return clock;
-}
 EXPORT_SYMBOL_GPL(cpu_clock);
* Re: cpu_clock() in NMIs
From: Peter Zijlstra
Date: 2009-12-14 9:02 UTC
To: David Miller; +Cc: mingo, tglx, linux-kernel

On Sun, 2009-12-13 at 18:25 -0800, David Miller wrote:
> The background is that I was trying to resolve a sparc64 perf
> issue when I discovered this problem.
>
> On sparc64 I implement pseudo-NMIs by simply running the kernel
> at IRQ level 14 when local_irq_disable() is called; this allows
> performance counter events to still come in at IRQ level 15.
>
> This doesn't work if any code in an NMI handler does local_irq_save()
> or local_irq_disable(), since the "disable" will kick us back to cpu
> IRQ level 14, letting NMIs back in, and we recurse.
>
> The only path that does that in the perf event IRQ handling is
> the code supporting frequency-based events. It uses cpu_clock().
>
> cpu_clock() simply invokes sched_clock() with IRQs disabled.
>
> And that's a fundamental bug all on its own, particularly for the
> HAVE_UNSTABLE_SCHED_CLOCK case: NMIs can thus get into the
> sched_clock() code and interrupt its local-IRQ-disabled sections.
>
> Furthermore, for the non-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ
> disabling done by cpu_clock() is pure overhead and completely
> unnecessary.
>
> So the core problem is that sched_clock() is not NMI safe, but we
> are invoking it from NMI contexts in the perf events code (via
> cpu_clock()).
>
> A less important issue is the overhead of IRQ disabling when it isn't
> necessary in cpu_clock(). Maybe something simple like the patch below
> to handle that.

I'm not sure; traditionally sched_clock() was always called with IRQs
disabled, and on e.g. x86 that is needed because we read the TSC and
then scale that value depending on CPUfreq state, which can be changed
by a CPUfreq interrupt. Allowing NMIs in between isn't really a
problem; allowing IRQs in is.

Now, the SPARC implementation might be good without IRQs disabled, but
we should at least look at all the other arches before we do what you
propose below, as it removes the IRQ disable from the call sites,
which previously always had it.
* Re: cpu_clock() in NMIs
From: David Miller
Date: 2009-12-14 19:09 UTC
To: peterz; +Cc: mingo, tglx, linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 14 Dec 2009 10:02:39 +0100

> I'm not sure; traditionally sched_clock() was always called with IRQs
> disabled, and on e.g. x86 that is needed because we read the TSC and
> then scale that value depending on CPUfreq state, which can be changed
> by a CPUfreq interrupt. Allowing NMIs in between isn't really a
> problem; allowing IRQs in is.

Ok, that looks fine then.

But, speaking more generally, any local_irq_{disable,save}() done in
an NMI handler has a high probability of being a bug. And it is a bug
if the IRQ disabling is there to prevent re-entry of the code on the
local cpu.

> Now, the SPARC implementation might be good without IRQs disabled, but
> we should at least look at all the other arches before we do what you
> propose below, as it removes the IRQ disable from the call sites,
> which previously always had it.

Here's a quick audit:

1) The generic kernel/sched_clock.c implementation does a subtraction
   of jiffies with a constant, then does some constant math on it.
   Should be OK.

2) sparc64 is fine: it just reads a register, multiplies by a
   boot-time-calculated value, then shifts down by a constant.
   Should be OK.

3) ARM mach-mmp, same situation as sparc64.

4) ARM mach-pxa, same situation as sparc64.

5) ARM mach-realview, multiplies the counter by a constant then
   divides by one; should be OK.

6) ARM mach-sa1100, same as mach-realview.

7) ARM mach-u300, uses a boot-time-computed multiplier and shift;
   should be OK.

8) ARM mach-versatile, same as mach-realview.

9) ARM plat-iop, same as mach-u300.

10) ARM plat-omap, same as mach-u300.

11) ARM plat-orion, same as sparc64.

12) Blackfin: the TS version is the same as sparc64.

    The non-TS version is an unnecessary duplication of the generic
    weak function version in kernel/sched_clock.c and could be deleted.

13) CRIS, another dup of the kernel/sched_clock.c weak function.

14) FRV, another dup of the kernel/sched_clock.c weak function.

15) IA64, same as sparc64 for the native version.

    The paravirt version uses preemption disable, but also relies on
    IA64 always setting HAVE_UNSTABLE_SCHED_CLOCK (which it does), so
    IA64 would still disable interrupts with my change.

16) m68knommu coldfire, multiplies by a constant then shifts down by
    one; should be OK.

17) mn10300, same as sparc64.

18) powerpc: the non-__USE_RTC() case is the same as sparc64, and the
    __USE_RTC() case also looks fine.

19) s390 depends upon preemption being disabled so that stop_machine
    does not interrupt the sched_clock() call; should be OK.

20) x86 sets HAVE_UNSTABLE_SCHED_CLOCK.

It should be safe.
* Re: cpu_clock() in NMIs
From: Peter Zijlstra
Date: 2009-12-14 19:32 UTC
To: David Miller; +Cc: mingo, tglx, linux-kernel

On Mon, 2009-12-14 at 11:09 -0800, David Miller wrote:
> Ok, that looks fine then.
>
> But, speaking more generally, any local_irq_{disable,save}() done in
> an NMI handler has a high probability of being a bug. And it is a bug
> if the IRQ disabling is there to prevent re-entry of the code on the
> local cpu.
>
> Here's a quick audit:
>
> 1) The generic kernel/sched_clock.c implementation does a subtraction
>    of jiffies with a constant, then does some constant math on it.
>    Should be OK.
>
> 2) sparc64 is fine: it just reads a register, multiplies by a
>    boot-time-calculated value, then shifts down by a constant.
>    Should be OK.
>
> 3) ARM mach-mmp, same situation as sparc64.
>
> 4) ARM mach-pxa, same situation as sparc64.
>
> 5) ARM mach-realview, multiplies the counter by a constant then
>    divides by one; should be OK.
>
> 6) ARM mach-sa1100, same as mach-realview.
>
> 7) ARM mach-u300, uses a boot-time-computed multiplier and shift;
>    should be OK.
>
> 8) ARM mach-versatile, same as mach-realview.
>
> 9) ARM plat-iop, same as mach-u300.
>
> 10) ARM plat-omap, same as mach-u300.
>
> 11) ARM plat-orion, same as sparc64.
>
> 12) Blackfin: the TS version is the same as sparc64; the non-TS
>     version is an unnecessary duplication of the generic weak
>     function version in kernel/sched_clock.c and could be deleted.
>
> 13) CRIS, another dup of the kernel/sched_clock.c weak function.
>
> 14) FRV, another dup of the kernel/sched_clock.c weak function.
>
> 15) IA64, same as sparc64 for the native version. The paravirt
>     version uses preemption disable, but also relies on IA64 always
>     setting HAVE_UNSTABLE_SCHED_CLOCK (which it does), so IA64 would
>     still disable interrupts with my change.
>
> 16) m68knommu coldfire, multiplies by a constant then shifts down by
>     one; should be OK.
>
> 17) mn10300, same as sparc64.
>
> 18) powerpc: the non-__USE_RTC() case is the same as sparc64, and
>     the __USE_RTC() case also looks fine.
>
> 19) s390 depends upon preemption being disabled so that stop_machine
>     does not interrupt the sched_clock() call; should be OK.
>
> 20) x86 sets HAVE_UNSTABLE_SCHED_CLOCK.
>
> It should be safe.

OK, you convinced me plenty ;-)

I guess we need some debug code to ensure we don't grow any
local_irq_save/restore/disable code in NMI paths, and maybe document
this some place.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
* Re: cpu_clock() in NMIs
From: David Miller
Date: 2009-12-15 5:36 UTC
To: peterz; +Cc: mingo, tglx, linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 14 Dec 2009 20:32:50 +0100

> OK, you convinced me plenty ;-)
>
> I guess we need some debug code to ensure we don't grow any
> local_irq_save/restore/disable code in NMI paths, and maybe document
> this some place.
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Ingo and others: I'd like to get this kernel/sched_clock.c IRQ
disabling patch into Linus's tree and then into -stable, because the
fix for sparc64 perf depends upon it.

Thanks.
* Re: cpu_clock() in NMIs
From: Mike Frysinger
Date: 2009-12-14 19:57 UTC
To: David Miller; +Cc: peterz, mingo, tglx, linux-kernel

On Mon, Dec 14, 2009 at 14:09, David Miller wrote:
> 12) Blackfin: the TS version is the same as sparc64; the non-TS
>     version is an unnecessary duplication of the generic weak
>     function version in kernel/sched_clock.c and could be deleted.

thanks, i'll take care of this
-mike
* [tip:sched/urgent] sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
From: tip-bot for David Miller
Date: 2009-12-15 8:11 UTC
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, davem, tglx, mingo

Commit-ID:  b9f8fcd55bbdb037e5332dbdb7b494f0b70861ac
Gitweb:     http://git.kernel.org/tip/b9f8fcd55bbdb037e5332dbdb7b494f0b70861ac
Author:     David Miller <davem@davemloft.net>
AuthorDate: Sun, 13 Dec 2009 18:25:02 -0800
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 15 Dec 2009 09:04:36 +0100

sched: Fix cpu_clock() in NMIs, on !CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

Relax stable-sched-clock architectures to not save/disable/restore
hardirqs in cpu_clock().

The background is that I was trying to resolve a sparc64 perf issue
when I discovered this problem.

On sparc64 I implement pseudo-NMIs by simply running the kernel at IRQ
level 14 when local_irq_disable() is called; this allows performance
counter events to still come in at IRQ level 15.

This doesn't work if any code in an NMI handler does local_irq_save()
or local_irq_disable(), since the "disable" will kick us back to cpu
IRQ level 14, letting NMIs back in, and we recurse.

The only path that does that in the perf event IRQ handling is the
code supporting frequency-based events. It uses cpu_clock().

cpu_clock() simply invokes sched_clock() with IRQs disabled.

And that's a fundamental bug all on its own, particularly for the
HAVE_UNSTABLE_SCHED_CLOCK case: NMIs can thus get into the
sched_clock() code and interrupt its local-IRQ-disabled sections.

Furthermore, for the non-HAVE_UNSTABLE_SCHED_CLOCK case, the IRQ
disabling done by cpu_clock() is pure overhead and completely
unnecessary.

So the core problem is that sched_clock() is not NMI safe, but we are
invoking it from NMI contexts in the perf events code (via
cpu_clock()). A less important issue is the overhead of IRQ disabling
when it isn't necessary in cpu_clock().

CONFIG_HAVE_UNSTABLE_SCHED_CLOCK architectures are not affected by
this patch.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091213.182502.215092085.davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched_clock.c |   23 +++++++++++++++--------
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 479ce56..5b49613 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -236,6 +236,18 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 
+unsigned long long cpu_clock(int cpu)
+{
+	unsigned long long clock;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	clock = sched_clock_cpu(cpu);
+	local_irq_restore(flags);
+
+	return clock;
+}
+
 #else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
 void sched_clock_init(void)
@@ -251,17 +263,12 @@ u64 sched_clock_cpu(int cpu)
 	return sched_clock();
 }
 
-#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
-
 unsigned long long cpu_clock(int cpu)
 {
-	unsigned long long clock;
-	unsigned long flags;
+	return sched_clock_cpu(cpu);
+}
 
-	local_irq_save(flags);
-	clock = sched_clock_cpu(cpu);
-	local_irq_restore(flags);
+#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
 
-	return clock;
-}
 EXPORT_SYMBOL_GPL(cpu_clock);