From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753524AbYDWHwD (ORCPT ); Wed, 23 Apr 2008 03:52:03 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751667AbYDWHvy (ORCPT ); Wed, 23 Apr 2008 03:51:54 -0400 Received: from mx2.mail.elte.hu ([157.181.151.9]:45567 "EHLO mx2.mail.elte.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751604AbYDWHvw (ORCPT ); Wed, 23 Apr 2008 03:51:52 -0400 Date: Wed, 23 Apr 2008 09:51:40 +0200 From: Ingo Molnar To: David Miller Cc: linux-kernel@vger.kernel.org, tglx@linutronix.de, Peter Zijlstra Subject: Re: Soft lockup regression from today's sched.git merge. Message-ID: <20080423075140.GA20980@elte.hu> References: <20080422.015931.70144614.davem@davemloft.net> <20080422091456.GC9939@elte.hu> <20080422.224212.32058127.davem@davemloft.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20080422.224212.32058127.davem@davemloft.net> User-Agent: Mutt/1.5.17 (2007-11-01) X-ELTE-VirusStatus: clean X-ELTE-SpamScore: -1.5 X-ELTE-SpamLevel: X-ELTE-SpamCheck: no X-ELTE-SpamVersion: ELTE 2.0 X-ELTE-SpamCheck-Details: score=-1.5 required=5.9 tests=BAYES_00 autolearn=no SpamAssassin version=3.2.3 -1.5 BAYES_00 BODY: Bayesian spam probability is 0 to 1% [score: 0.0000] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * David Miller wrote: > From: Ingo Molnar > Date: Tue, 22 Apr 2008 11:14:56 +0200 > > > thanks for reporting it. I havent seen this false positive happen in a > > long time - but then again, PC CPUs are a lot less idle than a 128-CPU > > Niagara2 :-/ I'm wondering what the best method would be to provoke a > > CPU to stay idle that long - to make sure this bug is fixed. > > I looked more closely at this. > > There is no way the patch in question can work properly. > > The algorithm is, essentialy "if time - prev_cpu_time is large enough, > call __sync_cpu_clock()" which if fine, except that nothing ever sets > prev_cpu_time. you are right - and this causes us to hit that global spinlock every time cpu_clock() is called. Note that only debugging code uses cpu_clock() though (softlockup watchdog, blktrace, etc.) - but you are right that the performance bug should be fixed - the patch below fixes the bogosity. the second patch below makes time_sync_thresh available to architecture code - that way, if your platform has a guaranteed-synchronous sched_clock(), you can set that to a larger value (or just -1LL for infinite). this problem cannot explain the softlockup bug though. Ingo ----------------------------> Subject: sched: fix cpu clock From: Ingo Molnar Date: Wed Apr 23 09:24:06 CEST 2008 David Miller pointed it out that nothing in cpu_clock() sets prev_cpu_time. This caused __sync_cpu_clock() to be called all the time - against the intention of this code. The result was that in practice we hit a global spinlock every time cpu_clock() is called - which - even though cpu_clock() is used for tracing and debugging, is suboptimal. Reported-by: David Miller Signed-off-by: Ingo Molnar --- kernel/sched.c | 2 ++ 1 file changed, 2 insertions(+) Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -1001,6 +1001,8 @@ unsigned long long notrace cpu_clock(int if (unlikely(delta_time > time_sync_thresh)) time = __sync_cpu_clock(time, cpu); + per_cpu(prev_cpu_time, cpu) = time; + return time; } EXPORT_SYMBOL_GPL(cpu_clock); Subject: sched: make clock sync tunable by architecture code From: Ingo Molnar Date: Wed Apr 23 09:31:35 CEST 2008 make time_sync_thresh tunable to architecture code. Signed-off-by: Ingo Molnar --- include/linux/sched.h | 2 ++ kernel/sched.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) Index: linux/include/linux/sched.h =================================================================== --- linux.orig/include/linux/sched.h +++ linux/include/linux/sched.h @@ -159,6 +159,8 @@ print_cfs_rq(struct seq_file *m, int cpu } #endif +extern unsigned long long time_sync_thresh; + /* * Task state bitmask. NOTE! These bits are also * encoded in fs/proc/array.c: get_task_state(). Index: linux/kernel/sched.c =================================================================== --- linux.orig/kernel/sched.c +++ linux/kernel/sched.c @@ -924,7 +924,7 @@ static inline u64 global_rt_runtime(void return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC; } -static const unsigned long long time_sync_thresh = 100000; +unsigned long long time_sync_thresh = 100000; static DEFINE_PER_CPU(unsigned long long, time_offset); static DEFINE_PER_CPU(unsigned long long, prev_cpu_time);