Date: Sat, 4 Dec 2010 22:33:05 +0800
From: Yong Zhang
To: "Bjoern B. Brandenburg"
Cc: Mike Galbraith, Peter Zijlstra, Ingo Molnar, Andrea Bastoni,
	"James H. Anderson", linux-kernel@vger.kernel.org
Subject: Re: Scheduler bug related to rq->skip_clock_update?
Message-ID: <20101204143305.GD2295@zhy>
Reply-To: Yong Zhang
References: <1290359641.4816.69.camel@maggy.simson.net>
	<1290442781.16393.22.camel@maggy.simson.net>
	<20101204140816.GC2295@zhy>
In-Reply-To: <20101204140816.GC2295@zhy>
User-Agent: Mutt/1.5.20 (2009-06-14)

On Sat, Dec 04, 2010 at 10:08:16PM +0800, Yong Zhang wrote:
> On Mon, Nov 22, 2010 at 01:14:47PM -0500, Bjoern B. Brandenburg wrote:
> > On Mon, 22 Nov 2010, Mike Galbraith wrote:
> >
> > > On Sun, 2010-11-21 at 23:29 -0500, Bjoern B. Brandenburg wrote:
> > > > On Sun, 21 Nov 2010, Mike Galbraith wrote:
> > > >
> > > > > On Sat, 2010-11-20 at 23:22 -0500, Bjoern B. Brandenburg wrote:
> > > > >
> > > > > > I was under the impression that, as an invariant, tasks should not have
> > > > > > TIF_NEED_RESCHED set after they've blocked. In this case, the idle load
> > > > > > balancer should not mark the task that's on its way out with
> > > > > > set_tsk_need_resched().
> > > > >
> > > > > Nice find.
> > > > >
> > > > > > In any case, check_preempt_curr() seems to assume that a resuming task cannot
> > > > > > have TIF_NEED_RESCHED already set. Setting skip_clock_update on a remote CPU
> > > > > > that hasn't even been notified via IPI seems wrong.
> > > > >
> > > > > Yes. Does the below fix it up for you?
> > > >
> > > > The patch definitely changes the behavior, but it doesn't seem to solve (all
> > > > of) the root cause(s). The failsafe kicks in and clears the flag the next
> > > > time that update_rq_clock() is called, but there can still be a significant
> > > > delay between setting and clearing the flag. Right after boot, I'm now seeing
> > > > values that go up to ~21ms.
> > >
> > > A pull isn't the only vulnerability. Since idle_balance() drops
> > > rq->lock, another cpu can also wake a task onto this rq.
> > >
> > > > Please let me know if there is something else that I should test.
> > >
> > > Sched: clear_tsk_need_resched() after NEWIDLE balancing
> > >
> > > idle_balance() drops/retakes rq->lock, leaving the previous task
> > > vulnerable to set_tsk_need_resched() from another CPU. Clear it
> > > after NEWIDLE balancing to maintain the invariant that descheduled
> > > tasks are NOT marked for resched.
> > >
> > > This also confuses the skip_clock_update logic, which assumes that
> > > the next call to update_rq_clock() will come nearly immediately after
> > > being set. Make the optimization more robust by clearing before we
> > > balance and in update_rq_clock().
> >
> > Unfortunately that doesn't seem to do it yet.
> >
> > After running five 'find /' instances to completion on the ARM platform,
> > I'm still seeing delays close to 10ms.
>
> Because skip_clock_recent_max/skip_clock_max are in nanoseconds,
> the delay should be 10us, right?

Ohhhh. My head is twisted. Sorry for the noise.
I should go to sleep now to clear my head. ;)

Thanks,
Yong

> >
> > bbb@district10:~$ egrep 'cpu#|skip' /proc/sched_debug
> > cpu#0
> >   .skip_clock_count              : 89606
> >   .skip_clock_recent_max         : 9817250
> >   .skip_clock_max                : 21992375
> > cpu#1
> >   .skip_clock_count              : 81978
> >   .skip_clock_recent_max         : 9582500
> >   .skip_clock_max                : 17201750
> > cpu#2
> >   .skip_clock_count              : 74565
> >   .skip_clock_recent_max         : 9678000
> >   .skip_clock_max                : 9879250
> > cpu#3
> >   .skip_clock_count              : 81685
> >   .skip_clock_recent_max         : 9300125
> >   .skip_clock_max                : 14115750
> >
> > On the x86_64 host, I've changed to HZ=100 and am now also seeing delays
> > close to 10ms after 'make clean && make -j8 bzImage'.
>
> ditto.
>
> > bbb@koruna:~$ egrep 'cpu#|skip' /proc/sched_debug
> > cpu#0, 2493.476 MHz
> >   .skip_clock_count              : 29703
> >   .skip_clock_recent_max         : 9999858
> >   .skip_clock_max                : 40645942
> > cpu#1, 2493.476 MHz
> >   .skip_clock_count              : 32696
> >   .skip_clock_recent_max         : 9959118
> >   .skip_clock_max                : 35074771
> > cpu#2, 2493.476 MHz
> >   .skip_clock_count              : 31742
> >   .skip_clock_recent_max         : 9788654
> >   .skip_clock_max                : 24821765
> > cpu#3, 2493.476 MHz
> >   .skip_clock_count              : 31123
> >   .skip_clock_recent_max         : 9858546
> >   .skip_clock_max                : 44276033
> > cpu#4, 2493.476 MHz
> >   .skip_clock_count              : 28346
> >   .skip_clock_recent_max         : 10000775
> >   .skip_clock_max                : 18681753
> > cpu#5, 2493.476 MHz
> >   .skip_clock_count              : 29421
> >   .skip_clock_recent_max         : 9997656
> >   .skip_clock_max                : 138473407
> > cpu#6, 2493.476 MHz
> >   .skip_clock_count              : 27721
> >   .skip_clock_recent_max         : 9992074
> >   .skip_clock_max                : 53436918
> > cpu#7, 2493.476 MHz
> >   .skip_clock_count              : 29637
> >   .skip_clock_recent_max         : 9994516
> >   .skip_clock_max                : 566793528
> >
> > These numbers were recorded with the below patch.
> >
> > Please let me know if I can help by testing or tracing something else.
> >
> > Thanks,
> > Bjoern
> >
> > ---
> > sched: show length of runqueue clock deactivation in /proc/sched_debug
> >
> > The runqueue clock update should obviously not be skipped for
> > prolonged times (otherwise the consumed time is not tracked
> > correctly). This patch measures the time between setting and clearing
> > the rq->skip_clock_update flag. The maximum observed value is exported
> > in /proc/sched_debug.
> > ---
> >  kernel/sched.c       |   30 +++++++++++++++++++++++++++---
> >  kernel/sched_debug.c |    4 ++++
> >  2 files changed, 31 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 29509e1..08839ce 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -541,6 +541,11 @@ struct rq {
> >  	unsigned long long rq_cpu_time;
> >  	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */
> >
> > +	u64 skip_clock_max;
> > +	u64 skip_clock_recent_max;
> > +	u64 skip_clock_set;
> > +	unsigned int skip_clock_count;
> > +
> >  	/* sys_sched_yield() stats */
> >  	unsigned int yld_count;
> >
> > @@ -639,6 +644,22 @@ static inline struct task_group *task_group(struct task_struct *p)
> >  static u64 irq_time_cpu(int cpu);
> >  static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time);
> >
> > +static void clear_skip_clock_update(struct rq *rq)
> > +{
> > +#ifdef CONFIG_SCHEDSTATS
> > +	if (unlikely(rq->skip_clock_update)) {
> > +		u64 skipped = sched_clock_cpu(cpu_of(rq)) - rq->skip_clock_set;
> > +		rq->skip_clock_max = max(rq->skip_clock_max, skipped);
> > +		/* reset infrequently to expose changes */
> > +		if (!(++rq->skip_clock_count % 10000))
> > +			rq->skip_clock_recent_max = 0;
> > +		rq->skip_clock_recent_max = max(rq->skip_clock_recent_max,
> > +						skipped);
> > +	}
> > +#endif
> > +	rq->skip_clock_update = 0;
> > +}
> > +
> >  inline void update_rq_clock(struct rq *rq)
> >  {
> >  	if (!rq->skip_clock_update) {
> > @@ -652,7 +673,7 @@ inline void update_rq_clock(struct rq *rq)
> >
> >  		sched_irq_time_avg_update(rq, irq_time);
> >  	}
> > -	rq->skip_clock_update = 0;
> > +	clear_skip_clock_update(rq);
> >  }
> >
> >  /*
> > @@ -2130,8 +2151,11 @@ static void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
> >  	 * A queue event has occurred, and we're going to schedule. In
> >  	 * this case, we can save a useless back to back clock update.
> >  	 */
> > -	if (test_tsk_need_resched(rq->curr))
> > +	if (test_tsk_need_resched(rq->curr)) {
> > +		schedstat_set(rq->skip_clock_set,
> > +			      sched_clock_cpu(cpu_of(rq)));
> >  		rq->skip_clock_update = 1;
> > +	}
> >  }
> >
> >  #ifdef CONFIG_SMP
> > @@ -3938,7 +3962,7 @@ need_resched_nonpreemptible:
> >  		 * Clock updates should not be skipped while we're away.
> >  		 */
> >  		clear_tsk_need_resched(prev);
> > -		rq->skip_clock_update = 0;
> > +		clear_skip_clock_update(rq);
> >  	}
> >
> >  	put_prev_task(rq, prev);
> > diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
> > index 2e1b0d1..8924be2 100644
> > --- a/kernel/sched_debug.c
> > +++ b/kernel/sched_debug.c
> > @@ -298,6 +298,10 @@ static void print_cpu(struct seq_file *m, int cpu)
> >
> >  	P(bkl_count);
> >
> > +	P(skip_clock_count);
> > +	P64(skip_clock_recent_max);
> > +	P64(skip_clock_max);
> > +
> >  #undef P
> >  #endif
> >  	print_cfs_stats(m, cpu);
> > --
> > 1.7.3.1
>
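
P.S. Since my confusion above was purely about units: the skip_clock_*
fields are raw nanoseconds, so 9817250 reads as ~9.8 ms and 566793528 as
~567 ms. Below is a small userspace helper that prints the fields in
milliseconds directly. It is only an illustration (not part of either
patch, and the file name is arbitrary); it assumes the
"  .skip_clock_max : <value>" layout produced by your debug patch quoted
above.

/* build with: gcc -O2 -o skipclock skipclock.c */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/sched_debug", "r");

	if (!f) {
		perror("fopen /proc/sched_debug");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		unsigned long long ns;
		char name[64];

		/* keep the per-cpu banner lines for context */
		if (!strncmp(line, "cpu#", 4)) {
			fputs(line, stdout);
			continue;
		}

		/* match lines like "  .skip_clock_max   : 21992375" */
		if (sscanf(line, " .%63[a-z_] : %llu", name, &ns) == 2 &&
		    !strncmp(name, "skip_clock", 10))
			printf("  .%-24s: %llu ns (%.3f ms)\n",
			       name, ns, ns / 1e6);
	}

	fclose(f);
	return 0;
}

With that, the skip_clock_recent_max values from both machines read as
roughly 9.3-10 ms, which matches the "delays close to 10ms" observation,
and the all-time maxima range from ~10 ms up to ~567 ms on cpu#7 of the
x86_64 box.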