Subject: Re: [patch 5/6] sched: disable sched feature TTWU_QUEUE by default
From: Mike Galbraith
To: Suresh Siddha
Cc: Peter Zijlstra, Ingo Molnar, Venki Pallipadi, Srivatsa Vaddagiri,
    linux-kernel, Tim Chen, alex.shi@intel.com
Date: Sat, 19 Nov 2011 05:41:20 +0100
Message-ID: <1321677680.6307.15.camel@marge.simson.net>
In-Reply-To: <1321677009.6307.13.camel@marge.simson.net>
References: <20111118230323.592022417@sbsiddha-desk.sc.intel.com>
            <20111118230554.105376150@sbsiddha-desk.sc.intel.com>
            <1321677009.6307.13.camel@marge.simson.net>

On Sat, 2011-11-19 at 05:30 +0100, Mike Galbraith wrote:
> On Fri, 2011-11-18 at 15:03 -0800, Suresh Siddha wrote:
> > plain text document attachment (disable_sched_ttwu_queue.patch)
> > A context-switch intensive microbenchmark on an 8-socket system had
> > ~600K times more resched IPIs on each logical CPU with this feature
> > enabled by default. Disabling this feature makes that microbenchmark
> > perform 5 times better.
> >
> > Disabling this feature also showed a 2% performance improvement on an
> > 8-socket OLTP workload.
> >
> > More heuristics are needed on when and how to use this feature by
> > default. For now, disable it by default.
>
> Yeah, the overhead for very hefty switchers is high enough to increase
> TCP_RR latency by up to 13% in my testing. I used a trylock() to
> generally avoid that cost, but keep the contended-case improvement
> intact.
>
> Peter suggested trying the IPI only when crossing cache boundaries,
> which worked for me as well.

On a related TTWU_QUEUE note, I was pondering idle_balance().
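
FWIW, the trylock variant mentioned above looks roughly like the below.
This is only an illustrative sketch, not the actual patch; the shape of
ttwu_queue() / ttwu_queue_remote() / ttwu_do_activate() is approximated
from kernel/sched.c and details may differ:

static void ttwu_queue(struct task_struct *p, int cpu)
{
	struct rq *rq = cpu_rq(cpu);

#ifdef CONFIG_SMP
	if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
		/*
		 * Only pay for the resched IPI when the remote rq lock
		 * is actually contended; an uncontended trylock lets us
		 * activate the task directly and skip the IPI.
		 */
		if (!raw_spin_trylock(&rq->lock)) {
			ttwu_queue_remote(p, cpu);
			return;
		}
		ttwu_do_activate(rq, p, 0);
		raw_spin_unlock(&rq->lock);
		return;
	}
#endif

	raw_spin_lock(&rq->lock);
	ttwu_do_activate(rq, p, 0);
	raw_spin_unlock(&rq->lock);
}

The cache-boundary variant Peter suggested would instead gate
ttwu_queue_remote() on whether waker and wakee CPU share a last-level
cache, so only cross-cache wakeups pay for the IPI.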
---
 kernel/sched_fair.c |   25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

Index: linux-3.0/kernel/sched_fair.c
===================================================================
--- linux-3.0.orig/kernel/sched_fair.c
+++ linux-3.0/kernel/sched_fair.c
@@ -3500,8 +3500,7 @@ out:
 static void idle_balance(int this_cpu, struct rq *this_rq)
 {
 	struct sched_domain *sd;
-	int pulled_task = 0;
-	unsigned long next_balance = jiffies + HZ;
+	unsigned long next_balance;
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost)
 		return;
@@ -3512,33 +3511,41 @@ static void idle_balance(int this_cpu, s
 	raw_spin_unlock(&this_rq->lock);
 
 	update_shares(this_cpu);
+	next_balance = jiffies + HZ;
 	rcu_read_lock();
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int balance = 1;
 
+		if (this_rq->nr_running || this_rq->wake_list)
+			break;
+
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
-		if (sd->flags & SD_BALANCE_NEWIDLE) {
-			/* If we've pulled tasks over stop searching: */
-			pulled_task = load_balance(this_cpu, this_rq,
-						   sd, CPU_NEWLY_IDLE, &balance);
-		}
+		if (!(sd->flags & SD_BALANCE_NEWIDLE))
+			continue;
+
+		load_balance(this_cpu, this_rq, sd, CPU_NEWLY_IDLE, &balance);
 
 		interval = msecs_to_jiffies(sd->balance_interval);
 		if (time_after(next_balance, sd->last_balance + interval))
 			next_balance = sd->last_balance + interval;
-		if (pulled_task) {
+		if (this_rq->nr_running || this_rq->wake_list) {
 			this_rq->idle_stamp = 0;
 			break;
 		}
 	}
 	rcu_read_unlock();
 
+	/* IPI in flight? Let it happen. */
+	if (unlikely(this_rq->wake_list)) {
+		local_irq_enable();
+		local_irq_disable();
+	}
+
 	raw_spin_lock(&this_rq->lock);
 
-	if (pulled_task || time_after(jiffies, this_rq->next_balance)) {
+	if (this_rq->nr_running || time_after(jiffies, this_rq->next_balance)) {
 		/*
 		 * We are going idle. next_balance may be set based on
 		 * a busy processor. So reset next_balance.
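
For background on the wake_list tests in the patch: with TTWU_QUEUE, a
remote wakeup is linked onto the target CPU's rq->wake_list and
activated from scheduler_ipi(). The drain looks roughly like the below
(approximated from memory, not quoted verbatim from kernel/sched.c):

static void sched_ttwu_pending(void)
{
	struct rq *rq = this_rq();
	/* grab the whole list of pending wakeups atomically */
	struct task_struct *list = xchg(&rq->wake_list, NULL);

	if (!list)
		return;

	raw_spin_lock(&rq->lock);
	while (list) {
		struct task_struct *p = list;

		list = list->wake_entry;
		/* enqueue the woken task and mark it runnable on this rq */
		ttwu_do_activate(rq, p, 0);
	}
	raw_spin_unlock(&rq->lock);
}

That is what the local_irq_enable()/local_irq_disable() pair above is
for: it gives a pending scheduler IPI a chance to land and activate its
task before we retake rq->lock, so idle_balance() then sees nr_running
and bails out instead of pulling a task over an already-queued wakeup.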