Re: [RFC] sched: Limit idle_balance() when it is being used too frequently

From: Jason Low <jason.low2@hp.com>
To: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Mike Galbraith <efault@gmx.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Paul Turner <pjt@google.com>, Alex Shi <alex.shi@intel.com>,
	Preeti U Murthy <preeti@linux.vnet.ibm.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Namhyung Kim <namhyung@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Kees Cook <keescook@chromium.org>, Mel Gorman <mgorman@suse.de>,
	aswin@hp.com, scott.norton@hp.com, chegu_vinod@hp.com
Subject: Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
Date: Thu, 18 Jul 2013 12:06:39 -0700
Message-ID: <1374174399.1792.42.camel@j-VirtualBox>
In-Reply-To: <51E7D89A.8010009@redhat.com>

On Thu, 2013-07-18 at 07:59 -0400, Rik van Riel wrote:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> >> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> >> to set N to more than 20 in order to get the big performance gains.
> >>
> >> One thing that I thought of was to have N be based on how often idle
> >> balance attempts fail to pull any task(s).
> >>
> >> For example, N can be calculated based on the number of idle balance
> >> attempts for the CPU since the last "successful" idle balance attempt.
> >> So if the previous 30 idle balance attempts resulted in no tasks moved,
> >> then N = 30 / 5 = 6. So idle balance gets less time to run as the number
> >> of unneeded idle balance attempts increases, and thus N will not be set
> >> too high during situations where idle balancing is "successful" more often.
> >> Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such high N.
> > But yes that might work.
> 
> I have some idea, though no proof :)
> 
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.
> 
> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

I collected some perf samples while running the fserver workload with
N=1 and with N=60.

N = 1
-----
19.21%  reaim    [k] __read_lock_failed
14.79%  reaim    [k] mspin_lock
12.19%  reaim    [k] __write_lock_failed
 7.87%  reaim    [k] _raw_spin_lock
 2.03%  reaim    [k] start_this_handle
 1.98%  reaim    [k] update_sd_lb_stats
 1.92%  reaim    [k] mutex_spin_on_owner
 1.86%  reaim    [k] update_cfs_rq_blocked_load
 1.14%  swapper  [k] intel_idle
 1.10%  reaim    [.] add_long
 1.09%  reaim    [.] add_int
 1.08%  reaim    [k] load_balance
N = 60
------
 7.70%  reaim    [k] _raw_spin_lock
 7.25%  reaim    [k] mspin_lock
 6.30%  reaim    [.] add_long
 6.26%  reaim    [.] add_int
 4.05%  reaim    [.] strncat
 3.81%  reaim    [.] string_rtns_1
 3.66%  reaim    [.] div_long
 3.44%  reaim    [k] mutex_spin_on_owner
 2.91%  reaim    [.] add_short
 2.73%  swapper  [k] intel_idle
 2.65%  reaim    [k] __read_lock_failed

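With idle_balance() running more often (N = 1), we get more contention
in kernel functions such as update_sd_lb_stats(), load_balance(), and
spin_lock() for the rq lock. Additionally, it sharply increases the time
spent in the mutex code's mspin_lock() (14.79% at N = 1 versus 7.25% at
N = 60), __read_lock_failed() (19.21% versus 2.65%), and
__write_lock_failed() (12.19% at N = 1, absent from the N = 60 samples
listed above).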

N needs to be large because avg_idle is still much higher than the
average time spent in each load_balance() call per sched domain. Despite
that high ratio of avg_idle to time spent in load_balance(),
load_balance() still adds quite a bit to the time spent in the kernel,
probably because of how often it is called.
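
To make the heuristic under discussion concrete, here is a minimal
userspace sketch in C. It is an illustration only, not the patch from
this thread: struct idle_balance_state, its fields, and all three
helper functions are hypothetical names, the check "skip when avg_idle
is less than N times the average load_balance() cost" is my reading of
how N is used, and flooring N at 1 is an added assumption. The "/ 5"
step is the one from the example quoted earlier in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU bookkeeping; NOT the kernel's struct rq. */
struct idle_balance_state {
	uint64_t avg_idle_ns;         /* average length of an idle period */
	uint64_t avg_balance_cost_ns; /* average cost of one load_balance() pass */
	unsigned int failed_attempts; /* idle balances since the last one that pulled a task */
};

/* The "/ 5" from the quoted example: 30 failed attempts give N = 6. */
#define FAILED_ATTEMPTS_PER_STEP 5u

/* Derive N from the run of unsuccessful idle balance attempts. */
static unsigned int idle_balance_scale(const struct idle_balance_state *s)
{
	unsigned int n;

	assert(s);
	n = s->failed_attempts / FAILED_ATTEMPTS_PER_STEP;
	return n ? n : 1; /* assumption: never drop below N = 1 */
}

/*
 * Assumed policy: only attempt idle balancing when the expected idle
 * time is at least N times the average cost of a balance pass.
 */
static bool should_idle_balance(const struct idle_balance_state *s)
{
	uint64_t n = idle_balance_scale(s);

	return s->avg_idle_ns >= n * s->avg_balance_cost_ns;
}

/* Update the failure counter after an idle balance attempt. */
static void record_idle_balance(struct idle_balance_state *s,
				unsigned int pulled_tasks)
{
	if (pulled_tasks)
		s->failed_attempts = 0; /* a "successful" attempt resets N */
	else
		s->failed_attempts++;
}
```

With this shape, a CPU whose idle balance attempts keep failing sees N
grow and so needs a proportionally longer avg_idle before it tries
again, while a single successful pull drops N back to its minimum.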

Jason
