Re: [patch v3 0/8] sched: use runnable avg in load balance


From: Michael Wang <wangyun@linux.vnet.ibm.com>
To: Alex Shi <alex.shi@intel.com>
Cc: mingo@redhat.com, peterz@infradead.org, tglx@linutronix.de,
	akpm@linux-foundation.org, arjan@linux.intel.com, bp@alien8.de,
	pjt@google.com, namhyung@kernel.org, efault@gmx.de,
	morten.rasmussen@arm.com, vincent.guittot@linaro.org,
	gregkh@linuxfoundation.org, preeti@linux.vnet.ibm.com,
	viresh.kumar@linaro.org, linux-kernel@vger.kernel.org,
	len.brown@intel.com, rafael.j.wysocki@intel.com, jkosina@suse.cz,
	clark.williams@gmail.com, tony.luck@intel.com,
	keescook@chromium.org, mgorman@suse.de, riel@redhat.com
Subject: Re: [patch v3 0/8] sched: use runnable avg in load balance
Date: Sun, 07 Apr 2013 11:09:09 +0800
Message-ID: <5160E355.6000701@linux.vnet.ibm.com>
In-Reply-To: <515BEC5F.60001@intel.com>

On 04/03/2013 04:46 PM, Alex Shi wrote:
> On 04/02/2013 03:23 PM, Michael Wang wrote:
>> | 15 GB   |      12 | 45393 |   | 43986 |
>> | 15 GB   |      16 | 45110 |   | 45719 |
>> | 15 GB   |      24 | 41415 |   | 36813 |	-11.11%
>> | 15 GB   |      32 | 35988 |   | 34025 |
>>
>> The reason may be wake_affine()'s higher overhead, and pgbench is
>> really sensitive to this stuff...
> 
> Michael:
> I changed the threshold to 0.1ms; it has the same effect on aim7.
> So could you try the following on pgbench?

Here are the results for different thresholds. There is too much data
to list in full, so I only show the 22 MB, 32 clients item:

threshold(us)			tps

base				43420
500				40694
250				40591
120				41468		-4.50%
60				47785		+10.05%
30				51389
15				52844
6				54539
3				52674
1.5				52885

Since the inflection seems to lie between 120us and 60us, I ran more
tests in that range:

threshold(us)			tps
base				43420
110				41772
100				42246
90				43471		0%
80				44920
70				46341

According to these data, 90us (90000 ns) is the break-even point on my
box for the 22 MB, 32 clients item. Other test items fluctuate somewhat
differently, so the conclusion is 80~90us.
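
For anyone who wants to redo the arithmetic on the finer sweep, here is
a minimal userspace sketch. The helper names (pct_change, us_to_ns,
break_even_us) and the struct are made up for illustration and are not
part of any patch; the numbers are copied from the table above.

```c
/* One row of the sweep: threshold in microseconds and measured tps. */
struct sample {
	unsigned int threshold_us;
	double tps;
};

/* Finer sweep for the 22 MB, 32 clients item, copied from the table. */
static const struct sample fine_sweep[] = {
	{ 110, 41772.0 },
	{ 100, 42246.0 },
	{  90, 43471.0 },
	{  80, 44920.0 },
	{  70, 46341.0 },
};

/* Relative change of a tps value against the baseline, in percent. */
static double pct_change(double base, double tps)
{
	return (tps - base) / base * 100.0;
}

/* Convert microseconds to the nanosecond unit used by the sysctl. */
static unsigned int us_to_ns(unsigned int us)
{
	return us * 1000u;
}

/*
 * Return the largest tested threshold (in us) whose tps is not below
 * the baseline, or 0 if every sample is below it.
 */
static unsigned int break_even_us(const struct sample *s, int n, double base)
{
	unsigned int best = 0;
	int i;

	for (i = 0; i < n; i++)
		if (s[i].tps >= base && s[i].threshold_us > best)
			best = s[i].threshold_us;
	return best;
}
```

With the base of 43420, break_even_us(fine_sweep, 5, 43420.0) picks the
90us row, and us_to_ns(90) gives the 90000 that would be written to the
nanosecond knob.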

Now the concern is how to deal with this issue. The results may change
across deployments, so a static value is not acceptable. Do we need
another new knob here?

I'm not sure whether you have taken a look at the wake-affine throttle
patch I sent weeks ago; its purpose is to throttle wake-affine so that
it does not work too frequently.
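
The throttle patch itself is not quoted in this mail, so the following
is only a rough userspace sketch of what a time-based throttle could
look like, not the actual implementation. The struct wa_throttle, its
fields and wa_throttle_allow() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state; not taken from the real patch. */
struct wa_throttle {
	uint64_t last_affine_ns;	/* time of the last allowed attempt */
	uint64_t interval_ns;		/* minimum gap between attempts (the knob) */
};

/*
 * Return true if a wake-affine attempt is allowed at now_ns and record
 * it; return false if the previous attempt was less than interval_ns
 * ago. Assumes now_ns never goes backwards.
 */
static bool wa_throttle_allow(struct wa_throttle *t, uint64_t now_ns)
{
	if (now_ns - t->last_affine_ns < t->interval_ns)
		return false;
	t->last_affine_ns = now_ns;
	return true;
}
```

In this sketch a 1ms throttle would correspond to interval_ns set to
1000000, and a caller would skip the wake-affine attempt whenever the
helper returns false.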

And since the aim7 problem is caused by the imbalance that is a side
effect of wake-affine succeeding too frequently, maybe the throttle
patch could help address that issue too. If it does, we only need to
add one new knob.

BTW, the benefit your patch set brings does not conflict with the
wake-affine throttle patch: with your patch set, the 1ms throttle now
also shows a 25% improvement (it used to be <5%), and the maximum
benefit increases from 40% to 45% on my box ;-)

Regards,
Michael Wang

> 
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index bf8086b..a3c3d43 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -53,6 +53,7 @@ extern unsigned int sysctl_numa_balancing_settle_count;
> 
>  #ifdef CONFIG_SCHED_DEBUG
>  extern unsigned int sysctl_sched_migration_cost;
> +extern unsigned int sysctl_sched_burst_threshold;
>  extern unsigned int sysctl_sched_nr_migrate;
>  extern unsigned int sysctl_sched_time_avg;
>  extern unsigned int sysctl_timer_migration;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dbaa8ca..dd5a324 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -91,6 +91,7 @@ unsigned int sysctl_sched_wakeup_granularity = 1000000UL;
>  unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
> 
>  const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
> +const_debug unsigned int sysctl_sched_burst_threshold = 100000UL;
> 
>  /*
>   * The exponential sliding  window over which load is averaged for shares
> @@ -3103,12 +3104,24 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	unsigned long weight;
>  	int balanced;
>  	int runnable_avg;
> +	int burst = 0;
> 
>  	idx	  = sd->wake_idx;
>  	this_cpu  = smp_processor_id();
>  	prev_cpu  = task_cpu(p);
> -	load	  = source_load(prev_cpu, idx);
> -	this_load = target_load(this_cpu, idx);
> +
> +	if (cpu_rq(this_cpu)->avg_idle < sysctl_sched_burst_threshold ||
> +		cpu_rq(prev_cpu)->avg_idle < sysctl_sched_burst_threshold)
> +		burst = 1;
> +
> +	/* use instant load for bursty waking up */
> +	if (!burst) {
> +		load = source_load(prev_cpu, idx);
> +		this_load = target_load(this_cpu, idx);
> +	} else {
> +		load = cpu_rq(prev_cpu)->load.weight;
> +		this_load = cpu_rq(this_cpu)->load.weight;
> +	}
> 
>  	/*
>  	 * If sync wakeup then subtract the (maximum possible)
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afc1dc6..1f23457 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -327,6 +327,13 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.procname	= "sched_burst_threshold_ns",
> +		.data		= &sysctl_sched_burst_threshold,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
>  		.procname	= "sched_nr_migrate",
>  		.data		= &sysctl_sched_nr_migrate,
>  		.maxlen		= sizeof(unsigned int),
>
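
The decision made in the wake_affine() hunk above can be modelled
outside the kernel roughly as follows. This is a simplified sketch:
struct cpu_snapshot and both helpers are hypothetical, with plain
fields standing in for rq->avg_idle, the source_load()/target_load()
results and rq->load.weight.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the per-CPU values the hunk reads. */
struct cpu_snapshot {
	unsigned long avg_idle_ns;	/* stands in for rq->avg_idle */
	unsigned long averaged_load;	/* stands in for source_load()/target_load() */
	unsigned long instant_load;	/* stands in for rq->load.weight */
};

/* A wakeup counts as bursty if either CPU has been idle only briefly. */
static bool is_burst(const struct cpu_snapshot *this_cpu,
		     const struct cpu_snapshot *prev_cpu,
		     unsigned long threshold_ns)
{
	return this_cpu->avg_idle_ns < threshold_ns ||
	       prev_cpu->avg_idle_ns < threshold_ns;
}

/* Use the instant load for bursty wakeups, the averaged load otherwise. */
static unsigned long pick_load(const struct cpu_snapshot *cpu, bool burst)
{
	return burst ? cpu->instant_load : cpu->averaged_load;
}
```

Lowering the threshold makes fewer wakeups count as bursty, so more of
them keep using the averaged load.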
