Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Michael wang <wangyun@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Alex Shi <alex.shi@linaro.org>, Paul Turner <pjt@google.com>,
	Mel Gorman <mgorman@suse.de>,
	Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance
Date: Mon, 30 Jun 2014 15:36:39 +0800	[thread overview]
Message-ID: <53B11387.9020001@linux.vnet.ibm.com> (raw)
In-Reply-To: <53A11A89.5000602@linux.vnet.ibm.com>

On 06/18/2014 12:50 PM, Michael wang wrote:
> By testing we found that after put benchmark (dbench) in to deep cpu-group,
> tasks (dbench routines) start to gathered on one CPU, which lead to that the
> benchmark could only get around 100% CPU whatever how big it's task-group's
> share is, here is the link of the way to reproduce the issue:

Hi, Peter

We thought that involved too much factors will make things too
complicated, we are trying to start over and get rid of the concepts of
'deep-group' and 'GENTLE_FAIR_SLEEPERS' in the idea, wish this could
make things more easier...

Let's put down the prev-discussions, for now we just want to proposal a
cpu-group feature which could help dbench to gain enough CPU while
stress is running, in a gentle way which hasn't yet been provided by
current scheduler.

I'll post a new patch on that later, we're looking forward your comments
on it :)

Regards,
Michael Wang

> 
> 	https://lkml.org/lkml/2014/5/16/4
> 
> Please note that our comparison was based on the same workload, the only
> difference is we put the workload one level deeper, and dbench could only
> got 1/3 of the CPU% it used to have, the throughput dropped to half.
> 
> The dbench got less CPU since all it's instances start to gathering on the
> same CPU more often than before, and in such cases, whatever how big their
> share is, only one CPU they could occupy.
> 
> This is caused by that when dbench is in deep-group, the balance between
> it's gathering speed (depends on wake-affine) and spreading speed (depends
> on load-balance) was broken, that is more gathering chances while less
> spreading chances.
> 
> Since after put dbench into deep group, it's representive load in root-group
> become less, which make it harder to break the load balance of system, this
> is a comparison between dbench root-load and system-tasks (besides dbench)
> load, for eg:
> 
> 	sg0					sg1
> 	cpu0		cpu1			cpu2		cpu3
> 
> 	kworker/0:0	kworker/1:0		kworker/2:0	kworker/3:0
> 	kworker/0:1	kworker/1:1		kworker/2:1	kworker/3:1
> 	dbench
> 	dbench
> 	dbench
> 	dbench
> 	dbench
> 	dbench
> 
> Here without dbench, the load between sg is already balanced, which is:
> 
> 	4096:4096
> 
> When dbench is in one of the three cpu-cgroups on level 1, it's root-load
> is 1024/6, so we have:
> 
> 	sg0
> 		4096 + 6 * (1024 / 6)
> 	sg1
> 		4096
> 
> 	sg0 : sg1 == 5120 : 4096 == 125%
> 
> 	bigger than imbalance-pct (117% for eg), dbench spread to sg1
> 
> 
> When dbench is in one of the three cpu-cgroups on level 2, it's root-load
> is 1024/18, now we have:
> 
> 	sg0
> 		4096 + 6 * (1024 / 18)
> 	sg1
> 		4096
> 
> 	sg0 : sg1 ~= 4437 : 4096 ~= 108%
> 
> 	smaller than imbalance-pct (same the 117%), dbench keep gathering in sg0
> 
> Thus load-balance routine become inactively on spreading dbench to other CPU,
> and it's routine keep gathering on CPU more longer than before.
> 
> This patch try to select 'idle' cfs_rq inside task's cpu-group when there is no
> idle CPU located by select_idle_sibling(), instead of return the 'target'
> arbitrarily, this recheck help us to reserve the effect of load-balance longer,
> and help to make the system more balance.
> 
> Like in the example above, the fix now will make things as:
> 	1. dbench instances will be 'balanced' inside tg, ideally each cpu will
> 	   have one instance.
> 	2. if 1 do make the load become imbalance, load-balance routine will do
> 	   it's job and move instances to proper CPU.
> 	3. after 2 was done, the target CPU will always be preferred as long as
> 	   it only got one instance.
> 
> Although for tasks like dbench, 2 is rarely happened, while combined with 3, we
> will finally locate a good CPU for each instance which make both internal and
> external balanced.
> 
> After applied this patch, the behaviour of dbench in deep cpu-group become
> normal, the dbench throughput was back.
> 
> Tested benchmarks like ebizzy, kbench, dbench on X86 12-CPU server, the patch
> works well and no regression showup.
> 
> Highlight:
> 	With out a fix, any similar workload like dbench will face the same
> 	issue that the cpu-cgroup share lost it's effect
> 
> 	This may not just be an issue of cgroup, whenever we have tasks which
> 	with small-load, play quick flip on each other, they may gathering.
> 
> Please let me know if you have any questions on whatever the issue or the fix,
> comments are welcomed ;-)
> 
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c |   81 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 81 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..e1381cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>  	return idlest;
>  }
>  
> +static inline int tg_idle_cpu(struct task_group *tg, int cpu)
> +{
> +	return !tg->cfs_rq[cpu]->nr_running;
> +}
> +
> +/*
> + * Try and locate an idle CPU in the sched_domain from tg's view.
> + *
> + * Although gathered on same CPU and spread accross CPUs could make
> + * no difference from highest group's view, this will cause the tasks
> + * starving, even they have enough share to fight for CPU, they only
> + * got one battle filed, which means whatever how big their weight is,
> + * they totally got one CPU at maximum.
> + *
> + * Thus when system is busy, we filtered out those tasks which couldn't
> + * gain help from balance routine, and try to balance them internally
> + * by this func, so they could stand a chance to show their power.
> + *
> + */
> +static int tg_idle_sibling(struct task_struct *p, int target)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg;
> +	int i = task_cpu(p);
> +	struct task_group *tg = task_group(p);
> +
> +	if (tg_idle_cpu(tg, target))
> +		goto done;
> +
> +	sd = rcu_dereference(per_cpu(sd_llc, target));
> +	for_each_lower_domain(sd) {
> +		sg = sd->groups;
> +		do {
> +			if (!cpumask_intersects(sched_group_cpus(sg),
> +						tsk_cpus_allowed(p)))
> +				goto next;
> +
> +			for_each_cpu(i, sched_group_cpus(sg)) {
> +				if (i == target || !tg_idle_cpu(tg, i))
> +					goto next;
> +			}
> +
> +			target = cpumask_first_and(sched_group_cpus(sg),
> +					tsk_cpus_allowed(p));
> +
> +			goto done;
> +next:
> +			sg = sg->next;
> +		} while (sg != sd->groups);
> +	}
> +
> +done:
> +
> +	return target;
> +}
> +
>  /*
>   * Try and locate an idle CPU in the sched_domain.
>   */
> @@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	struct sched_domain *sd;
>  	struct sched_group *sg;
>  	int i = task_cpu(p);
> +	struct sched_entity *se = task_group(p)->se[i];
>  
>  	if (idle_cpu(target))
>  		return target;
> @@ -4451,6 +4508,30 @@ next:
>  		} while (sg != sd->groups);
>  	}
>  done:
> +
> +	if (!idle_cpu(target)) {
> +		/*
> +		 * No idle cpu located imply the system is somewhat busy,
> +		 * usually we count on load balance routine's help and
> +		 * just pick the target whatever how busy it is.
> +		 *
> +		 * However, when task belong to a deep group (harder to
> +		 * make root imbalance) and flip frequently (harder to be
> +		 * caught during balance), load balance routine could help
> +		 * nothing, and these tasks will eventually gathered on same
> +		 * cpu when they wakeup each other, that is the chance of
> +		 * gathered stand far more higher than the chance of spread.
> +		 *
> +		 * Thus for such tasks, we need to handle them carefully
> +		 * during wakeup, since it's the very rarely chance for
> +		 * them to spread.
> +		 *
> +		 */
> +		if (se && se->depth &&
> +				p->wakee_flips > this_cpu_read(sd_llc_size))
> +			return tg_idle_sibling(p, target);
> +	}
> +
>  	return target;
>  }
>  
>

next prev parent reply	other threads:[~2014-06-30  7:36 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-18  4:50 [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance Michael wang
2014-06-23  9:42 ` Peter Zijlstra
2014-06-24  3:34   ` Michael wang
2014-07-01  8:20     ` Peter Zijlstra
2014-07-01  8:38       ` Michael wang
2014-07-01  8:56         ` Peter Zijlstra
2014-07-02  2:47           ` Michael wang
2014-07-02 12:49             ` Peter Zijlstra
2014-07-02 13:15               ` Peter Zijlstra
2014-07-03 16:29                 ` Nicolas Pitre
2014-07-03  2:09               ` Michael wang
2014-07-02 14:47         ` Rik van Riel
2014-07-03  2:16           ` Michael wang
2014-07-03  3:51           ` Mike Galbraith
2014-07-11 16:11             ` Rik van Riel
2014-07-14  1:29               ` Michael wang
2014-06-30  7:36 ` Michael wang [this message]
2014-06-30  8:06   ` Mike Galbraith
2014-06-30  8:47     ` Michael wang
2014-06-30  9:27       ` Mike Galbraith
2014-07-01  2:57         ` Michael wang
2014-07-01  5:41           ` Mike Galbraith
2014-07-01  6:10             ` Michael wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53B11387.9020001@linux.vnet.ibm.com \
    --to=wangyun@linux.vnet.ibm.com \
    --cc=alex.shi@linaro.org \
    --cc=daniel.lezcano@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=riel@redhat.com \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.