From: Michael wang <wangyun@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>,
Mike Galbraith <umgwanakikbuti@gmail.com>,
Rik van Riel <riel@redhat.com>, Ingo Molnar <mingo@kernel.org>,
Alex Shi <alex.shi@linaro.org>, Paul Turner <pjt@google.com>,
Mel Gorman <mgorman@suse.de>,
Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] sched: select 'idle' cfs_rq per task-group to prevent tg-internal imbalance
Date: Mon, 30 Jun 2014 15:36:39 +0800 [thread overview]
Message-ID: <53B11387.9020001@linux.vnet.ibm.com> (raw)
In-Reply-To: <53A11A89.5000602@linux.vnet.ibm.com>
On 06/18/2014 12:50 PM, Michael wang wrote:
> By testing we found that after putting a benchmark (dbench) into a deep cpu-group,
> its tasks (dbench routines) start to gather on one CPU, so the benchmark can only
> get around 100% CPU no matter how big its task-group's share is. Here is a link
> describing how to reproduce the issue:
Hi, Peter

We felt that involving too many factors made things too complicated, so
we are starting over and dropping the concepts of 'deep-group' and
'GENTLE_FAIR_SLEEPERS' from the idea, in the hope that this makes things
easier...

Setting the previous discussion aside, for now we just want to propose a
cpu-group feature which helps dbench gain enough CPU while stress is
running, in a gentle way that the current scheduler does not yet provide.

I'll post a new patch on that later; we're looking forward to your
comments on it :)
Regards,
Michael Wang
>
> https://lkml.org/lkml/2014/5/16/4
>
> Please note that our comparison was based on the same workload; the only
> difference is that we put the workload one level deeper, after which dbench
> got only 1/3 of the CPU% it used to have, and its throughput dropped by half.
>
> dbench got less CPU because all of its instances started gathering on the
> same CPU more often than before, and in such cases, no matter how big their
> share is, they can only occupy one CPU.
>
> This is caused by the fact that when dbench is in a deep group, the balance
> between its gathering speed (driven by wake-affine) and its spreading speed
> (driven by load-balance) is broken: there are more chances to gather and
> fewer chances to spread.
>
> After dbench is put into a deep group, the load it represents in the
> root-group becomes smaller, which makes it harder to break the system's load
> balance. Here is a comparison between dbench's root-load and the load of the
> other system tasks (besides dbench), for example:
>
> sg0 sg1
> cpu0 cpu1 cpu2 cpu3
>
> kworker/0:0 kworker/1:0 kworker/2:0 kworker/3:0
> kworker/0:1 kworker/1:1 kworker/2:1 kworker/3:1
> dbench
> dbench
> dbench
> dbench
> dbench
> dbench
>
> Here, without dbench, the load between the sched-groups is already balanced:
>
> 4096:4096
>
> When dbench is in one of the three cpu-cgroups on level 1, its root-load
> is 1024/6 per instance, so we have:
>
> sg0
> 4096 + 6 * (1024 / 6)
> sg1
> 4096
>
> sg0 : sg1 == 5120 : 4096 == 125%
>
> which is bigger than imbalance-pct (117%, for example), so dbench spreads to sg1.
>
>
> When dbench is in one of the three cpu-cgroups on level 2, its root-load
> is 1024/18 per instance, and now we have:
>
> sg0
> 4096 + 6 * (1024 / 18)
> sg1
> 4096
>
> sg0 : sg1 ~= 4437 : 4096 ~= 108%
>
> which is smaller than imbalance-pct (the same 117%), so dbench keeps gathering in sg0.
>
> Thus the load-balance routine becomes inactive at spreading dbench to other
> CPUs, and its instances keep gathering on one CPU much longer than before.
>
> This patch tries to select an 'idle' cfs_rq inside the task's cpu-group when no
> idle CPU is located by select_idle_sibling(), instead of returning the 'target'
> arbitrarily. This recheck helps us preserve the effect of load-balance longer,
> and helps make the system more balanced.
>
> Like in the example above, with the fix things work as follows:
> 1. dbench instances will be 'balanced' inside the tg; ideally each CPU will
> have one instance.
> 2. if step 1 does make the load imbalanced, the load-balance routine will do
> its job and move instances to the proper CPUs.
> 3. after step 2 is done, the target CPU will always be preferred as long as
> it has only one instance.
>
> Although for tasks like dbench step 2 rarely happens, combined with step 3 we
> will eventually locate a good CPU for each instance, which makes things both
> internally and externally balanced.
>
> After applying this patch, the behaviour of dbench in a deep cpu-group became
> normal, and the dbench throughput was back.
>
> We tested benchmarks like ebizzy, kbench, and dbench on an x86 12-CPU server;
> the patch works well and no regression showed up.
>
> Highlight:
> Without a fix, any workload similar to dbench will face the same
> issue: the cpu-cgroup share loses its effect.
>
> This may not just be a cgroup issue; whenever we have small-load tasks
> that flip quickly on each other's wakeups, they may gather.
>
> Please let me know if you have any questions about either the issue or the fix;
> comments are welcome ;-)
>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
> ---
> kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 81 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..e1381cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
> return idlest;
> }
>
> +static inline int tg_idle_cpu(struct task_group *tg, int cpu)
> +{
> + return !tg->cfs_rq[cpu]->nr_running;
> +}
> +
> +/*
> + * Try and locate an idle CPU in the sched_domain from the tg's view.
> + *
> + * Although being gathered on one CPU and being spread across CPUs may
> + * look the same from the highest group's view, gathering will starve
> + * the tasks: even if they have enough share to fight for CPU, they
> + * only get one battlefield, which means no matter how big their
> + * weight is, they get one CPU at most.
> + *
> + * Thus when the system is busy, we filter out those tasks which cannot
> + * gain help from the balance routine, and try to balance them
> + * internally in this function, so they stand a chance to show their power.
> + *
> + */
> +static int tg_idle_sibling(struct task_struct *p, int target)
> +{
> + struct sched_domain *sd;
> + struct sched_group *sg;
> + int i = task_cpu(p);
> + struct task_group *tg = task_group(p);
> +
> + if (tg_idle_cpu(tg, target))
> + goto done;
> +
> + sd = rcu_dereference(per_cpu(sd_llc, target));
> + for_each_lower_domain(sd) {
> + sg = sd->groups;
> + do {
> + if (!cpumask_intersects(sched_group_cpus(sg),
> + tsk_cpus_allowed(p)))
> + goto next;
> +
> + for_each_cpu(i, sched_group_cpus(sg)) {
> + if (i == target || !tg_idle_cpu(tg, i))
> + goto next;
> + }
> +
> + target = cpumask_first_and(sched_group_cpus(sg),
> + tsk_cpus_allowed(p));
> +
> + goto done;
> +next:
> + sg = sg->next;
> + } while (sg != sd->groups);
> + }
> +
> +done:
> +
> + return target;
> +}
> +
> /*
> * Try and locate an idle CPU in the sched_domain.
> */
> @@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
> struct sched_domain *sd;
> struct sched_group *sg;
> int i = task_cpu(p);
> + struct sched_entity *se = task_group(p)->se[i];
>
> if (idle_cpu(target))
> return target;
> @@ -4451,6 +4508,30 @@ next:
> } while (sg != sd->groups);
> }
> done:
> +
> + if (!idle_cpu(target)) {
> + /*
> + * No idle cpu located implies the system is somewhat busy;
> + * usually we count on the load balance routine's help and
> + * just pick the target no matter how busy it is.
> + *
> + * However, when a task belongs to a deep group (harder to
> + * make the root imbalanced) and flips frequently (harder to
> + * be caught during balance), the load balance routine can
> + * help nothing, and these tasks will eventually gather on
> + * the same cpu when they wake each other up; that is, the
> + * chance of gathering stands far higher than the chance
> + * of spreading.
> + *
> + * Thus for such tasks we need to handle the wakeup
> + * carefully, since it is their rare chance to spread.
> + *
> + */
> + if (se && se->depth &&
> + p->wakee_flips > this_cpu_read(sd_llc_size))
> + return tg_idle_sibling(p, target);
> + }
> +
> return target;
> }
>
>