* [PATCH V3 0/2] load_balance() fixes for affinity
@ 2017-05-18 19:36 Jeffrey Hugo
From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
Jeffrey Hugo
We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the CPU workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler tracing to profile the issue and noticed that, in the low-scoring
runs, at least one core was left idle for the duration of the test.
The test methodology used in our investigation involved an executable that
was compiled to fork a specific number of workers. The executable was then
affined, using the userspace taskset utility, to a number of cores equal to
the number of workers forked. The expectation was
that all cores in the mask of runnable CPUs would run a process until the
work was complete. In practice, for several affinity masks, one to two
cores would not receive work until late in the test or not at all.
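The exact test binary is not part of this posting; purely as an
illustration, a minimal sketch of such a workload could look like the
following (the default worker count, the length of the spin loop, and the
sched_getcpu() reporting are hypothetical details), run e.g. as
"taskset -c 1-5 ./spin 5":

/* spin.c - hypothetical reconstruction of the affinity test described
 * above: fork N cpu-bound workers, report their initial cpu, then wait.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void burn_cpu(void)
{
	volatile unsigned long long x = 0;
	unsigned long long i;

	/* spin long enough for load_balance() behavior to matter */
	for (i = 0; i < (1ULL << 33); i++)
		x += i;
}

int main(int argc, char **argv)
{
	int i, nr_workers = argc > 1 ? atoi(argv[1]) : 4;

	for (i = 0; i < nr_workers; i++) {
		if (fork() == 0) {
			/* child: note the initial placement, then do the work */
			printf("worker %d started on cpu %d\n", i, sched_getcpu());
			burn_cpu();
			return 0;
		}
	}

	while (wait(NULL) > 0)	/* parent: reap all workers */
		;
	return 0;
}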
Our first observation was that the initial CPU assignment of the forked
threads appeared suboptimal. It was highly probable that many or all of the
threads would initially be assigned to the same CPU. Our limited
investigation indicated that the threads are forked quickly enough that the
CPU load estimates have not yet adjusted to the work already assigned. The
function select_task_rq_fair() does not take the number of running
processes into
consideration and thus can easily assign several tasks spawned close
together to the same CPU within the allowed CPUs for a workload. While this
seems suboptimal, the load_balance() algorithm run on other idle cores in
the affinity mask should resolve imbalances that result from it.
As our test case was leaving cores idle for many seconds at a time, we dug
into the load_balance() code, specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessments of the ability of the
algorithm to migrate load. The group_imbalance path correctly sets the flag
to indicate the group cannot be properly balanced due to affinity, but the
redo condition right after this branch incorrectly assumes that there may
be other cores with work to be pulled by considering cores outside of the
scheduling domain in question.
Consider the following circumstance where (T) represents tasks and (*)
represents the allowed CPUs:
        A              B              C
    { 0   1 }      { 2   3 }      { 4   5 }
          T          T              T   T
          T
          *          *   *          *   *
In this case, core 0 in scheduling group 'A' should run load balance on its
local scheduling domain and try to pull work from core 1. When it fails to
move any work due to affinity, it will mark scheduling group 'A' as
"group_imbalance". However, right after doing so, core 0 load_balance()
evaluates the redo path with core 1 removed from consideration. This should
leave no other cores to consider and result in a jump to "out_all_pinned",
leaving the "group_imbalance" flag to be seen by future runs of the
algorithm on core 3. As the code stands currently, core 0 evaluates all
cores instead of those in the scheduling domain under consideration.
It incorrectly sees other cores in the system and attempts a redo.
With core 1 removed from the mask, the redo identifies no load imbalance
and results in a jump to "out_balanced", clearing the "group_imbalance"
flag.
We propose correcting this logic with patch 1 of this series.
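Condensed from patch 1 (reproduced in full later in this thread), with the
comments and loop bookkeeping trimmed, the redo check changes roughly as
follows:

	/* current code: any remaining active cpu, anywhere, triggers a redo */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_empty(cpus)) {
			...
			goto redo;
		}
	}

	/* proposed: only other potential busiest cpus in this sched domain count */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		cpumask_clear_cpu(env.dst_cpu, cpus);
		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
			cpumask_set_cpu(env.dst_cpu, cpus);
			...
			goto redo;
		}
	}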
Making this correction revealed an issue in the calculate_imbalance() path.
Patch 2 removes a branch that no longer makes sense with the current
load_balance() algorithm: there is no scenario in which it benefits the
"group_imbalance" case in calculate_imbalance(), and it causes problems on
systems with affined workloads and many idle or lightly loaded CPUs.
Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
[V3]
-Fixed botched subject lines
[V2]
-Corrected use of Signed-off-by tags
-Removed temp cpumask variable from stack
Jeffrey Hugo (2):
sched/fair: Fix load_balance() affinity redo path
sched/fair: Remove group imbalance from calculate_imbalance()
kernel/sched/fair.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
	Jeffrey Hugo

If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source cpu from consideration and
attempts to redo and balance among the new subset of cpus.

There is a bug in this code path where the algorithm considers all active
cpus in the system (minus the source that was just masked out). This is
not valid for two reasons: some active cpus may not be in the current
scheduling domain and one of the active cpus is dst_cpu. These cpus should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid cpus and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other cpus in future
load_balance() runs as that algorithm intends.

Fix the check by removing cpus not in the current domain and the dst_cpu
from consideration, thus limiting the evaluation to valid remaining cpus
from which load might be migrated.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d711093..a5d41b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	/* All tasks on this runqueue were pinned by CPU affinity */
 	if (unlikely(env.flags & LBF_ALL_PINNED)) {
 		cpumask_clear_cpu(cpu_of(busiest), cpus);
-		if (!cpumask_empty(cpus)) {
+		/*
+		 * dst_cpu is not a valid busiest cpu in the following
+		 * check since load cannot be pulled from dst_cpu to be
+		 * put on dst_cpu.
+		 */
+		cpumask_clear_cpu(env.dst_cpu, cpus);
+		/*
+		 * Go back to "redo" iff the load-balance cpumask
+		 * contains other potential busiest cpus for the
+		 * current sched domain.
+		 */
+		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
+			/*
+			 * Now that the check has passed, reenable
+			 * dst_cpu so that load can be calculated on
+			 * it in the redo path.
+			 */
+			cpumask_set_cpu(env.dst_cpu, cpus);
 			env.loop = 0;
 			env.loop_break = sched_nr_migrate_break;
 			goto redo;
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-19 13:31 UTC (permalink / raw)
To: Jeffrey Hugo, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Austin Christ, Tyler Baicar, Timur Tabi

On 18/05/17 20:36, Jeffrey Hugo wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d711093..a5d41b1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	/* All tasks on this runqueue were pinned by CPU affinity */
>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
> -		if (!cpumask_empty(cpus)) {
> +		/*
> +		 * dst_cpu is not a valid busiest cpu in the following
> +		 * check since load cannot be pulled from dst_cpu to be
> +		 * put on dst_cpu.
> +		 */
> +		cpumask_clear_cpu(env.dst_cpu, cpus);
> +		/*
> +		 * Go back to "redo" iff the load-balance cpumask
> +		 * contains other potential busiest cpus for the
> +		 * current sched domain.
> +		 */
> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
> +			/*
> +			 * Now that the check has passed, reenable
> +			 * dst_cpu so that load can be calculated on
> +			 * it in the redo path.
> +			 */
> +			cpumask_set_cpu(env.dst_cpu, cpus);

IMHO, this will work nicely and its way easier.

Another idea might be to check if the LBF_ALL_PINNED is set when we
check if we should clean the imbalance flag.

@@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
         * We reach balance although we may have faced some affinity
         * constraints. Clear the imbalance flag if it was set.
         */
-       if (sd_parent) {
+       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
                int *group_imbalance = &sd_parent->groups->sgc->imbalance;

                if (*group_imbalance)
                        *group_imbalance = 0;
        }

But I think preventing a needless redo loop is even better ...

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-22  9:48 UTC (permalink / raw)
To: Jeffrey Hugo, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Austin Christ, Tyler Baicar, Timur Tabi

On 19/05/17 14:31, Dietmar Eggemann wrote:
> On 18/05/17 20:36, Jeffrey Hugo wrote:
>
> [...]
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d711093..a5d41b1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>> -		if (!cpumask_empty(cpus)) {
>> +		/*
>> +		 * dst_cpu is not a valid busiest cpu in the following
>> +		 * check since load cannot be pulled from dst_cpu to be
>> +		 * put on dst_cpu.
>> +		 */
>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>> +		/*
>> +		 * Go back to "redo" iff the load-balance cpumask
>> +		 * contains other potential busiest cpus for the
>> +		 * current sched domain.
>> +		 */
>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>> +			/*
>> +			 * Now that the check has passed, reenable
>> +			 * dst_cpu so that load can be calculated on
>> +			 * it in the redo path.
>> +			 */
>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>
> IMHO, this will work nicely and its way easier.

This was too quick ... if we still have other potential dst cpus
available and cpu_of(busiest) is the latest src cpu then this will fail.

It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
== NULL'.

But IMHO 'group_imbalance' propagation has to work on higher sd levels
as well.

> Another idea might be to check if the LBF_ALL_PINNED is set when we
> check if we should clean the imbalance flag.
>
> @@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>          * We reach balance although we may have faced some affinity
>          * constraints. Clear the imbalance flag if it was set.
>          */
> -       if (sd_parent) {
> +       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
>                 int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>
>                 if (*group_imbalance)
>                         *group_imbalance = 0;
>         }

[...]

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Christ, Austin @ 2017-05-22 19:57 UTC (permalink / raw)
To: Dietmar Eggemann, Jeffrey Hugo, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

Hey Dietmar,

On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
> On 19/05/17 14:31, Dietmar Eggemann wrote:
>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>
>> [...]
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d711093..a5d41b1 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>> -		if (!cpumask_empty(cpus)) {
>>> +		/*
>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>> +		 * check since load cannot be pulled from dst_cpu to be
>>> +		 * put on dst_cpu.
>>> +		 */
>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>> +		/*
>>> +		 * Go back to "redo" iff the load-balance cpumask
>>> +		 * contains other potential busiest cpus for the
>>> +		 * current sched domain.
>>> +		 */
>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>> +			/*
>>> +			 * Now that the check has passed, reenable
>>> +			 * dst_cpu so that load can be calculated on
>>> +			 * it in the redo path.
>>> +			 */
>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>> IMHO, this will work nicely and its way easier.
> This was too quick ... if we still have other potential dst cpus
> available and cpu_of(busiest) is the latest src cpu then this will fail.
>
> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
> == NULL'.
>
> But IMHO 'group_imbalance' propagation has to work on higher sd levels
> as well.
Can you clarify the fail case you are seeing? We are only aware of
dst_cpu being changed under [1] where a dst_cpu will try to move work to
one of its sched_group siblings.

I'm also not entirely sure I understand what you mean about the flag
being propagated to higher sd levels.
>> Another idea might be to check if the LBF_ALL_PINNED is set when we
>> check if we should clean the imbalance flag.
>>
>> @@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>          * We reach balance although we may have faced some affinity
>>          * constraints. Clear the imbalance flag if it was set.
>>          */
>> -       if (sd_parent) {
>> +       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
>>                 int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>
>>                 if (*group_imbalance)
>>                         *group_imbalance = 0;
>>         }
> [...]

[1] - http://elixir.free-electrons.com/linux/latest/source/kernel/sched/fair.c#L8140

--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-23 11:45 UTC (permalink / raw)
To: Christ, Austin, Jeffrey Hugo, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

Hey Austin,

On 22/05/17 20:57, Christ, Austin wrote:
> Hey Dietmar,
>
>
> On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
>> On 19/05/17 14:31, Dietmar Eggemann wrote:
>>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>>
>>> [...]
>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index d711093..a5d41b1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct
>>>> rq *this_rq,
>>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>>> -		if (!cpumask_empty(cpus)) {
>>>> +		/*
>>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>>> +		 * check since load cannot be pulled from dst_cpu to be
>>>> +		 * put on dst_cpu.
>>>> +		 */
>>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>>> +		/*
>>>> +		 * Go back to "redo" iff the load-balance cpumask
>>>> +		 * contains other potential busiest cpus for the
>>>> +		 * current sched domain.
>>>> +		 */
>>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>>> +			/*
>>>> +			 * Now that the check has passed, reenable
>>>> +			 * dst_cpu so that load can be calculated on
>>>> +			 * it in the redo path.
>>>> +			 */
>>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>>> IMHO, this will work nicely and its way easier.
>> This was too quick ... if we still have other potential dst cpus
>> available and cpu_of(busiest) is the latest src cpu then this will fail.
>>
>> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
>> == NULL'.
>>
>> But IMHO 'group_imbalance' propagation has to work on higher sd levels
>> as well.
> Can you clarify the fail case you are seeing? We are only aware of
> dst_cpu being changed under [1] where a dst_cpu will try to move work to
> one of its sched_group siblings.
>
> I'm also not entirely sure I understand what you mean about the flag
> being propagated to higher sd levels.

The propagation of 'imbalance' information should not only happen
between lowest sd (sd->child == NULL) and its parent (MC->DIE in your
example) but between all {sd, sd->parent} pairs.

Imagine your machine had another sd on top of DIE.

I recreated the issue I pointed out on my hikey board (2*4) (w/o this
extra sd on top of DIE), hotplug-ed out cpu 2,3,6,7 so I have a system
with the following DIE sched_groups (sg):

sg1(0,1) and sg2(4,5) <- the DIE level sg's contain more than 1 logical cpu.

As a workload I run 4 25% tasks affine to [0,1]. These tasks are
'SOURCE' PINNED for a DIE lb sg2<-sg1.

With:

if (unlikely(env.flags & LBF_ALL_PINNED)) {
	cpumask_clear_cpu(cpu_of(busiest), cpus);
	if (!cpumask_empty(cpus)) {
		...

		printk("goto redo: sd=%s dst_cpu=%d src_cpu=%d cpus=%*pbl
			dst_grpmask=%*pbl\n",
			sd->name, env.dst_cpu, cpu_of(busiest),
			cpumask_pr_args(cpus),
			cpumask_pr_args(env.dst_grpmask));

		goto redo;
	}

While running the workload I sometimes get:

...
goto redo: sd=DIE dst_cpu=4 src_cpu=1 cpus=0,4-5 dst_grpmask=4-5
goto redo: sd=DIE dst_cpu=4 src_cpu=0 cpus=4-5 dst_grpmask=4-5
...

So even though 'redo' handling has tried both possible src_cpu's we
would still enter another 'redo' path even you remove dst_cpu=4
temporarily because of cpu=5.

You could replace:

cpumask_clear_cpu(env.dst_cpu, cpus)
cpumask_set_cpu(env.dst_cpu, cpus)

with

cpumask_andnot(cpus, cpus, env.dst_grpmask)
cpumask_or(cpus, cpus, env.dst_grpmask)

but then env.dst_grpmask can't be set to NULL for CPU_NEWLY_IDLE and
you're almost at the snippet I sent out for v1:
https://marc.info/?l=linux-kernel&m=149486020010389&w=2

[...]

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Jeffrey Hugo @ 2017-05-23 22:05 UTC (permalink / raw)
To: Dietmar Eggemann, Christ, Austin, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

On 5/23/2017 5:45 AM, Dietmar Eggemann wrote:
> Hey Austin,
>
> On 22/05/17 20:57, Christ, Austin wrote:
>> Hey Dietmar,
>>
>>
>> On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
>>> On 19/05/17 14:31, Dietmar Eggemann wrote:
>>>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index d711093..a5d41b1 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct
>>>>> rq *this_rq,
>>>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>>>> -		if (!cpumask_empty(cpus)) {
>>>>> +		/*
>>>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>>>> +		 * check since load cannot be pulled from dst_cpu to be
>>>>> +		 * put on dst_cpu.
>>>>> +		 */
>>>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>>>> +		/*
>>>>> +		 * Go back to "redo" iff the load-balance cpumask
>>>>> +		 * contains other potential busiest cpus for the
>>>>> +		 * current sched domain.
>>>>> +		 */
>>>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>>>> +			/*
>>>>> +			 * Now that the check has passed, reenable
>>>>> +			 * dst_cpu so that load can be calculated on
>>>>> +			 * it in the redo path.
>>>>> +			 */
>>>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>>>> IMHO, this will work nicely and its way easier.
>>> This was too quick ... if we still have other potential dst cpus
>>> available and cpu_of(busiest) is the latest src cpu then this will fail.
>>>
>>> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
>>> == NULL'.
>>>
>>> But IMHO 'group_imbalance' propagation has to work on higher sd levels
>>> as well.
>> Can you clarify the fail case you are seeing? We are only aware of
>> dst_cpu being changed under [1] where a dst_cpu will try to move work to
>> one of its sched_group siblings.
>>
>> I'm also not entirely sure I understand what you mean about the flag
>> being propagated to higher sd levels.
>
> The propagation of 'imbalance' information should not only happen
> between lowest sd (sd->child == NULL) and its parent (MC->DIE in your
> example) but between all {sd, sd->parent} pairs.
>
> Imagine your machine had another sd on top of DIE.
>
> I recreated the issue I pointed out on my hikey board (2*4) (w/o this
> extra sd on top of DIE), hotplug-ed out cpu 2,3,6,7 so I have a system
> with the following DIE sched_groups (sg):
>
> sg1(0,1) and sg2(4,5) <- the DIE level sg's contain more than 1 logical cpu.
>
> As a workload I run 4 25% tasks affine to [0,1]. These tasks are
> 'SOURCE' PINNED for a DIE lb sg2<-sg1.
>
> With:
>
> if (unlikely(env.flags & LBF_ALL_PINNED)) {
> 	cpumask_clear_cpu(cpu_of(busiest), cpus);
> 	if (!cpumask_empty(cpus)) {
> 		...
>
> 		printk("goto redo: sd=%s dst_cpu=%d src_cpu=%d cpus=%*pbl
> 			dst_grpmask=%*pbl\n",
> 			sd->name, env.dst_cpu, cpu_of(busiest),
> 			cpumask_pr_args(cpus),
> 			cpumask_pr_args(env.dst_grpmask));
>
> 		goto redo;
> 	}
>
> While running the workload I sometimes get:
>
> ...
> goto redo: sd=DIE dst_cpu=4 src_cpu=1 cpus=0,4-5 dst_grpmask=4-5
> goto redo: sd=DIE dst_cpu=4 src_cpu=0 cpus=4-5 dst_grpmask=4-5
> ...
>
> So even though 'redo' handling has tried both possible src_cpu's we
> would still enter another 'redo' path even you remove dst_cpu=4
> temporarily because of cpu=5.
>
> You could replace:
>
> cpumask_clear_cpu(env.dst_cpu, cpus)
> cpumask_set_cpu(env.dst_cpu, cpus)
>
> with
>
> cpumask_andnot(cpus, cpus, env.dst_grpmask)
> cpumask_or(cpus, cpus, env.dst_grpmask)
>
> but then env.dst_grpmask can't be set to NULL for CPU_NEWLY_IDLE and
> you're almost at the snippet I sent out for v1:
> https://marc.info/?l=linux-kernel&m=149486020010389&w=2
>
> [...]
>

I see what you mean. You are right, our current proposal is flawed in
this regard. We'll take a look for a bit and see if we come up with any
other ideas.

--
Jeffrey Hugo

Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* [PATCH V3 2/2] sched/fair: Remove group imbalance from calculate_imbalance()

From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
	Jeffrey Hugo

The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. That is not the case today.

The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.

At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
Calculate_imbalance() would not choose to run fix_small_imbalance() when we
expect it should. In the worst case, this can result in idle cpus.

Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f783ba..3283561 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;

-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure cpu-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 0/2] load_balance() fixes for affinity

From: Peter Zijlstra @ 2017-05-22 15:52 UTC (permalink / raw)
To: Jeffrey Hugo
Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar, Timur Tabi

On Thu, May 18, 2017 at 01:36:01PM -0600, Jeffrey Hugo wrote:
> The group_imbalance path correctly sets the flag
> to indicate the group can not be properly balanced due to affinity, but the
> redo condition right after this branch incorrectly assumes that there may
> be other cores with work to be pulled by considering cores outside of the
> scheduling domain in question.

So its been a while since I looked at any of this, but from a quick
look, env->cpus appears to only be applied to group/balance masks.

In which case, we can easily do something like the below. Did I miss
something?

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58e3023..1724e4433f89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8104,7 +8104,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	if (idle == CPU_NEWLY_IDLE)
 		env.dst_grpmask = NULL;

-	cpumask_copy(cpus, cpu_active_mask);
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

 	schedstat_inc(sd->lb_count[idle]);

* Re: [PATCH V3 0/2] load_balance() fixes for affinity

From: Christ, Austin @ 2017-05-22 20:17 UTC (permalink / raw)
To: Peter Zijlstra, Jeffrey Hugo
Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Tyler Baicar, Timur Tabi

Hey Peter,

On 5/22/2017 9:52 AM, Peter Zijlstra wrote:
> On Thu, May 18, 2017 at 01:36:01PM -0600, Jeffrey Hugo wrote:
>
>> The group_imbalance path correctly sets the flag
>> to indicate the group can not be properly balanced due to affinity, but the
>> redo condition right after this branch incorrectly assumes that there may
>> be other cores with work to be pulled by considering cores outside of the
>> scheduling domain in question.
> So its been a while since I looked at any of this, but from a quick
> look, env->cpus appears to only be applied to group/balance masks.
>
> In which case, we can easily do something like the below. Did I miss
> something?
We have looked through and agree with your proposed change; however, we
would still need to mask out the dst_cpu when considering the redo path.

We will include this modification in the next patch set.
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 219fe58e3023..1724e4433f89 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8104,7 +8104,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	if (idle == CPU_NEWLY_IDLE)
>  		env.dst_grpmask = NULL;
>
> -	cpumask_copy(cpus, cpu_active_mask);
> +	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
>  	schedstat_inc(sd->lb_count[idle]);
>

--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.