* [PATCH V3 0/2] load_balance() fixes for affinity
@ 2017-05-18 19:36 Jeffrey Hugo
From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
Jeffrey Hugo
We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the CPU workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler tracing to profile the issue and noticed that, in the low-scoring
runs, at least one core was left idle for the duration of the test.
The test methodology used in our investigation involved an executable that
was compiled to fork a specific number of workers. The executable was then
affined, using the userspace taskset utility, to a number of cores equal to
the number of workers forked. The expectation was
that all cores in the mask of runnable CPUs would run a process until the
work was complete. In practice, for several affinity masks, one to two
cores would not receive work until late in the test or not at all.
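The exact test binary is not part of this posting; purely as an
illustration, a minimal sketch of such a workload could look like the
following (the default worker count, the length of the spin loop, and the
sched_getcpu() reporting are hypothetical details), run e.g. as
"taskset -c 1-5 ./spin 5":

/* spin.c - hypothetical reconstruction of the affinity test described
 * above: fork N cpu-bound workers, report their initial cpu, then wait.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void burn_cpu(void)
{
	volatile unsigned long long x = 0;
	unsigned long long i;

	/* spin long enough for load_balance() behavior to matter */
	for (i = 0; i < (1ULL << 33); i++)
		x += i;
}

int main(int argc, char **argv)
{
	int i, nr_workers = argc > 1 ? atoi(argv[1]) : 4;

	for (i = 0; i < nr_workers; i++) {
		if (fork() == 0) {
			/* child: note the initial placement, then do the work */
			printf("worker %d started on cpu %d\n", i, sched_getcpu());
			burn_cpu();
			return 0;
		}
	}

	while (wait(NULL) > 0)	/* parent: reap all workers */
		;
	return 0;
}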
Our first observation was that the initial CPU assignment of the forked
threads appeared suboptimal. It was highly probable that many or all of the
threads would initially be assigned to the same CPU. Our limited
investigation indicated that the threads are forked quickly enough that the
CPU load estimates have not yet adjusted to the work already assigned. The
function select_task_rq_fair() does not take the number of running
processes into
consideration and thus can easily assign several tasks spawned close
together to the same CPU within the allowed CPUs for a workload. While this
seems suboptimal, the load_balance() algorithm run on other idle cores in
the affinity mask should resolve imbalances that result from it.
As our test case was leaving cores idle for many seconds at a time, we dug
into the load_balance() code, specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessments of the ability of the
algorithm to migrate load. The group_imbalance path correctly sets the flag
to indicate the group cannot be properly balanced due to affinity, but the
redo condition right after this branch incorrectly assumes that there may
be other cores with work to be pulled by considering cores outside of the
scheduling domain in question.
Consider the following circumstance where (T) represents tasks and (*)
represents the allowed CPUs:
        A              B              C
    { 0   1 }      { 2   3 }      { 4   5 }
          T          T              T   T
          T
          *          *   *          *   *
In this case, core 0 in scheduling group 'A' should run load balance on its
local scheduling domain and try to pull work from core 1. When it fails to
move any work due to affinity, it will mark scheduling group 'A' as
"group_imbalance". However, right after doing so, core 0 load_balance()
evaluates the redo path with core 1 removed from consideration. This should
leave no other cores to consider and result in a jump to "out_all_pinned",
leaving the "group_imbalance" flag to be seen by future runs of the
algorithm on core 3. As the code stands currently, core 0 evaluates all
cores instead of those in the scheduling domain under consideration.
It incorrectly sees other cores in the system and attempts a redo.
With core 1 removed from the mask, the redo identifies no load imbalance
and results in a jump to "out_balanced", clearing the "group_imbalance"
flag.
We propose correcting this logic with patch 1 of this series.
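Condensed from patch 1 (reproduced in full later in this thread), with the
comments and loop bookkeeping trimmed, the redo check changes roughly as
follows:

	/* current code: any remaining active cpu, anywhere, triggers a redo */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_empty(cpus)) {
			...
			goto redo;
		}
	}

	/* proposed: only other potential busiest cpus in this sched domain count */
	if (unlikely(env.flags & LBF_ALL_PINNED)) {
		cpumask_clear_cpu(cpu_of(busiest), cpus);
		cpumask_clear_cpu(env.dst_cpu, cpus);
		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
			cpumask_set_cpu(env.dst_cpu, cpus);
			...
			goto redo;
		}
	}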
Making this correction revealed an issue in the calculate_imbalance() path.
Patch 2 removes a branch that no longer makes sense with the current
load_balance() algorithm: there is no scenario in which it benefits the
"group_imbalance" case in calculate_imbalance(), and it causes problems on
systems with affined workloads and many idle or lightly loaded CPUs.
Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
[V3]
-Fixed botched subject lines
[V2]
-Corrected use of Signed-off-by tags
-Removed temp cpumask variable from stack
Jeffrey Hugo (2):
sched/fair: Fix load_balance() affinity redo path
sched/fair: Remove group imbalance from calculate_imbalance()
kernel/sched/fair.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
	Jeffrey Hugo

If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source cpu from consideration and
attempts to redo and balance among the new subset of cpus.

There is a bug in this code path where the algorithm considers all active
cpus in the system (minus the source that was just masked out). This is
not valid for two reasons: some active cpus may not be in the current
scheduling domain and one of the active cpus is dst_cpu. These cpus should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid cpus and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other cpus in future
load_balance() runs as that algorithm intends.

Fix the check by removing cpus not in the current domain and the dst_cpu
from consideration, thus limiting the evaluation to valid remaining cpus
from which load might be migrated.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Co-authored-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d711093..a5d41b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	/* All tasks on this runqueue were pinned by CPU affinity */
 	if (unlikely(env.flags & LBF_ALL_PINNED)) {
 		cpumask_clear_cpu(cpu_of(busiest), cpus);
-		if (!cpumask_empty(cpus)) {
+		/*
+		 * dst_cpu is not a valid busiest cpu in the following
+		 * check since load cannot be pulled from dst_cpu to be
+		 * put on dst_cpu.
+		 */
+		cpumask_clear_cpu(env.dst_cpu, cpus);
+		/*
+		 * Go back to "redo" iff the load-balance cpumask
+		 * contains other potential busiest cpus for the
+		 * current sched domain.
+		 */
+		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
+			/*
+			 * Now that the check has passed, reenable
+			 * dst_cpu so that load can be calculated on
+			 * it in the redo path.
+			 */
+			cpumask_set_cpu(env.dst_cpu, cpus);
 			env.loop = 0;
 			env.loop_break = sched_nr_migrate_break;
 			goto redo;
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-19 13:31 UTC (permalink / raw)
To: Jeffrey Hugo, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Austin Christ, Tyler Baicar, Timur Tabi

On 18/05/17 20:36, Jeffrey Hugo wrote:

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d711093..a5d41b1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	/* All tasks on this runqueue were pinned by CPU affinity */
>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
> -		if (!cpumask_empty(cpus)) {
> +		/*
> +		 * dst_cpu is not a valid busiest cpu in the following
> +		 * check since load cannot be pulled from dst_cpu to be
> +		 * put on dst_cpu.
> +		 */
> +		cpumask_clear_cpu(env.dst_cpu, cpus);
> +		/*
> +		 * Go back to "redo" iff the load-balance cpumask
> +		 * contains other potential busiest cpus for the
> +		 * current sched domain.
> +		 */
> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
> +			/*
> +			 * Now that the check has passed, reenable
> +			 * dst_cpu so that load can be calculated on
> +			 * it in the redo path.
> +			 */
> +			cpumask_set_cpu(env.dst_cpu, cpus);

IMHO, this will work nicely and its way easier.

Another idea might be to check if the LBF_ALL_PINNED is set when we
check if we should clean the imbalance flag.

@@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
         * We reach balance although we may have faced some affinity
         * constraints. Clear the imbalance flag if it was set.
         */
-       if (sd_parent) {
+       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
                int *group_imbalance = &sd_parent->groups->sgc->imbalance;

                if (*group_imbalance)
                        *group_imbalance = 0;
        }

But I think preventing a needless redo loop is even better ...

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-22  9:48 UTC (permalink / raw)
To: Jeffrey Hugo, Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Austin Christ, Tyler Baicar, Timur Tabi

On 19/05/17 14:31, Dietmar Eggemann wrote:
> On 18/05/17 20:36, Jeffrey Hugo wrote:
>
> [...]
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d711093..a5d41b1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>> -		if (!cpumask_empty(cpus)) {
>> +		/*
>> +		 * dst_cpu is not a valid busiest cpu in the following
>> +		 * check since load cannot be pulled from dst_cpu to be
>> +		 * put on dst_cpu.
>> +		 */
>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>> +		/*
>> +		 * Go back to "redo" iff the load-balance cpumask
>> +		 * contains other potential busiest cpus for the
>> +		 * current sched domain.
>> +		 */
>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>> +			/*
>> +			 * Now that the check has passed, reenable
>> +			 * dst_cpu so that load can be calculated on
>> +			 * it in the redo path.
>> +			 */
>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>
> IMHO, this will work nicely and its way easier.

This was too quick ... if we still have other potential dst cpus
available and cpu_of(busiest) is the latest src cpu then this will fail.

It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
== NULL'.

But IMHO 'group_imbalance' propagation has to work on higher sd levels
as well.

> Another idea might be to check if the LBF_ALL_PINNED is set when we
> check if we should clean the imbalance flag.
>
> @@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>          * We reach balance although we may have faced some affinity
>          * constraints. Clear the imbalance flag if it was set.
>          */
> -       if (sd_parent) {
> +       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
>                 int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>
>                 if (*group_imbalance)
>                         *group_imbalance = 0;
>         }

[...]

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Christ, Austin @ 2017-05-22 19:57 UTC (permalink / raw)
To: Dietmar Eggemann, Jeffrey Hugo, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

Hey Dietmar,

On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
> On 19/05/17 14:31, Dietmar Eggemann wrote:
>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>
>> [...]
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d711093..a5d41b1 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>> -		if (!cpumask_empty(cpus)) {
>>> +		/*
>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>> +		 * check since load cannot be pulled from dst_cpu to be
>>> +		 * put on dst_cpu.
>>> +		 */
>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>> +		/*
>>> +		 * Go back to "redo" iff the load-balance cpumask
>>> +		 * contains other potential busiest cpus for the
>>> +		 * current sched domain.
>>> +		 */
>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>> +			/*
>>> +			 * Now that the check has passed, reenable
>>> +			 * dst_cpu so that load can be calculated on
>>> +			 * it in the redo path.
>>> +			 */
>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>> IMHO, this will work nicely and its way easier.
> This was too quick ... if we still have other potential dst cpus
> available and cpu_of(busiest) is the latest src cpu then this will fail.
>
> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
> == NULL'.
>
> But IMHO 'group_imbalance' propagation has to work on higher sd levels
> as well.
Can you clarify the fail case you are seeing? We are only aware of
dst_cpu being changed under [1] where a dst_cpu will try to move work to
one of its sched_group siblings.

I'm also not entirely sure I understand what you mean about the flag
being propagated to higher sd levels.
>> Another idea might be to check if the LBF_ALL_PINNED is set when we
>> check if we should clean the imbalance flag.
>>
>> @@ -8307,14 +8307,13 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>          * We reach balance although we may have faced some affinity
>>          * constraints. Clear the imbalance flag if it was set.
>>          */
>> -       if (sd_parent) {
>> +       if (sd_parent && !(env.flags & LBF_ALL_PINNED)) {
>>                 int *group_imbalance = &sd_parent->groups->sgc->imbalance;
>>
>>                 if (*group_imbalance)
>>                         *group_imbalance = 0;
>>         }
> [...]

[1] - http://elixir.free-electrons.com/linux/latest/source/kernel/sched/fair.c#L8140

--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Dietmar Eggemann @ 2017-05-23 11:45 UTC (permalink / raw)
To: Christ, Austin, Jeffrey Hugo, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

Hey Austin,

On 22/05/17 20:57, Christ, Austin wrote:
> Hey Dietmar,
>
>
> On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
>> On 19/05/17 14:31, Dietmar Eggemann wrote:
>>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>>
>>> [...]
>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index d711093..a5d41b1 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct
>>>> rq *this_rq,
>>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>>> -		if (!cpumask_empty(cpus)) {
>>>> +		/*
>>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>>> +		 * check since load cannot be pulled from dst_cpu to be
>>>> +		 * put on dst_cpu.
>>>> +		 */
>>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>>> +		/*
>>>> +		 * Go back to "redo" iff the load-balance cpumask
>>>> +		 * contains other potential busiest cpus for the
>>>> +		 * current sched domain.
>>>> +		 */
>>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>>> +			/*
>>>> +			 * Now that the check has passed, reenable
>>>> +			 * dst_cpu so that load can be calculated on
>>>> +			 * it in the redo path.
>>>> +			 */
>>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>>> IMHO, this will work nicely and its way easier.
>> This was too quick ... if we still have other potential dst cpus
>> available and cpu_of(busiest) is the latest src cpu then this will fail.
>>
>> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
>> == NULL'.
>>
>> But IMHO 'group_imbalance' propagation has to work on higher sd levels
>> as well.
> Can you clarify the fail case you are seeing? We are only aware of
> dst_cpu being changed under [1] where a dst_cpu will try to move work to
> one of its sched_group siblings.
>
> I'm also not entirely sure I understand what you mean about the flag
> being propagated to higher sd levels.

The propagation of 'imbalance' information should not only happen
between lowest sd (sd->child == NULL) and its parent (MC->DIE in your
example) but between all {sd, sd->parent} pairs.

Imagine your machine had another sd on top of DIE.

I recreated the issue I pointed out on my hikey board (2*4) (w/o this
extra sd on top of DIE), hotplug-ed out cpu 2,3,6,7 so I have a system
with the following DIE sched_groups (sg):

sg1(0,1) and sg2(4,5) <- the DIE level sg's contain more than 1 logical cpu.

As a workload I run 4 25% tasks affine to [0,1]. These tasks are
'SOURCE' PINNED for a DIE lb sg2<-sg1.

With:

if (unlikely(env.flags & LBF_ALL_PINNED)) {
	cpumask_clear_cpu(cpu_of(busiest), cpus);
	if (!cpumask_empty(cpus)) {
		...

		printk("goto redo: sd=%s dst_cpu=%d src_cpu=%d cpus=%*pbl
			dst_grpmask=%*pbl\n",
			sd->name, env.dst_cpu, cpu_of(busiest),
			cpumask_pr_args(cpus),
			cpumask_pr_args(env.dst_grpmask));

		goto redo;
	}

While running the workload I sometimes get:

...
goto redo: sd=DIE dst_cpu=4 src_cpu=1 cpus=0,4-5 dst_grpmask=4-5
goto redo: sd=DIE dst_cpu=4 src_cpu=0 cpus=4-5 dst_grpmask=4-5
...

So even though 'redo' handling has tried both possible src_cpu's we
would still enter another 'redo' path even you remove dst_cpu=4
temporarily because of cpu=5.

You could replace:

cpumask_clear_cpu(env.dst_cpu, cpus)
cpumask_set_cpu(env.dst_cpu, cpus)

with

cpumask_andnot(cpus, cpus, env.dst_grpmask)
cpumask_or(cpus, cpus, env.dst_grpmask)

but then env.dst_grpmask can't be set to NULL for CPU_NEWLY_IDLE and
you're almost at the snippet I sent out for v1:
https://marc.info/?l=linux-kernel&m=149486020010389&w=2

[...]

* Re: [PATCH V3 1/2] sched/fair: Fix load_balance() affinity redo path

From: Jeffrey Hugo @ 2017-05-23 22:05 UTC (permalink / raw)
To: Dietmar Eggemann, Christ, Austin, Ingo Molnar, Peter Zijlstra,
	linux-kernel
Cc: Tyler Baicar, Timur Tabi

On 5/23/2017 5:45 AM, Dietmar Eggemann wrote:
> Hey Austin,
>
> On 22/05/17 20:57, Christ, Austin wrote:
>> Hey Dietmar,
>>
>>
>> On 5/22/2017 3:48 AM, Dietmar Eggemann wrote:
>>> On 19/05/17 14:31, Dietmar Eggemann wrote:
>>>> On 18/05/17 20:36, Jeffrey Hugo wrote:
>>>>
>>>> [...]
>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index d711093..a5d41b1 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -8220,7 +8220,24 @@ static int load_balance(int this_cpu, struct
>>>>> rq *this_rq,
>>>>>  	/* All tasks on this runqueue were pinned by CPU affinity */
>>>>>  	if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>>>  		cpumask_clear_cpu(cpu_of(busiest), cpus);
>>>>> -		if (!cpumask_empty(cpus)) {
>>>>> +		/*
>>>>> +		 * dst_cpu is not a valid busiest cpu in the following
>>>>> +		 * check since load cannot be pulled from dst_cpu to be
>>>>> +		 * put on dst_cpu.
>>>>> +		 */
>>>>> +		cpumask_clear_cpu(env.dst_cpu, cpus);
>>>>> +		/*
>>>>> +		 * Go back to "redo" iff the load-balance cpumask
>>>>> +		 * contains other potential busiest cpus for the
>>>>> +		 * current sched domain.
>>>>> +		 */
>>>>> +		if (cpumask_intersects(cpus, sched_domain_span(env.sd))) {
>>>>> +			/*
>>>>> +			 * Now that the check has passed, reenable
>>>>> +			 * dst_cpu so that load can be calculated on
>>>>> +			 * it in the redo path.
>>>>> +			 */
>>>>> +			cpumask_set_cpu(env.dst_cpu, cpus);
>>>> IMHO, this will work nicely and its way easier.
>>> This was too quick ... if we still have other potential dst cpus
>>> available and cpu_of(busiest) is the latest src cpu then this will fail.
>>>
>>> It does work on sd with 'group_weight == 1', e.g. your MC sd 'sd->child
>>> == NULL'.
>>>
>>> But IMHO 'group_imbalance' propagation has to work on higher sd levels
>>> as well.
>> Can you clarify the fail case you are seeing? We are only aware of
>> dst_cpu being changed under [1] where a dst_cpu will try to move work to
>> one of its sched_group siblings.
>>
>> I'm also not entirely sure I understand what you mean about the flag
>> being propagated to higher sd levels.
>
> The propagation of 'imbalance' information should not only happen
> between lowest sd (sd->child == NULL) and its parent (MC->DIE in your
> example) but between all {sd, sd->parent} pairs.
>
> Imagine your machine had another sd on top of DIE.
>
> I recreated the issue I pointed out on my hikey board (2*4) (w/o this
> extra sd on top of DIE), hotplug-ed out cpu 2,3,6,7 so I have a system
> with the following DIE sched_groups (sg):
>
> sg1(0,1) and sg2(4,5) <- the DIE level sg's contain more than 1 logical cpu.
>
> As a workload I run 4 25% tasks affine to [0,1]. These tasks are
> 'SOURCE' PINNED for a DIE lb sg2<-sg1.
>
> With:
>
> if (unlikely(env.flags & LBF_ALL_PINNED)) {
> 	cpumask_clear_cpu(cpu_of(busiest), cpus);
> 	if (!cpumask_empty(cpus)) {
> 		...
>
> 		printk("goto redo: sd=%s dst_cpu=%d src_cpu=%d cpus=%*pbl
> 			dst_grpmask=%*pbl\n",
> 			sd->name, env.dst_cpu, cpu_of(busiest),
> 			cpumask_pr_args(cpus),
> 			cpumask_pr_args(env.dst_grpmask));
>
> 		goto redo;
> 	}
>
> While running the workload I sometimes get:
>
> ...
> goto redo: sd=DIE dst_cpu=4 src_cpu=1 cpus=0,4-5 dst_grpmask=4-5
> goto redo: sd=DIE dst_cpu=4 src_cpu=0 cpus=4-5 dst_grpmask=4-5
> ...
>
> So even though 'redo' handling has tried both possible src_cpu's we
> would still enter another 'redo' path even you remove dst_cpu=4
> temporarily because of cpu=5.
>
> You could replace:
>
> cpumask_clear_cpu(env.dst_cpu, cpus)
> cpumask_set_cpu(env.dst_cpu, cpus)
>
> with
>
> cpumask_andnot(cpus, cpus, env.dst_grpmask)
> cpumask_or(cpus, cpus, env.dst_grpmask)
>
> but then env.dst_grpmask can't be set to NULL for CPU_NEWLY_IDLE and
> you're almost at the snippet I sent out for v1:
> https://marc.info/?l=linux-kernel&m=149486020010389&w=2
>
> [...]
>

I see what you mean. You are right, our current proposal is flawed in
this regard. We'll take a look for a bit and see if we come up with any
other ideas.

--
Jeffrey Hugo

Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* [PATCH V3 2/2] sched/fair: Remove group imbalance from calculate_imbalance()

From: Jeffrey Hugo @ 2017-05-18 19:36 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, linux-kernel
Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Timur Tabi,
	Jeffrey Hugo

The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. That is not the case today.

The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.

At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
Calculate_imbalance() would not choose to run fix_small_imbalance() when we
expect it should. In the worst case, this can result in idle cpus.

Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.

Co-authored-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f783ba..3283561 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;

-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure cpu-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

* Re: [PATCH V3 0/2] load_balance() fixes for affinity

From: Peter Zijlstra @ 2017-05-22 15:52 UTC (permalink / raw)
To: Jeffrey Hugo
Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar, Timur Tabi

On Thu, May 18, 2017 at 01:36:01PM -0600, Jeffrey Hugo wrote:
> The group_imbalance path correctly sets the flag
> to indicate the group can not be properly balanced due to affinity, but the
> redo condition right after this branch incorrectly assumes that there may
> be other cores with work to be pulled by considering cores outside of the
> scheduling domain in question.

So its been a while since I looked at any of this, but from a quick
look, env->cpus appears to only be applied to group/balance masks.

In which case, we can easily do something like the below. Did I miss
something?

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 219fe58e3023..1724e4433f89 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8104,7 +8104,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	if (idle == CPU_NEWLY_IDLE)
 		env.dst_grpmask = NULL;

-	cpumask_copy(cpus, cpu_active_mask);
+	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);

 	schedstat_inc(sd->lb_count[idle]);

* Re: [PATCH V3 0/2] load_balance() fixes for affinity

From: Christ, Austin @ 2017-05-22 20:17 UTC (permalink / raw)
To: Peter Zijlstra, Jeffrey Hugo
Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Tyler Baicar, Timur Tabi

Hey Peter,

On 5/22/2017 9:52 AM, Peter Zijlstra wrote:
> On Thu, May 18, 2017 at 01:36:01PM -0600, Jeffrey Hugo wrote:
>
>> The group_imbalance path correctly sets the flag
>> to indicate the group can not be properly balanced due to affinity, but the
>> redo condition right after this branch incorrectly assumes that there may
>> be other cores with work to be pulled by considering cores outside of the
>> scheduling domain in question.
> So its been a while since I looked at any of this, but from a quick
> look, env->cpus appears to only be applied to group/balance masks.
>
> In which case, we can easily do something like the below. Did I miss
> something?
We have looked through and agree with your proposed change; however, we
would still need to mask out the dst_cpu when considering the redo path.

We will include this modification in the next patch set.
>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 219fe58e3023..1724e4433f89 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8104,7 +8104,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  	if (idle == CPU_NEWLY_IDLE)
>  		env.dst_grpmask = NULL;
>
> -	cpumask_copy(cpus, cpu_active_mask);
> +	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
>  	schedstat_inc(sd->lb_count[idle]);
>

--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.