[RFC 0/2] load_balance() fixes for affinity

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC 0/2] load_balance() fixes for affinity
@ 2017-05-12 17:01 Jeffrey Hugo
  2017-05-12 17:01 ` [RFC 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
  2017-05-12 17:01 ` [RFC 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo
  0 siblings, 2 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 17:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Jeffrey Hugo

We noticed in testing that affined workloads do not provide consistent
performance under certain circumstances. To isolate the issue, we began
testing with a representative CPU workload. In some cases the CPU workload
results would vary by a large percentage from run to run. We used JTAG and
scheduler trace to try and profile the issue and noticed that at least one
core was left idle for the duration of the test in the low score runs.

The specific test methodology used in our investigation involved using an
executable that was compiled to fork a specific number of workers. The
executable was then affined to a number of cores equal to the number of
workers forked, using the userspace taskset utility.  The expectation was
that all cores in the mask of runnable CPUs would run a process until the
work was complete. In practice, for several affinity masks, one to two
cores would not receive work until late in the test or not at all.

Our first observation was that it appeared initial cpu assignment of the
forked threads appeared suboptimal.  We observed that it was highly
probable that many or all of the threads would be initially assigned to the
same CPU. Our limited investigation into this indicated that the forking of
the threads occurs fast enough that load estimates for CPUs have not
adjusted to the work they have been assigned. The function
select_task_rq_fair() does not take number of running process into
consideration and thus can easily assign several tasks spawned close
together to the same CPU within the allowed CPUs for a workload. While this
seems suboptimal, the load_balance() algorithm run on other idle cores in
the affinity mask should resolve imbalances that result from it.

As our test case was leaving cores idle for many seconds at a time, we dug
into load_balance() code and specifically the group_imbalance path. We
added ftrace to the path in question and noticed that two branches in
load_balance() conflict in their assessments of the ability of the
algorithm to migrate load. The group_imbalance path correctly sets the flag
to indicate the group can not be properly balanced due to affinity, but the
redo condition right after this branch incorrectly assumes that there may
be other cores with work to be pulled by considering cores outside of the
scheduling domain in question.

Consider the following circumstance where (T) represents tasks and (*)
represents the allowed CPUs:

   A       B       C
{ 0 1 } { 2 3 } { 4 5 }
    T     T       T T
    T
    *     * *     * *

In this case, core 0 in scheduling group 'A' should run load balance on its
local scheduling domain and try to pull work from core 1. When it fails to
move any work due to affinity, it will mark scheduling group 'A' as
"group_imbalance". However, right after doing so, core 0 load_balance()
evaluates the redo path with core 1 removed from consideration. This should
leave no other cores to consider and result in a jump to "out_all_pinned",
leaving the "group_imbalance" flag to be seen by future runs of the
algorithm on core 3. As the code stands currently, core 0 evaluates all
cores instead of those in the scheduling domain under consideration.
It incorrectly sees other cores in the system and attempts a redo.
With core 1 removed from the mask, the redo identifies no load imbalance
and results in a jump to "out_balanced", clearing the "group_imbalance"
flag.

We propose correcting this logic with patch 1 of this series.

Making this correction revealed an issue in the calculate_imbalance() path.
Patch 2 removes a branch that does not make sense with the current
load_balance() algorithm because it has no scenario where it benifits the
"group_imbalance" case in calculate_imbalance() and which causes problems
in systems with affined workloads and many idle or lightly loaded CPUs.

Signed-off-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>

Jeffrey Hugo (2):
  sched/fair: Fix load_balance() affinity redo path
  sched/fair: Remove group imbalance from calculate_imbalance()

 kernel/sched/fair.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 17:01 [RFC 0/2] load_balance() fixes for affinity Jeffrey Hugo
@ 2017-05-12 17:01 ` Jeffrey Hugo
  2017-05-12 17:23   ` Peter Zijlstra
  2017-05-12 20:47   ` Peter Zijlstra
  2017-05-12 17:01 ` [RFC 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo
  1 sibling, 2 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 17:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Jeffrey Hugo

If load_balance() fails to migrate any tasks because all tasks were
affined, load_balance() removes the source cpu from consideration and
attempts to redo and balance among the new subset of cpus.

There is a bug in this code path where the algorithm considers all active
cpus in the system (minus the source that was just masked out).  This is
not valid for two reasons: some active cpus may not be in the current
scheduling domain and one of the active cpus is dst_cpu. These cpus should
not be considered, as we cannot pull load from them.

Instead of failing out of load_balance(), we may end up redoing the search
with no valid cpus and incorrectly concluding the domain is balanced.
Additionally, if the group_imbalance flag was just set, it may also be
incorrectly unset, thus the flag will not be seen by other cpus in future
load_balance() runs as that algorithm intends.

Fix the check by removing cpus not in the current domain and the dst_cpu
from considertation, thus limiting the evaluation to valid remaining cpus
from which load might be migrated.

Signed-off-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d711093..8f783ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,

 		/* All tasks on this runqueue were pinned by CPU affinity */
 		if (unlikely(env.flags & LBF_ALL_PINNED)) {
+			struct cpumask tmp;
+
+			/* Cpumask of all initially possible busiest cpus. */
+			cpumask_copy(&tmp, sched_domain_span(env.sd));
+			cpumask_clear_cpu(env.dst_cpu, &tmp);
+
 			cpumask_clear_cpu(cpu_of(busiest), cpus);
-			if (!cpumask_empty(cpus)) {
+			/*
+			 * Go back to "redo" iff the load-balance cpumask
+			 * contains other potential busiest cpus for the
+			 * current sched domain.
+			 */
+			if (cpumask_intersects(cpus, &tmp)) {
 				env.loop = 0;
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 17:01 ` [RFC 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
@ 2017-05-12 17:23   ` Peter Zijlstra
  2017-05-12 17:29     ` Jeffrey Hugo
  2017-05-12 20:47   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-12 17:23 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:

> Signed-off-by: Austin Christ <austinwc@codeaurora.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>

So per that Chain Austin wrote the patch, who handed it to Dietmar, who
handed it to you. Except I don't see a From: Austin on.

What gives?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 17:23   ` Peter Zijlstra
@ 2017-05-12 17:29     ` Jeffrey Hugo
  2017-05-12 20:44       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On 5/12/2017 11:23 AM, Peter Zijlstra wrote:
> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
> 
>> Signed-off-by: Austin Christ <austinwc@codeaurora.org>
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
> 
> So per that Chain Austin wrote the patch, who handed it to Dietmar, who
> handed it to you. Except I don't see a From: Austin on.
> 
> What gives?
> 

Austin and I did the investigations and wrote the initial version.  We 
discussed it with Dietmar, who suggested some significant rewrites which 
we felt added to the readability of the code.  The current version 
posted on the list you've seen was basically written by all three of us, 
so I listed the authors in alphabetical order to properly give credit to 
all involved.

Is there a better way to handle patches which have authorship from 
multiple people?

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 17:29     ` Jeffrey Hugo
@ 2017-05-12 20:44       ` Peter Zijlstra
  2017-05-12 20:54         ` Jeffrey Hugo
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-12 20:44 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On Fri, May 12, 2017 at 11:29:05AM -0600, Jeffrey Hugo wrote:
> On 5/12/2017 11:23 AM, Peter Zijlstra wrote:
> >On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
> >
> >>Signed-off-by: Austin Christ <austinwc@codeaurora.org>
> >>Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
> >
> >So per that Chain Austin wrote the patch, who handed it to Dietmar, who
> >handed it to you. Except I don't see a From: Austin on.
> >
> >What gives?
> >
> 
> Austin and I did the investigations and wrote the initial version.  We
> discussed it with Dietmar, who suggested some significant rewrites which we
> felt added to the readability of the code.  The current version posted on
> the list you've seen was basically written by all three of us, so I listed
> the authors in alphabetical order to properly give credit to all involved.
> 
> Is there a better way to handle patches which have authorship from multiple
> people?

Well, Signed-off-by is only a chain of custody thing. It says who
handled the patches and that they have the right to publish and that
sorts of thing. We have a document describing this.

It does _NOT_ however imply any kind of authorship what so ever. Of
course, the author must be the first in the custody chain, how else
could the patch 'escape'.

Authorship comes from the Author: header, and there's only 1 of those.

Just mention the people by name in the Changelog or something.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 20:44       ` Peter Zijlstra
@ 2017-05-12 20:54         ` Jeffrey Hugo
  0 siblings, 0 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On 5/12/2017 2:44 PM, Peter Zijlstra wrote:
> On Fri, May 12, 2017 at 11:29:05AM -0600, Jeffrey Hugo wrote:
>> On 5/12/2017 11:23 AM, Peter Zijlstra wrote:
>>> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
>>>
>>>> Signed-off-by: Austin Christ <austinwc@codeaurora.org>
>>>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>>> Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
>>>
>>> So per that Chain Austin wrote the patch, who handed it to Dietmar, who
>>> handed it to you. Except I don't see a From: Austin on.
>>>
>>> What gives?
>>>
>>
>> Austin and I did the investigations and wrote the initial version.  We
>> discussed it with Dietmar, who suggested some significant rewrites which we
>> felt added to the readability of the code.  The current version posted on
>> the list you've seen was basically written by all three of us, so I listed
>> the authors in alphabetical order to properly give credit to all involved.
>>
>> Is there a better way to handle patches which have authorship from multiple
>> people?
> 
> Well, Signed-off-by is only a chain of custody thing. It says who
> handled the patches and that they have the right to publish and that
> sorts of thing. We have a document describing this.
> 
> It does _NOT_ however imply any kind of authorship what so ever. Of
> course, the author must be the first in the custody chain, how else
> could the patch 'escape'.
> 
> Authorship comes from the Author: header, and there's only 1 of those.
> 
> Just mention the people by name in the Changelog or something.
> 

I'm not entirely sure I agree with that assessment, but I'll discuss it 
with the folks offline and see what we want to roll into the next patch 
version after seeing what the other comments are.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 17:01 ` [RFC 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
  2017-05-12 17:23   ` Peter Zijlstra
@ 2017-05-12 20:47   ` Peter Zijlstra
  2017-05-12 20:57     ` Jeffrey Hugo
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2017-05-12 20:47 UTC (permalink / raw)
  To: Jeffrey Hugo
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d711093..8f783ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  
>  		/* All tasks on this runqueue were pinned by CPU affinity */
>  		if (unlikely(env.flags & LBF_ALL_PINNED)) {
> +			struct cpumask tmp;

You cannot have cpumask's on stack.

> +
> +			/* Cpumask of all initially possible busiest cpus. */
> +			cpumask_copy(&tmp, sched_domain_span(env.sd));
> +			cpumask_clear_cpu(env.dst_cpu, &tmp);

You forgot to mask with cpu_active_mask.

> +
>  			cpumask_clear_cpu(cpu_of(busiest), cpus);
> -			if (!cpumask_empty(cpus)) {
> +			/*
> +			 * Go back to "redo" iff the load-balance cpumask
> +			 * contains other potential busiest cpus for the
> +			 * current sched domain.
> +			 */
> +			if (cpumask_intersects(cpus, &tmp)) {
>  				env.loop = 0;
>  				env.loop_break = sched_nr_migrate_break;
>  				goto redo;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 20:47   ` Peter Zijlstra
@ 2017-05-12 20:57     ` Jeffrey Hugo
  2017-05-15 14:56       ` Dietmar Eggemann
  0 siblings, 1 reply; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 20:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dietmar Eggemann, Austin Christ,
	Tyler Baicar

On 5/12/2017 2:47 PM, Peter Zijlstra wrote:
> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d711093..8f783ba 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>   
>>   		/* All tasks on this runqueue were pinned by CPU affinity */
>>   		if (unlikely(env.flags & LBF_ALL_PINNED)) {
>> +			struct cpumask tmp;
> 
> You cannot have cpumask's on stack.

Well, we need a temp variable to store the intermediate values since the 
cpumask_* operations are somewhat limited, and require a "storage" 
parameter.

Do you have any suggestions to meet all of these requirements?

> 
>> +
>> +			/* Cpumask of all initially possible busiest cpus. */
>> +			cpumask_copy(&tmp, sched_domain_span(env.sd));
>> +			cpumask_clear_cpu(env.dst_cpu, &tmp);
> 
> You forgot to mask with cpu_active_mask.

cpus == cpu_active_mask, which we compare against below in just a few 
lines with the cpumask_intersects check.  So, no, I don't think we did 
forget to mask with cpu_active_mask.

> 
>> +
>>   			cpumask_clear_cpu(cpu_of(busiest), cpus);
>> -			if (!cpumask_empty(cpus)) {
>> +			/*
>> +			 * Go back to "redo" iff the load-balance cpumask
>> +			 * contains other potential busiest cpus for the
>> +			 * current sched domain.
>> +			 */
>> +			if (cpumask_intersects(cpus, &tmp)) {
>>   				env.loop = 0;
>>   				env.loop_break = sched_nr_migrate_break;
>>   				goto redo;


-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-12 20:57     ` Jeffrey Hugo
@ 2017-05-15 14:56       ` Dietmar Eggemann
  2017-05-18 14:31         ` Jeffrey Hugo
  0 siblings, 1 reply; 11+ messages in thread
From: Dietmar Eggemann @ 2017-05-15 14:56 UTC (permalink / raw)
  To: Jeffrey Hugo, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Austin Christ, Tyler Baicar

On 12/05/17 21:57, Jeffrey Hugo wrote:
> On 5/12/2017 2:47 PM, Peter Zijlstra wrote:
>> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index d711093..8f783ba 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct
>>> rq *this_rq,
>>>             /* All tasks on this runqueue were pinned by CPU affinity */
>>>           if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>> +            struct cpumask tmp;
>>
>> You cannot have cpumask's on stack.
> 
> Well, we need a temp variable to store the intermediate values since the
> cpumask_* operations are somewhat limited, and require a "storage"
> parameter.
> 
> Do you have any suggestions to meet all of these requirements?

What about we use env.dst_grpmask and check if cpus is an improper
subset of env.dst_grpmask? In this case we have to get rid of
setting env.dst_grpmask = NULL in case of CPU_NEWLY_IDLE which is
IMHO not an issue since it's idle is passed via env into
can_migrate_task().
And cpus has to be and'ed with sched_domain_span(env.sd).

I'm not sure if this will work with 'not fully connected NUMA' (SD_OVERLAP)
though ...

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276fcb62..2ede4c1c9db8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6737,10 +6737,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
                 * our sched_group. We may want to revisit it if we couldn't
                 * meet load balance goals by pulling other tasks on src_cpu.
                 *
-                * Also avoid computing new_dst_cpu if we have already computed
-                * one in current iteration.
+                * Avoid computing new_dst_cpu for NEWLY_IDLE or if we have
+                * already computed one in current iteration.
                 */
-               if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
+               if (env->idle == CPU_NEWLY_IDLE || (env->flags & LBF_DST_PINNED))
                        return 0;
 
                /* Prevent to re-select dst_cpu via env's cpus */
@@ -8091,14 +8091,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                .tasks          = LIST_HEAD_INIT(env.tasks),
        };
 
-       /*
-        * For NEWLY_IDLE load_balancing, we don't need to consider
-        * other cpus in our group
-        */
-       if (idle == CPU_NEWLY_IDLE)
-               env.dst_grpmask = NULL;
-
-       cpumask_copy(cpus, cpu_active_mask);
+       cpumask_and(cpus, cpu_active_mask, sched_domain_span(env.sd));
 
        schedstat_inc(sd->lb_count[idle]);
 
@@ -8220,7 +8213,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                /* All tasks on this runqueue were pinned by CPU affinity */
                if (unlikely(env.flags & LBF_ALL_PINNED)) {
                        cpumask_clear_cpu(cpu_of(busiest), cpus);
-                       if (!cpumask_empty(cpus)) {
+                       if (!cpumask_subset(cpus, env.dst_grpmask)) {
                                env.loop = 0;
                                env.loop_break = sched_nr_migrate_break;
                                goto redo;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC 1/2] sched/fair: Fix load_balance() affinity redo path
  2017-05-15 14:56       ` Dietmar Eggemann
@ 2017-05-18 14:31         ` Jeffrey Hugo
  0 siblings, 0 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-18 14:31 UTC (permalink / raw)
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Austin Christ, Tyler Baicar

On 5/15/2017 8:56 AM, Dietmar Eggemann wrote:
> On 12/05/17 21:57, Jeffrey Hugo wrote:
>> On 5/12/2017 2:47 PM, Peter Zijlstra wrote:
>>> On Fri, May 12, 2017 at 11:01:37AM -0600, Jeffrey Hugo wrote:
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index d711093..8f783ba 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -8219,8 +8219,19 @@ static int load_balance(int this_cpu, struct
>>>> rq *this_rq,
>>>>              /* All tasks on this runqueue were pinned by CPU affinity */
>>>>            if (unlikely(env.flags & LBF_ALL_PINNED)) {
>>>> +            struct cpumask tmp;
>>>
>>> You cannot have cpumask's on stack.
>>
>> Well, we need a temp variable to store the intermediate values since the
>> cpumask_* operations are somewhat limited, and require a "storage"
>> parameter.
>>
>> Do you have any suggestions to meet all of these requirements?
> 
> What about we use env.dst_grpmask and check if cpus is an improper
> subset of env.dst_grpmask? In this case we have to get rid of
> setting env.dst_grpmask = NULL in case of CPU_NEWLY_IDLE which is
> IMHO not an issue since it's idle is passed via env into
> can_migrate_task().
> And cpus has to be and'ed with sched_domain_span(env.sd).
> 
> I'm not sure if this will work with 'not fully connected NUMA' (SD_OVERLAP)
> though ...

Hmm.  I follow the idea, but I'm not too confident in the SD_OVERLAP 
case, and looking at your proposed code, it seems invasive to me - 
changes are needed in what would otherwise be unrelated sections of 
code.  I'd prefer not to go in that direction.  Also, it appears that 
the dst_cpu is still considered as a source for load.

We've got a different idea to address the stack issue, and still keep 
the change "contained", which I'll roll into a V2 today or tomorrow.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC 2/2] sched/fair: Remove group imbalance from calculate_imbalance()
  2017-05-12 17:01 [RFC 0/2] load_balance() fixes for affinity Jeffrey Hugo
  2017-05-12 17:01 ` [RFC 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
@ 2017-05-12 17:01 ` Jeffrey Hugo
  1 sibling, 0 replies; 11+ messages in thread
From: Jeffrey Hugo @ 2017-05-12 17:01 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel
  Cc: Dietmar Eggemann, Austin Christ, Tyler Baicar, Jeffrey Hugo

The group_imbalance path in calculate_imbalance() made sense when it was
added back in 2007 with commit 908a7c1b9b80 ("sched: fix improper load
balance across sched domain") because busiest->load_per_task factored into
the amount of imbalance that was calculated. That is not the case today.

The group_imbalance path can only affect the outcome of
calculate_imbalance() when the average load of the domain is less than the
original busiest->load_per_task. In this case, busiest->load_per_task is
overwritten with the scheduling domain load average. Thus
busiest->load_per_task no longer represents actual load that can be moved.

At the final comparison between env->imbalance and busiest->load_per_task,
imbalance may be larger than the new busiest->load_per_task causing the
check to fail under the assumption that there is a task that could be
migrated to satisfy the imbalance. However env->imbalance may still be
smaller than the original busiest->load_per_task, thus it is unlikely that
there is a task that can be migrated to satisfy the imbalance.
Calculate_imbalance() would not choose to run fix_small_imbalance() when we
expect it should. In the worst case, this can result in idle cpus.

Since the group imbalance path in calculate_imbalance() is at best a NOP
but otherwise harmful, remove it.

Signed-off-by: Austin Christ <austinwc@codeaurora.org>
Signed-off-by: Jeffrey Hugo <jhugo@codeaurora.org>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
---
 kernel/sched/fair.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f783ba..3283561 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7760,15 +7760,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;

-	if (busiest->group_type == group_imbalanced) {
-		/*
-		 * In the group_imb case we cannot rely on group-wide averages
-		 * to ensure cpu-load equilibrium, look at wider averages. XXX
-		 */
-		busiest->load_per_task =
-			min(busiest->load_per_task, sds->avg_load);
-	}
-
 	/*
 	 * Avg load of busiest sg can be less and avg load of local sg can
 	 * be greater than avg load across all sgs of sd because avg load
-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-05-18 14:32 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-05-12 17:01 [RFC 0/2] load_balance() fixes for affinity Jeffrey Hugo
2017-05-12 17:01 ` [RFC 1/2] sched/fair: Fix load_balance() affinity redo path Jeffrey Hugo
2017-05-12 17:23   ` Peter Zijlstra
2017-05-12 17:29     ` Jeffrey Hugo
2017-05-12 20:44       ` Peter Zijlstra
2017-05-12 20:54         ` Jeffrey Hugo
2017-05-12 20:47   ` Peter Zijlstra
2017-05-12 20:57     ` Jeffrey Hugo
2017-05-15 14:56       ` Dietmar Eggemann
2017-05-18 14:31         ` Jeffrey Hugo
2017-05-12 17:01 ` [RFC 2/2] sched/fair: Remove group imbalance from calculate_imbalance() Jeffrey Hugo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.