* [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
@ 2014-07-22 18:45 Rik van Riel
2014-07-23 7:41 ` Vincent Guittot
2014-07-25 15:27 ` Peter Zijlstra
0 siblings, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-22 18:45 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen,
nicolas.pitre
Currently update_sd_pick_busiest only returns true when an sd
is overloaded, or for SD_ASYM_PACKING when a domain is busier
than average and a higher numbered domain than the target.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true when the busiest sd yet is encountered.
On a 4 node system, this seems to result in the load balancer finally
putting 1 thread of a 4 thread test run of "perf bench numa mem" on
each node, where before the load was generally not spread across all
nodes.
Behaviour for SD_ASYM_PACKING does not seem to match the comment,
in that groups with below average load average are ignored, but I
have no hardware to test that so I have left the behaviour of that
code unchanged.
Cc: mikey@neuling.org
Cc: peterz@infradead.org
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..ff4ddba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * numbered CPUs in the group, therefore mark all groups
 	 * higher than ourself as busy.
 	 */
-	if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
-	    env->dst_cpu < group_first_cpu(sg)) {
-		if (!sds->busiest)
-			return true;
+	if (env->sd->flags & SD_ASYM_PACKING) {
+		if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
+			if (!sds->busiest)
+				return true;
 
-		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
-			return true;
+			if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
+				return true;
+		}
+
+		return false;
 	}
 
-	return false;
+	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
+	return true;
 }
 
 #ifdef CONFIG_NUMA_BALANCING
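For readers following the thread, the post-patch decision flow can be modeled outside the kernel roughly as below. This is a simplified sketch: `struct group_stats` and `pick_busiest` are illustrative stand-ins, not the kernel's real `sg_lb_stats`/`sched_group` types, and the first check models the `avg_load` early return at the top of the real function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative stand-in for the per-group statistics the real code uses. */
struct group_stats {
	unsigned long avg_load;
	unsigned int nr_running;
	int first_cpu;
};

/* Sketch of update_sd_pick_busiest() after the patch; busiest may be
 * NULL when no busiest group has been recorded yet. */
static bool pick_busiest(bool asym_packing, int dst_cpu,
			 const struct group_stats *sg,
			 const struct group_stats *busiest)
{
	/* Models the early return at the top of the real function:
	 * a group that is not busier than the current busiest loses. */
	if (busiest && sg->avg_load <= busiest->avg_load)
		return false;

	if (asym_packing) {
		/* Lower-numbered CPUs pull work from higher-numbered groups. */
		if (sg->nr_running && dst_cpu < sg->first_cpu) {
			if (!busiest)
				return true;
			if (busiest->first_cpu > sg->first_cpu)
				return true;
		}
		return false;
	}

	/* !SD_ASYM_PACKING: this is the busiest group seen so far. */
	return true;
}
```

The behavioural change is only the final line: previously the !SD_ASYM_PACKING path fell through to `return false` unless the group was overloaded.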
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-22 18:45 [PATCH] sched: make update_sd_pick_busiest return true on a busier sd Rik van Riel
@ 2014-07-23 7:41 ` Vincent Guittot
2014-07-25 13:33 ` Rik van Riel
2014-07-25 14:02 ` Rik van Riel
2014-07-25 15:27 ` Peter Zijlstra
1 sibling, 2 replies; 19+ messages in thread
From: Vincent Guittot @ 2014-07-23 7:41 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
> Currently update_sd_pick_busiest only returns true when an sd
> is overloaded, or for SD_ASYM_PACKING when a domain is busier
> than average and a higher numbered domain than the target.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> On a 4 node system, this seems to result in the load balancer finally
> putting 1 thread of a 4 thread test run of "perf bench numa mem" on
> each node, where before the load was generally not spread across all
> nodes.
>
> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> in that groups with below average load average are ignored, but I
> have no hardware to test that so I have left the behaviour of that
> code unchanged.
>
> Cc: mikey@neuling.org
> Cc: peterz@infradead.org
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fea7d33..ff4ddba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * numbered CPUs in the group, therefore mark all groups
> * higher than ourself as busy.
> */
> - if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
> - env->dst_cpu < group_first_cpu(sg)) {
> - if (!sds->busiest)
> - return true;
> + if (env->sd->flags & SD_ASYM_PACKING) {
> + if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
> + if (!sds->busiest)
> + return true;
>
> - if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
> - return true;
> + if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
> + return true;
> + }
> +
> + return false;
> }
>
> - return false;
> + /* See above: sgs->avg_load > sds->busiest_stat.avg_load */
> + return true;
Hi Rik,
I can see one issue with a default return of true: it increases the number of cases where we will not effectively migrate a task, because we no longer ensure that we pick the overloaded group if there is one. We can end up in a situation where a group is overloaded, but load_balance selects a non-overloaded group whose average load is higher than the sched_domain average, just because that group is checked later.
Regarding your issue with "perf bench numa mem" not being spread across all nodes: the SD_PREFER_SIBLING flag (of the DIE level) should do the job by reducing the capacity of the non-local DIE groups at NUMA level to 1 task during the load-balance computation. So you should end up with 1 task per sched_group at NUMA level.
Vincent
> }
>
> #ifdef CONFIG_NUMA_BALANCING
>
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-23 7:41 ` Vincent Guittot
@ 2014-07-25 13:33 ` Rik van Riel
2014-07-25 14:29 ` Vincent Guittot
2014-07-25 14:02 ` Rik van Riel
1 sibling, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 13:33 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/23/2014 03:41 AM, Vincent Guittot wrote:
> On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
>> Currently update_sd_pick_busiest only returns true when an sd is
>> overloaded, or for SD_ASYM_PACKING when a domain is busier than
>> average and a higher numbered domain than the target.
>>
>> This breaks load balancing between domains that are not
>> overloaded, in the !SD_ASYM_PACKING case. This patch makes
>> update_sd_pick_busiest return true when the busiest sd yet is
>> encountered.
>>
>> On a 4 node system, this seems to result in the load balancer
>> finally putting 1 thread of a 4 thread test run of "perf bench
>> numa mem" on each node, where before the load was generally not
>> spread across all nodes.
>>
>> Behaviour for SD_ASYM_PACKING does not seem to match the
>> comment, in that groups with below average load average are
>> ignored, but I have no hardware to test that so I have left the
>> behaviour of that code unchanged.
>>
>> Cc: mikey@neuling.org
>> Cc: peterz@infradead.org
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>>  kernel/sched/fair.c | 18 +++++++++++-------
>>  1 file changed, 11 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index fea7d33..ff4ddba 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>  	 * numbered CPUs in the group, therefore mark all groups
>>  	 * higher than ourself as busy.
>>  	 */
>> -	if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
>> -	    env->dst_cpu < group_first_cpu(sg)) {
>> -		if (!sds->busiest)
>> -			return true;
>> +	if (env->sd->flags & SD_ASYM_PACKING) {
>> +		if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
>> +			if (!sds->busiest)
>> +				return true;
>>
>> -		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>> -			return true;
>> +			if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>> +				return true;
>> +		}
>> +
>> +		return false;
>>  	}
>>
>> -	return false;
>> +	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
>> +	return true;
>
> Hi Rik,
>
> I can see one issue with a default return set to true. You
> increase the number of time where we will not effectively migrate a
> task because we don't ensure that we will take the overloaded group
> if there is one. We can be in a situation where a group is
> overloaded but the load_balance will select a not overloaded group
> with an average load higher than sched_domain average value just
> because it is checked after.
Look at the first line of update_sd_pick_busiest()
static bool update_sd_pick_busiest(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;
If the group load is less than the busiest, we have already
returned false, and will not get to the code that I changed.
> Regarding your issue with "perf bench numa mem" that is not spread
> on all nodes, SD_PREFER_SIBLING flag (of DIE level) should do the
> job by reducing the capacity of "not local DIE" group at NUMA
> level to 1 task during the load balance computation. So you should
> have 1 task per sched_group at NUMA level.
That did not actually happen in my tests. I almost always
ended up with 0 tasks on one node.
--
All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-23 7:41 ` Vincent Guittot
2014-07-25 13:33 ` Rik van Riel
@ 2014-07-25 14:02 ` Rik van Riel
2014-07-25 14:15 ` Peter Zijlstra
2014-07-25 15:02 ` Vincent Guittot
1 sibling, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 14:02 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/23/2014 03:41 AM, Vincent Guittot wrote:
> Regarding your issue with "perf bench numa mem" that is not spread
> on all nodes, SD_PREFER_SIBLING flag (of DIE level) should do the
> job by reducing the capacity of "not local DIE" group at NUMA
> level to 1 task during the load balance computation. So you should
> have 1 task per sched_group at NUMA level.
Looking at the code some more, it is clear why this does not
happen. If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will
never be set.
On a related note, that part of the load balancing code probably
needs to be rewritten to deal with unequal group_capacity_factors
anyway.
Say that one group has a group_capacity_factor twice that of
another group.
The group with the smaller group_capacity_factor is overloaded
by a factor 1.3. The larger group is loaded by a factor 0.8.
This means the larger group has a higher load than the first
group, and the current code in update_sd_pick_busiest will
not select the overloaded group as the busiest one, due to not
scaling load with the capacity...
static bool update_sd_pick_busiest(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;
I believe we may need to factor the group_capacity_factor
into this calculation, in order to properly identify which
group is busiest.
However, if we do that we may need to get rid of the
SD_PREFER_SIBLING hack that forces group_capacity_factor
to 1 on domains that have SD_PREFER_SIBLING set.
I suspect that should be ok though, if we make sure
update_sd_pick_busiest does the right thing...
--
All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 14:02 ` Rik van Riel
@ 2014-07-25 14:15 ` Peter Zijlstra
2014-07-25 15:02 ` Vincent Guittot
1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-07-25 14:15 UTC (permalink / raw)
To: Rik van Riel
Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On Fri, Jul 25, 2014 at 10:02:43AM -0400, Rik van Riel wrote:
> On a related note, that part of the load balancing code probably
> needs to be rewritten to deal with unequal group_capacity_factors
> anyway.
>
> Say that one group has a group_capacity_factor twice that of
> another group.
>
> The group with the smaller group_capacity_factor is overloaded
> by a factor 1.3. The larger group is loaded by a factor 0.8.
> This means the larger group has a higher load than the first
> group, and the current code in update_sd_pick_busiest will
> not select the overloaded group as the busiest one, due to not
> scaling load with the capacity...
>
> static bool update_sd_pick_busiest(struct lb_env *env,
> struct sd_lb_stats *sds,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {
> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;
>
> I believe we may need to factor the group_capacity_factor
> into this calculation, in order to properly identify which
> group is busiest.
(sorry, going through this backwards, I'll get to the actual patch in a
bit)
Note how update_sg_lb_stats() where we compute sgs->avg_load we do
divide by sgs->group_capacity.
(also, curse this renaming of stuff)
The group_capacity_factor is something ugly and Vincent was going to
poke at that.
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 13:33 ` Rik van Riel
@ 2014-07-25 14:29 ` Vincent Guittot
2014-07-25 14:46 ` Rik van Riel
0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-07-25 14:29 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 25 July 2014 15:33, Rik van Riel <riel@redhat.com> wrote:
>
> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>> On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
>>> Currently update_sd_pick_busiest only returns true when an sd is
>>> overloaded, or for SD_ASYM_PACKING when a domain is busier than
>>> average and a higher numbered domain than the target.
>>>
>>> This breaks load balancing between domains that are not
>>> overloaded, in the !SD_ASYM_PACKING case. This patch makes
>>> update_sd_pick_busiest return true when the busiest sd yet is
>>> encountered.
>>>
>>> On a 4 node system, this seems to result in the load balancer
>>> finally putting 1 thread of a 4 thread test run of "perf bench
>>> numa mem" on each node, where before the load was generally not
>>> spread across all nodes.
>>>
>>> Behaviour for SD_ASYM_PACKING does not seem to match the
>>> comment, in that groups with below average load average are
>>> ignored, but I have no hardware to test that so I have left the
>>> behaviour of that code unchanged.
>>>
>>> Cc: mikey@neuling.org
>>> Cc: peterz@infradead.org
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>>> ---
>>>  kernel/sched/fair.c | 18 +++++++++++-------
>>>  1 file changed, 11 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index fea7d33..ff4ddba 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  	 * numbered CPUs in the group, therefore mark all groups
>>>  	 * higher than ourself as busy.
>>>  	 */
>>> -	if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
>>> -	    env->dst_cpu < group_first_cpu(sg)) {
>>> -		if (!sds->busiest)
>>> -			return true;
>>> +	if (env->sd->flags & SD_ASYM_PACKING) {
>>> +		if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
>>> +			if (!sds->busiest)
>>> +				return true;
>>>
>>> -		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>>> -			return true;
>>> +			if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>>> +				return true;
>>> +		}
>>> +
>>> +		return false;
>>>  	}
>>>
>>> -	return false;
>>> +	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
>>> +	return true;
>>
>> Hi Rik,
>>
>> I can see one issue with a default return set to true. You
>> increase the number of time where we will not effectively migrate a
>> task because we don't ensure that we will take the overloaded group
>> if there is one. We can be in a situation where a group is
>> overloaded but the load_balance will select a not overloaded group
>> with an average load higher than sched_domain average value just
>> because it is checked after.
>
> Look at the first line of update_sd_pick_busiest()
>
> static bool update_sd_pick_busiest(struct lb_env *env,
> struct sd_lb_stats *sds,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {
> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;
>
> If the group load is less than the busiest, we have already
> returned false, and will not get to the code that I changed.
My point was that if a sched_group A with 1 task has a higher
avg_load than a sched_group B with 2 tasks, we will select
sched_group A whereas we should select the other group.
Furthermore, update_sd_lb_stats will always return a busiest group,
even an idle one. This will increase the number of failed load
balances and the time spent in them.
Vincent
>
>> Regarding your issue with "perf bench numa mem" that is not spread
>> on all nodes, SD_PREFER_SIBLING flag (of DIE level) should do the
>> job by reducing the capacity of "not local DIE" group at NUMA
>> level to 1 task during the load balance computation. So you should
>> have 1 task per sched_group at NUMA level.
>
> That did not actually happen in my tests. I almost always
> ended up having only 0 tasks on one node.
>
> --
> All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 14:29 ` Vincent Guittot
@ 2014-07-25 14:46 ` Rik van Riel
0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 14:46 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/25/2014 10:29 AM, Vincent Guittot wrote:
> On 25 July 2014 15:33, Rik van Riel <riel@redhat.com> wrote:
>> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>>>> On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
>>>>> Currently update_sd_pick_busiest only returns true when an
>>>>> sd is overloaded, or for SD_ASYM_PACKING when a domain is
>>>>> busier than average and a higher numbered domain than the
>>>>> target.
>>>>>
>>>>> This breaks load balancing between domains that are not
>>>>> overloaded, in the !SD_ASYM_PACKING case. This patch makes
>>>>> update_sd_pick_busiest return true when the busiest sd yet
>>>>> is encountered.
>>>>>
>>>>> On a 4 node system, this seems to result in the load
>>>>> balancer finally putting 1 thread of a 4 thread test run of
>>>>> "perf bench numa mem" on each node, where before the load
>>>>> was generally not spread across all nodes.
>>>>>
>>>>> Behaviour for SD_ASYM_PACKING does not seem to match the
>>>>> comment, in that groups with below average load average
>>>>> are ignored, but I have no hardware to test that so I have
>>>>> left the behaviour of that code unchanged.
>>>>>
>>>>> Cc: mikey@neuling.org
>>>>> Cc: peterz@infradead.org
>>>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>>>>> ---
>>>>>  kernel/sched/fair.c | 18 +++++++++++-------
>>>>>  1 file changed, 11 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index fea7d33..ff4ddba 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>  	 * numbered CPUs in the group, therefore mark all groups
>>>>>  	 * higher than ourself as busy.
>>>>>  	 */
>>>>> -	if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
>>>>> -	    env->dst_cpu < group_first_cpu(sg)) {
>>>>> -		if (!sds->busiest)
>>>>> -			return true;
>>>>> +	if (env->sd->flags & SD_ASYM_PACKING) {
>>>>> +		if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
>>>>> +			if (!sds->busiest)
>>>>> +				return true;
>>>>>
>>>>> -		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>>>>> -			return true;
>>>>> +			if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
>>>>> +				return true;
>>>>> +		}
>>>>> +
>>>>> +		return false;
>>>>>  	}
>>>>>
>>>>> -	return false;
>>>>> +	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
>>>>> +	return true;
>>>>
>>>> Hi Rik,
>>>>
>>>> I can see one issue with a default return set to true. You
>>>> increase the number of time where we will not effectively
>>>> migrate a task because we don't ensure that we will take the
>>>> overloaded group if there is one. We can be in a situation
>>>> where a group is overloaded but the load_balance will select
>>>> a not overloaded group with an average load higher than
>>>> sched_domain average value just because it is checked after.
>
> Look at the first line of update_sd_pick_busiest()
>
> static bool update_sd_pick_busiest(struct lb_env *env,
>				    struct sd_lb_stats *sds,
>				    struct sched_group *sg,
>				    struct sg_lb_stats *sgs)
> {
>	if (sgs->avg_load <= sds->busiest_stat.avg_load)
>		return false;
>
> If the group load is less than the busiest, we have already
> returned false, and will not get to the code that I changed.
>
>
>> My point was that if a sched_group A with 1 task has got a
>> higher avg_load than a sched_group with 2 tasks, we will select
>> sched_group A whereas we should select the other group
The code already does that, with or without my patch.
If it runs into group A first, that "return false" above will
be hit for group B.
>> Furthermore, update_sd_lb_stats will always return a busiest
>> group even an idle one. This will increase the number of failed
>> load balance and the time spent in the it.
If the busiest group found is idle, surely find_busiest_group
will see that and goto out_balanced ?
There are several safety checks in find_busiest_group to make
sure NULL is returned when the imbalance found is too small to
bother doing anything about.
--
All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 14:02 ` Rik van Riel
2014-07-25 14:15 ` Peter Zijlstra
@ 2014-07-25 15:02 ` Vincent Guittot
2014-07-25 15:13 ` Rik van Riel
1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-07-25 15:02 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 25 July 2014 16:02, Rik van Riel <riel@redhat.com> wrote:
>
> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>
>> Regarding your issue with "perf bench numa mem" that is not spread
>> on all nodes, SD_PREFER_SIBLING flag (of DIE level) should do the
>> job by reducing the capacity of "not local DIE" group at NUMA
>> level to 1 task during the load balance computation. So you should
>> have 1 task per sched_group at NUMA level.
>
> Looking at the code some more, it is clear why this does not
> happen. If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will
> never be set.
I don't have a lot of experience with NUMA systems and how their
sched_domain topology is described, but IIUC you don't have any
sched_domain levels other than the NUMA ones? Otherwise the flag
should be present in one of the non-NUMA levels (SMT, MC or DIE).
>
>
> On a related note, that part of the load balancing code probably
> needs to be rewritten to deal with unequal group_capacity_factors
> anyway.
>
> Say that one group has a group_capacity_factor twice that of
> another group.
>
> The group with the smaller group_capacity_factor is overloaded
> by a factor 1.3. The larger group is loaded by a factor 0.8.
> This means the larger group has a higher load than the first
> group, and the current code in update_sd_pick_busiest will
> not select the overloaded group as the busiest one, due to not
> scaling load with the capacity...
>
AFAICT, sgs->avg_load is weighted by the capacity in update_sg_lb_stats
> static bool update_sd_pick_busiest(struct lb_env *env,
> struct sd_lb_stats *sds,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {
> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;
>
> I believe we may need to factor the group_capacity_factor
> into this calculation, in order to properly identify which
> group is busiest.
>
> However, if we do that we may need to get rid of the
> SD_PREFER_SIBLING hack that forces group_capacity_factor
> to 1 on domains that have SD_PREFER_SIBLING set.
I'm working on a patchset that gets rid of capacity_factor (as
mentioned by Peter) and directly uses capacity instead. I should send
v4 next week.
Vincent
>
> I suspect that should be ok though, if we make sure
> update_sd_pick_busiest does the right thing...
>
> --
> All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 15:02 ` Vincent Guittot
@ 2014-07-25 15:13 ` Rik van Riel
0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 15:13 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/25/2014 11:02 AM, Vincent Guittot wrote:
> On 25 July 2014 16:02, Rik van Riel <riel@redhat.com> wrote:
>> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>
>>>> Regarding your issue with "perf bench numa mem" that is not
>>>> spread on all nodes, SD_PREFER_SIBLING flag (of DIE level)
>>>> should do the job by reducing the capacity of "not local
>>>> DIE" group at NUMA level to 1 task during the load balance
>>>> computation. So you should have 1 task per sched_group at
>>>> NUMA level.
>>
>> Looking at the code some more, it is clear why this does not
>> happen. If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will never
>> be set.
>
> I don't have a lot of experience on NUMA system and how their
> sched_domain topology is described but IIUC, you don't have other
> sched_domain level than NUMA ones ? otherwise the flag should be
> present in one of the non NUMA level (SMT, MC or DIE)
The system I am testing on has 3 or 4 sched_domain levels,
one for each HT sibling(?), one for each core, one for
each node/socket, and one parent domain for the whole system.
SD_PREFER_SIBLING should be set at the HT sibling level
and at the core level.
However, it is not set at the levels above that.
That means the SD_PREFER_SIBLING flag does its thing within
each CPU core and between cores on a socket, but not between
NUMA nodes...
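The flag placement described above can be sketched as a small table (names and fields here are purely illustrative, not the kernel's topology structures): SD_PREFER_SIBLING sits at the SMT and MC levels, and is never set at levels that carry SD_NUMA, which is why it cannot spread load between nodes.

```c
#include <assert.h>
#include <string.h>

/* Illustrative model of the 3-4 level hierarchy described in the mail. */
struct level {
	const char *name;
	int sd_prefer_sibling;  /* spreads tasks across sibling groups */
	int sd_numa;            /* NUMA balancing level */
};

static const struct level topo[] = {
	{ "SMT",  1, 0 },  /* HT siblings within a core   */
	{ "MC",   1, 0 },  /* cores within a socket       */
	{ "NUMA", 0, 1 },  /* sockets/nodes               */
	{ "NUMA", 0, 1 },  /* parent domain, whole system */
};
```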
>> On a related note, that part of the load balancing code probably
>> needs to be rewritten to deal with unequal
>> group_capacity_factors anyway.
> AFAICT, sgs->avg_load is weighted by the capacity in
> update_sg_lb_stats
Indeed, I dug into that code after sending the email, and found
that piece of the code just before I read Peter's email pointing
it out to me.
> I'm working on a patchset that get ride of capacity_factor (as
> mentioned by Peter) and directly uses capacity instead. I should
> send the v4 next week.
I am looking forward to anything that will make this code easier
to follow :)
--
All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-22 18:45 [PATCH] sched: make update_sd_pick_busiest return true on a busier sd Rik van Riel
2014-07-23 7:41 ` Vincent Guittot
@ 2014-07-25 15:27 ` Peter Zijlstra
2014-07-25 15:45 ` Rik van Riel
2014-07-27 23:57 ` [PATCH] " Michael Neuling
1 sibling, 2 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-07-25 15:27 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen,
nicolas.pitre
On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote:
> Currently update_sd_pick_busiest only returns true when an sd
> is overloaded, or for SD_ASYM_PACKING when a domain is busier
> than average and a higher numbered domain than the target.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> On a 4 node system, this seems to result in the load balancer finally
> putting 1 thread of a 4 thread test run of "perf bench numa mem" on
> each node, where before the load was generally not spread across all
> nodes.
So for !ASYM the code is effectively:
return sgs->avg_load > sds->busiest_stat.avg_load;
I'd like to at least add a clause that makes overloaded groups
prioritized over !overloaded groups.
Also, like we found earlier, calculate_imbalance() relies on the
sum_nr_running > group_capacity_factor thing, which you've just
'wrecked', so we'd need an update to that part too.
> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> in that groups with below average load average are ignored, but I
> have no hardware to test that so I have left the behaviour of that
> code unchanged.
Mikey, does that stuff work as expected?
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 15:27 ` Peter Zijlstra
@ 2014-07-25 15:45 ` Rik van Riel
2014-07-25 16:05 ` Peter Zijlstra
2014-07-27 23:57 ` [PATCH] " Michael Neuling
1 sibling, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 15:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen,
nicolas.pitre
On 07/25/2014 11:27 AM, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote:
>> Currently update_sd_pick_busiest only returns true when an sd is
>> overloaded, or for SD_ASYM_PACKING when a domain is busier than
>> average and a higher numbered domain than the target.
>>
>> This breaks load balancing between domains that are not
>> overloaded, in the !SD_ASYM_PACKING case. This patch makes
>> update_sd_pick_busiest return true when the busiest sd yet is
>> encountered.
>>
>> On a 4 node system, this seems to result in the load balancer
>> finally putting 1 thread of a 4 thread test run of "perf bench
>> numa mem" on each node, where before the load was generally not
>> spread across all nodes.
>
> So for !ASYM the code is effectively:
>
> return sgs->avg_load > sds->busiest_stat.avg_load;
>
> I'd like to at least add a clause that makes overloaded groups
> prioritized over !overloaded groups.
>
> Also, like we found earlier, calculate_imbalance() relies on the
> sum_nr_running > group_capacity_factor thing, which you've just
> 'wrecked', so we'd need an update to that part too.
I guess that would mean update_sd_pick_busiest would look like
this for the !ASYM case:
1) remembering whether busiest is overloaded
2) if (sds->busiest_stat.overloaded && !sgs->overloaded)
return false;
3) if (sgs->avg_load > sds->busiest_stat.avg_load)
return true;
>> Behaviour for SD_ASYM_PACKING does not seem to match the
>> comment, in that groups with below average load average are
>> ignored, but I have no hardware to test that so I have left the
>> behaviour of that code unchanged.
>
> Mikey, does that stuff work as expected?
I suspect it does not, due to the checks above the SD_ASYM_PACKING
code occasionally overriding the SD_ASYM_PACKING code.
Also, the ASYM code may rely on CPU numbers not being interleaved
between groups, the "env->dst_cpu < group_first_cpu(sg)" check would
probably fail to pull all load onto group 0 if CPU numbers were
distributed like this:
group 0: 0 2 4 6
group 1: 1 3 5 7
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 15:45 ` Rik van Riel
@ 2014-07-25 16:05 ` Peter Zijlstra
2014-07-25 16:22 ` Rik van Riel
0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-07-25 16:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen,
nicolas.pitre
On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote:
> > So for !ASYM the code is effectively:
> >
> > return sgs->avg_load > sds->busiest_stat.avg_load;
> >
> > I'd like to at least add a clause that makes overloaded groups
> > prioritized over !overloaded groups.
> >
> > Also, like we found earlier, calculate_imbalance() relies on the
> > sum_nr_running > group_capacity_factor thing, which you've just
> > 'wrecked', so we'd need an update to that part too.
>
> I guess that would mean update_sd_pick_busiest would look like
> this for the !ASYM case:
>
> 1) remembering whether busiest is overloaded
>
> 2) if (sds->busiest_stat.overloaded && !sgs->overloaded)
> return false;
>
> 3) if (sgs->avg_load > sds->busiest_stat.avg_load)
> return true;
Right. So the situation I was worried about is where we have say 8 +nice
tasks in one group and 2 -nice in another. So the group with 2 tasks can
actually be 'heavier' but not fully utilized. In that case we want to
pull tasks from the lighter but overloaded group.
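A back-of-the-envelope illustration of that scenario, with per-task weights taken from the kernel's prio_to_weight[] table (nice +5 = 335, nice -5 = 3121); the 4-CPU group size and the helper itself are assumptions made for the sketch:

```c
/* Compare raw group load sums for two same-sized groups.  With equal
 * group capacity the capacity scaling in update_sg_lb_stats cancels,
 * so raw sums are enough to compare avg_load. */
static unsigned long group_load(unsigned int nr_tasks, unsigned long task_weight)
{
	return (unsigned long)nr_tasks * task_weight;
}
```

Here group_load(8, 335) = 2680 for the overloaded group of +nice tasks, while group_load(2, 3121) = 6242 for the group that leaves two of its four CPUs idle: ranking on avg_load alone would pull from the half-idle group.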
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 16:05 ` Peter Zijlstra
@ 2014-07-25 16:22 ` Rik van Riel
2014-07-25 17:57 ` Vincent Guittot
0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 16:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen,
nicolas.pitre, Vincent Guittot
On 07/25/2014 12:05 PM, Peter Zijlstra wrote:
> On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote:
>>> So for !ASYM the code is effectively:
>>>
>>> return sgs->avg_load > sds->busiest_stat.avg_load;
>>>
>>> I'd like to at least add a clause that makes overloaded groups
>>> prioritized over !overloaded groups.
>>>
>>> Also, like we found earlier, calculate_imbalance() relies on
>>> the sum_nr_running > group_capacity_factor thing, which you've
>>> just 'wrecked', so we'd need an update to that part too.
>>
>> I guess that would mean update_sd_pick_busiest would look like
>> this for the !ASYM case:
>>
>> 1) remembering whether busiest is overloaded
>>
>> 2) if (sds->busiest_stat.overloaded && !sgs->overloaded) return
>> false;
>>
>> 3) if (sgs->avg_load > sds->busiest_stat.avg_load) return true;
>
> Right. So the situation I was worried about is where we have say 8
> +nice tasks in one group and 2 -nice in another. So the group with
> 2 tasks can actually be 'heavier' but not fully utilized. In that
> case we want to pull tasks from the lighter but overloaded group.
I can certainly create a new patch that makes update_sd_pick_busiest
do just that.
Vincent, any objections to me fixing update_sd_pick_busiest now, or
will that conflict with your code cleanups?
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 16:22 ` Rik van Riel
@ 2014-07-25 17:57 ` Vincent Guittot
2014-07-25 19:32 ` [PATCH v2] " Rik van Riel
0 siblings, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-07-25 17:57 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 25 July 2014 18:22, Rik van Riel <riel@redhat.com> wrote:
>
> On 07/25/2014 12:05 PM, Peter Zijlstra wrote:
>> On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote:
>>>> So for !ASYM the code is effectively:
>>>>
>>>> return sgs->avg_load > sds->busiest_stat.avg_load;
>>>>
>>>> I'd like to at least add a clause that makes overloaded groups
>>>> prioritized over !overloaded groups.
>>>>
>>>> Also, like we found earlier, calculate_imbalance() relies on
>>>> the sum_nr_running > group_capacity_factor thing, which you've
>>>> just 'wrecked', so we'd need an update to that part too.
>>>
>>> I guess that would mean update_sd_pick_busiest would look like
>>> this for the !ASYM case:
>>>
>>> 1) remembering whether busiest is overloaded
>>>
>>> 2) if (sds->busiest_stat.overloaded && !sgs->overloaded) return
>>> false;
>>>
>>> 3) if (sgs->avg_load > sds->busiest_stat.avg_load) return true;
>>
>> Right. So the situation I was worried about is where we have say 8
>> +nice tasks in one group and 2 -nice in another. So the group with
>> 2 tasks can actually be 'heavier' but not fully utilized. In that
>> case we want to pull tasks from the lighter but overloaded group.
>
> I can certainly create a new patch that makes update_sd_pick_busiest
> do just that.
>
> Vincent, any objections to me fixing update_sd_pick_busiest now, or
> will that conflict with your code cleanups?
It will probably conflict with my patchset, which also modifies
update_sd_pick_busiest, but we should be able to manage the conflict.
So feel free to send a patch that fixes the behaviour described above,
and we will see how to handle the conflict when it occurs.
Vincent
* [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 17:57 ` Vincent Guittot
@ 2014-07-25 19:32 ` Rik van Riel
2014-07-28 8:23 ` Vincent Guittot
2014-07-28 14:30 ` Peter Zijlstra
0 siblings, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-25 19:32 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
Subject: sched: make update_sd_pick_busiest return true on a busier sd
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true when the busiest sd yet is encountered.
Behaviour for SD_ASYM_PACKING does not seem to match the comment,
but I have no hardware to test that so I have left the behaviour
of that code unchanged.
It is unclear what to do with the group_imb condition.
Should group_imb override a busier load? If so, should we fix
calculate_imbalance to return a sensible number when the "busiest"
node found has a below average load? We probably need to fix
calculate_imbalance anyway, to deal with an overloaded group that
happens to have a below average load...
Cc: mikey@neuling.org
Cc: peterz@infradead.org
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45943b2..c96044f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5949,6 +5949,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_has_free_capacity = 1;
}
+static bool group_overloaded(struct sg_lb_stats *sgs)
+{
+ return sgs->sum_nr_running > sgs->group_capacity_factor;
+}
+
/**
* update_sd_pick_busiest - return 1 on busiest group
* @env: The load balancing environment.
@@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
* @sgs: sched_group statistics
*
* Determine if @sg is a busier group than the previously selected
- * busiest group.
+ * busiest group.
*
* Return: %true if @sg is a busier group than the previously selected
* busiest group. %false otherwise.
@@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
+ if (group_overloaded(sgs) && !group_overloaded(&sds->busiest_stat))
+ return true;
+
if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_imb)
return true;
- if (sgs->group_imb)
+ /* This is the busiest node. */
+ if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
/*
@@ -5981,8 +5990,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* numbered CPUs in the group, therefore mark all groups
* higher than ourself as busy.
*/
- if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
- env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
if (!sds->busiest)
return true;
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 15:27 ` Peter Zijlstra
2014-07-25 15:45 ` Rik van Riel
@ 2014-07-27 23:57 ` Michael Neuling
1 sibling, 0 replies; 19+ messages in thread
From: Michael Neuling @ 2014-07-27 23:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rik van Riel, linux-kernel, mingo, pjt, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
On Fri, 2014-07-25 at 17:27 +0200, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote:
> > Currently update_sd_pick_busiest only returns true when an sd
> > is overloaded, or for SD_ASYM_PACKING when a domain is busier
> > than average and a higher numbered domain than the target.
> >
> > This breaks load balancing between domains that are not overloaded,
> > in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> > return true when the busiest sd yet is encountered.
> >
> > On a 4 node system, this seems to result in the load balancer finally
> > putting 1 thread of a 4 thread test run of "perf bench numa mem" on
> > each node, where before the load was generally not spread across all
> > nodes.
>
> So for !ASYM the code is effectively:
>
> return sgs->avg_load > sds->busiest_stat.avg_load;
>
> I'd like to at least add a clause that makes overloaded groups
> prioritized over !overloaded groups.
>
> Also, like we found earlier, calculate_imbalance() relies on the
> sum_nr_running > group_capacity_factor thing, which you've just
> 'wrecked', so we'd need an update to that part too.
>
> > Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> > in that groups with below average load average are ignored, but I
> > have no hardware to test that so I have left the behaviour of that
> > code unchanged.
>
> Mikey, does that stuff work as expected?
Sorry for the slow response. v1 and v2 both pass testing on POWER7. So
FWIW...
Acked-By: Michael Neuling <mikey@neuling.org>
Mikey
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 19:32 ` [PATCH v2] " Rik van Riel
@ 2014-07-28 8:23 ` Vincent Guittot
2014-07-28 15:04 ` Rik van Riel
2014-07-28 14:30 ` Peter Zijlstra
1 sibling, 1 reply; 19+ messages in thread
From: Vincent Guittot @ 2014-07-28 8:23 UTC (permalink / raw)
To: Rik van Riel
Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 25 July 2014 21:32, Rik van Riel <riel@redhat.com> wrote:
> Subject: sched: make update_sd_pick_busiest return true on a busier sd
>
> Currently update_sd_pick_busiest only identifies the busiest sd
> that is either overloaded, or has a group imbalance. When no
> sd is imbalanced or overloaded, the load balancer fails to find
> the busiest domain.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> but I have no hardware to test that so I have left the behaviour
> of that code unchanged.
>
> It is unclear what to do with the group_imb condition.
> Should group_imb override a busier load? If so, should we fix
IMHO, group_imb should have a lower priority than an overloaded
group, because it generates active migration, whereas using the
overloaded group could solve the imbalance with normal migration.
Then, AFAICT, we already have a special way to compute the imbalance
when group_imb is set.
> calculate_imbalance to return a sensible number when the "busiest"
> node found has a below average load? We probably need to fix
> calculate_imbalance anyway, to deal with an overloaded group that
> happens to have a below average load...
>
> Cc: mikey@neuling.org
> Cc: peterz@infradead.org
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 18 +++++++++++++-----
> 1 file changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 45943b2..c96044f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5949,6 +5949,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> sgs->group_has_free_capacity = 1;
> }
>
> +static bool group_overloaded(struct sg_lb_stats *sgs)
> +{
> + return sgs->sum_nr_running > sgs->group_capacity_factor;
> +}
> +
> /**
> * update_sd_pick_busiest - return 1 on busiest group
> * @env: The load balancing environment.
> @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * @sgs: sched_group statistics
> *
> * Determine if @sg is a busier group than the previously selected
> - * busiest group.
> + * busiest group.
> *
> * Return: %true if @sg is a busier group than the previously selected
> * busiest group. %false otherwise.
> @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {
> + if (group_overloaded(sgs) && !group_overloaded(&sds->busiest_stat))
The 1st time you run update_sd_pick_busiest, group_capacity_factor and
sum_nr_running of sds->busiest_stat are uninitialized.
> + return true;
> +
IIUC your new test sequence, you haven't solved the following use case:
group A has 3 tasks and is overloaded
group B has 2 tasks and is not overloaded, but its avg_load is higher
than group A's (either because of nice priority or because of average
runnable time).
The test of group A will return true because it is overloaded while
busiest_stat is still empty.
But, the test of group B will also return true because of
!(env->sd->flags & SD_ASYM_PACKING)
So at the end you are selecting the group B which is not overloaded
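The sequence Vincent describes can be reproduced with a small model of the v2 logic (the struct is trimmed to the fields the !SD_ASYM_PACKING path reads, and the load values are illustrative, not taken from the kernel):

```c
#include <stdbool.h>

/* Just the fields the v2 !SD_ASYM_PACKING path reads. */
struct sg_lb_stats {
	unsigned long avg_load;
	unsigned int sum_nr_running;
	unsigned int group_capacity_factor;
};

static bool overloaded(const struct sg_lb_stats *sgs)
{
	return sgs->sum_nr_running > sgs->group_capacity_factor;
}

/* v2 decision order for !SD_ASYM_PACKING, group_imb left out. */
static bool pick_busiest_v2(const struct sg_lb_stats *busiest,
			    const struct sg_lb_stats *sgs)
{
	if (overloaded(sgs) && !overloaded(busiest))
		return true;

	if (sgs->avg_load <= busiest->avg_load)
		return false;

	return true;	/* "This is the busiest node." */
}
```

With busiest already set to group A (3 tasks, capacity factor 2), a not-overloaded group B with higher avg_load falls through to the final return and displaces the overloaded A, exactly the problem described above.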
> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;
>
> - if (sgs->sum_nr_running > sgs->group_capacity_factor)
> + if (sgs->group_imb)
> return true;
>
> - if (sgs->group_imb)
> + /* This is the busiest node. */
> + if (!(env->sd->flags & SD_ASYM_PACKING))
> return true;
>
> /*
> @@ -5981,8 +5990,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> * numbered CPUs in the group, therefore mark all groups
> * higher than ourself as busy.
> */
> - if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
> - env->dst_cpu < group_first_cpu(sg)) {
> + if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
> if (!sds->busiest)
> return true;
>
>
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-25 19:32 ` [PATCH v2] " Rik van Riel
2014-07-28 8:23 ` Vincent Guittot
@ 2014-07-28 14:30 ` Peter Zijlstra
1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-07-28 14:30 UTC (permalink / raw)
To: Rik van Riel
Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On Fri, Jul 25, 2014 at 03:32:10PM -0400, Rik van Riel wrote:
> Subject: sched: make update_sd_pick_busiest return true on a busier sd
>
> Currently update_sd_pick_busiest only identifies the busiest sd
> that is either overloaded, or has a group imbalance. When no
> sd is imbalanced or overloaded, the load balancer fails to find
> the busiest domain.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> but I have no hardware to test that so I have left the behaviour
> of that code unchanged.
>
> It is unclear what to do with the group_imb condition.
> Should group_imb override a busier load? If so, should we fix
> calculate_imbalance to return a sensible number when the "busiest"
> node found has a below average load? We probably need to fix
> calculate_imbalance anyway, to deal with an overloaded group that
> happens to have a below average load...
I think we want overloaded > group_imb > other. So prefer overloaded
groups, imbalanced groups if no overloaded and anything else if no
overloaded and imbalanced thingies.
If you look at the comment near sg_imbalanced(), in that case we want to
move tasks from the first group to the second, even though the second
group would be the heaviest.
enum group_type {
	group_other = 0,
	group_imbalanced,
	group_overloaded,
};

static enum group_type group_classify(struct sg_lb_stats *sgs)
{
	if (sgs->sum_nr_running > sgs->group_capacity_factor)
		return group_overloaded;

	if (sgs->group_imb)
		return group_imbalanced;

	return group_other;
}
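A self-contained version of the classifier above, for experimentation; the struct here is a minimal stub with only the fields the classifier reads, not the kernel's full sg_lb_stats:

```c
#include <stdbool.h>

/* Minimal stub of sg_lb_stats -- only what group_classify() reads. */
struct sg_lb_stats {
	unsigned int sum_nr_running;
	unsigned int group_capacity_factor;
	bool group_imb;
};

/* Ordered so a plain "<" comparison ranks groups:
 * overloaded > imbalanced > other. */
enum group_type {
	group_other = 0,
	group_imbalanced,
	group_overloaded,
};

static enum group_type group_classify(const struct sg_lb_stats *sgs)
{
	if (sgs->sum_nr_running > sgs->group_capacity_factor)
		return group_overloaded;

	if (sgs->group_imb)
		return group_imbalanced;

	return group_other;
}
```

The enum ordering is what makes a simple numeric comparison between a candidate group's class and the current busiest's class a valid early-out.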
> /**
> * update_sd_pick_busiest - return 1 on busiest group
> * @env: The load balancing environment.
> @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> * @sgs: sched_group statistics
> *
> * Determine if @sg is a busier group than the previously selected
> - * busiest group.
> + * busiest group.
We really need that extra trailing whitespace, yes? ;-)
> * Return: %true if @sg is a busier group than the previously selected
> * busiest group. %false otherwise.
> @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> struct sched_group *sg,
> struct sg_lb_stats *sgs)
> {
	if (group_classify(sgs) < group_classify(&sds->busiest_stat))
		return false;
> if (sgs->avg_load <= sds->busiest_stat.avg_load)
> return false;
>
> + /* This is the busiest node. */
> + if (!(env->sd->flags & SD_ASYM_PACKING))
> return true;
>
> /*
We could replace sg_lb_stats::group_imb with the above and avoid the
endless recomputation I suppose.
Also, we still need a little change to calculate_imbalance() where we
assume sum_nr_running > group_capacity_factor.
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-28 8:23 ` Vincent Guittot
@ 2014-07-28 15:04 ` Rik van Riel
0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-07-28 15:04 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar,
Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/28/2014 04:23 AM, Vincent Guittot wrote:
> On 25 July 2014 21:32, Rik van Riel <riel@redhat.com> wrote:
>> Subject: sched: make update_sd_pick_busiest return true on a
>> busier sd
>>
>> Currently update_sd_pick_busiest only identifies the busiest sd
>> that is either overloaded, or has a group imbalance. When no sd
>> is imbalanced or overloaded, the load balancer fails to find the
>> busiest domain.
>>
>> This breaks load balancing between domains that are not
>> overloaded, in the !SD_ASYM_PACKING case. This patch makes
>> update_sd_pick_busiest return true when the busiest sd yet is
>> encountered.
>>
>> Behaviour for SD_ASYM_PACKING does not seem to match the
>> comment, but I have no hardware to test that so I have left the
>> behaviour of that code unchanged.
>>
>> It is unclear what to do with the group_imb condition. Should
>> group_imb override a busier load? If so, should we fix
>
> IMHO, group_imb should have a lower priority compared to
> overloaded group because it generates active migration whereas the
> use of overloaded group could solve the imbalance with normal
> migration Then, AFAICT, we already have a special way to compute
> imbalance when group_imb is set
>
>> calculate_imbalance to return a sensible number when the
>> "busiest" node found has a below average load? We probably need
>> to fix calculate_imbalance anyway, to deal with an overloaded
>> group that happens to have a below average load...
>>
>> Cc: mikey@neuling.org
>> Cc: peterz@infradead.org
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> ---
>> kernel/sched/fair.c | 18 +++++++++++++-----
>> 1 file changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 45943b2..c96044f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5949,6 +5949,11 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> sgs->group_has_free_capacity = 1;
>> }
>>
>> +static bool group_overloaded(struct sg_lb_stats *sgs)
>> +{
>> + return sgs->sum_nr_running > sgs->group_capacity_factor;
>> +}
>> +
>> /**
>> * update_sd_pick_busiest - return 1 on busiest group
>> * @env: The load balancing environment.
>> @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> * @sgs: sched_group statistics
>> *
>> * Determine if @sg is a busier group than the previously selected
>> - * busiest group.
>> + * busiest group.
>> *
>> * Return: %true if @sg is a busier group than the previously selected
>> * busiest group. %false otherwise.
>> @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> struct sched_group *sg,
>> struct sg_lb_stats *sgs)
>> {
>> + if (group_overloaded(sgs) && !group_overloaded(&sds->busiest_stat))
>
> The 1st time you run update_sd_pick_busiest, group_capacity_factor
> and sum_nr_running of sds->busiest_stat are uninitialized.
Good point, init_sd_lb_stats only zeroes out a few fields.
I will fix this in the next version.
>> + return true; +
>
> IIUC your new test sequence, you haven't solved the following use
> case:
Fixed in the next version, with Peter's idea.