* [PATCH 0/2] load balancing fixes
@ 2014-07-28 18:16 riel
2014-07-28 18:16 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance riel
2014-07-28 18:16 ` [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd riel
0 siblings, 2 replies; 21+ messages in thread
From: riel @ 2014-07-28 18:16 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true whenever a busier sd than any seen so far is encountered.
Groups are ranked in the order overloaded > imbalanced > other,
with higher ranked groups getting priority even when their load
is lower. This is necessary due to the possibility of unequal
capacities and cpumasks between domains within a sched group.
Calculate_imbalance knows how to deal with the situation where
a less loaded group is picked, but will only do so when
sgs->group_imb is set. This handling needs to be extended to
all situations where the busiest load is below the average load.
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-28 18:16 [PATCH 0/2] load balancing fixes riel
@ 2014-07-28 18:16 ` riel
2014-07-29 9:04 ` Vincent Guittot
2014-07-29 14:49 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance Peter Zijlstra
2014-07-28 18:16 ` [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd riel
1 sibling, 2 replies; 21+ messages in thread
From: riel @ 2014-07-28 18:16 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
From: Rik van Riel <riel@redhat.com>
There are several ways in which update_sd_pick_busiest can end up
picking an sd as "busiest" that has a below-average per-cpu load.
All of those could use the same correction that was previously only
applied when the selected group has a group imbalance.
Additionally, the load balancing code will balance out the load between
domains that are below their maximum capacity. This results in the
load_above_capacity calculation underflowing, creating a giant unsigned
number, which is then removed by the min() check below.
In situations where all the domains are overloaded, or where only the
busiest domain is overloaded, that code is also superfluous, since
the normal env->imbalance calculation will figure out how much to move.
Remove the load_above_capacity calculation.
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 33 ++++++++-------------------------
1 file changed, 8 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45943b2..a28bb3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6221,16 +6221,16 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
*/
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
- unsigned long max_pull, load_above_capacity = ~0UL;
struct sg_lb_stats *local, *busiest;
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (busiest->group_imb) {
+ if (busiest->avg_load <= sds->avg_load) {
/*
- * In the group_imb case we cannot rely on group-wide averages
- * to ensure cpu-load equilibrium, look at wider averages. XXX
+ * Busiest got picked because it is overloaded or imbalanced,
+ * but does not have an above-average load. Look at wider
+ * averages.
*/
busiest->load_per_task =
min(busiest->load_per_task, sds->avg_load);
@@ -6247,32 +6247,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return fix_small_imbalance(env, sds);
}
- if (!busiest->group_imb) {
- /*
- * Don't want to pull so many tasks that a group would go idle.
- * Except of course for the group_imb case, since then we might
- * have to drop below capacity to reach cpu-load equilibrium.
- */
- load_above_capacity =
- (busiest->sum_nr_running - busiest->group_capacity_factor);
-
- load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
- load_above_capacity /= busiest->group_capacity;
- }
-
/*
* We're trying to get all the cpus to the average_load, so we don't
* want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load. At the same time,
- * we also don't want to reduce the group load below the group capacity
- * (so that we can implement power-savings policies etc). Thus we look
- * for the minimum possible imbalance.
+ * reduce the max loaded cpu below the average load.
+ * The per-cpu avg_load values and the group capacity determine
+ * how much load to move to equalise the imbalance.
*/
- max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
-
- /* How much load to actually move to equalise the imbalance */
env->imbalance = min(
- max_pull * busiest->group_capacity,
+ (busiest->avg_load - sds->avg_load) * busiest->group_capacity,
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;
--
1.9.3
* [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-28 18:16 [PATCH 0/2] load balancing fixes riel
2014-07-28 18:16 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance riel
@ 2014-07-28 18:16 ` riel
2014-07-29 15:27 ` Peter Zijlstra
1 sibling, 1 reply; 21+ messages in thread
From: riel @ 2014-07-28 18:16 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
From: Rik van Riel <riel@redhat.com>
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true whenever a busier sd than any seen so far is encountered.
Groups are ranked in the order overloaded > imbalanced > other,
with higher ranked groups getting priority even when their load
is lower. This is necessary due to the possibility of unequal
capacities and cpumasks between domains within a sched group.
Behaviour for SD_ASYM_PACKING does not seem to match the comment,
but I have no hardware to test that so I have left the behaviour
of that code unchanged.
Enum for group classification suggested by Peter Zijlstra.
Cc: mikey@neuling.org
Cc: peterz@infradead.org
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
kernel/sched/fair.c | 34 ++++++++++++++++++++++++++++------
1 file changed, 28 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a28bb3b..4f5e3c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5610,6 +5610,8 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_capacity = 0UL,
.busiest_stat = {
.avg_load = 0UL,
+ .sum_nr_running = 0,
+ .group_imb = 0,
},
};
}
@@ -5949,6 +5951,23 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_has_free_capacity = 1;
}
+enum group_type {
+ group_other = 0,
+ group_imbalanced,
+ group_overloaded,
+};
+
+static enum group_type group_classify(struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ return group_overloaded;
+
+ if (sgs->group_imb)
+ return group_imbalanced;
+
+ return group_other;
+}
+
/**
* update_sd_pick_busiest - return 1 on busiest group
* @env: The load balancing environment.
@@ -5967,13 +5986,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->busiest_stat.avg_load)
+ if (group_classify(sgs) > group_classify(&sds->busiest_stat))
+ return true;
+
+ if (group_classify(sgs) < group_classify(&sds->busiest_stat))
return false;
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
- return true;
+ if (sgs->avg_load <= sds->busiest_stat.avg_load)
+ return false;
- if (sgs->group_imb)
+ /* This is the busiest node in its class. */
+ if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
/*
@@ -5981,8 +6004,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* numbered CPUs in the group, therefore mark all groups
* higher than ourself as busy.
*/
- if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
- env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
if (!sds->busiest)
return true;
--
1.9.3
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-28 18:16 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance riel
@ 2014-07-29 9:04 ` Vincent Guittot
2014-07-29 14:53 ` Rik van Riel
2014-07-29 14:59 ` Peter Zijlstra
2014-07-29 14:49 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance Peter Zijlstra
1 sibling, 2 replies; 21+ messages in thread
From: Vincent Guittot @ 2014-07-29 9:04 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 28 July 2014 20:16, <riel@redhat.com> wrote:
> From: Rik van Riel <riel@redhat.com>
>
> There are several ways in which update_sd_pick_busiest can end up
> picking an sd as "busiest" that has a below-average per-cpu load.
>
> All of those could use the same correction that was previously only
> applied when the selected group has a group imbalance.
>
> Additionally, the load balancing code will balance out the load between
> domains that are below their maximum capacity. This results in the
> load_above_capacity calculation underflowing, creating a giant unsigned
> number, which is then removed by the min() check below.
The load_above_capacity can't underflow with the current version. The
underflow that you mention above could occur with the change you are
doing in patch 2, which can select a group that is neither overloaded
nor imbalanced.
>
> In situations where all the domains are overloaded, or where only the
> busiest domain is overloaded, that code is also superfluous, since
> the normal env->imbalance calculation will figure out how much to move.
> Remove the load_above_capacity calculation.
IMHO, we should not remove that part, which is used by prefer_sibling.
Originally, we had two types of busiest group: overloaded or imbalanced.
You add a new one, which only has an avg_load higher than the others, so
you should handle this new case and keep the other ones unchanged.
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> ---
> kernel/sched/fair.c | 33 ++++++++-------------------------
> 1 file changed, 8 insertions(+), 25 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 45943b2..a28bb3b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6221,16 +6221,16 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> */
> static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - unsigned long max_pull, load_above_capacity = ~0UL;
> struct sg_lb_stats *local, *busiest;
>
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> - if (busiest->group_imb) {
> + if (busiest->avg_load <= sds->avg_load) {
The busiest->avg_load <= sds->avg_load case is already handled in the
fix_small_imbalance function; you should probably handle it there instead.
> /*
> - * In the group_imb case we cannot rely on group-wide averages
> - * to ensure cpu-load equilibrium, look at wider averages. XXX
> + * Busiest got picked because it is overloaded or imbalanced,
> + * but does not have an above-average load. Look at wider
> + * averages.
> */
> busiest->load_per_task =
> min(busiest->load_per_task, sds->avg_load);
> @@ -6247,32 +6247,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return fix_small_imbalance(env, sds);
> }
>
> - if (!busiest->group_imb) {
> - /*
> - * Don't want to pull so many tasks that a group would go idle.
> - * Except of course for the group_imb case, since then we might
> - * have to drop below capacity to reach cpu-load equilibrium.
> - */
> - load_above_capacity =
> - (busiest->sum_nr_running - busiest->group_capacity_factor);
> -
> - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
> - load_above_capacity /= busiest->group_capacity;
> - }
> -
> /*
> * We're trying to get all the cpus to the average_load, so we don't
> * want to push ourselves above the average load, nor do we wish to
> - * reduce the max loaded cpu below the average load. At the same time,
> - * we also don't want to reduce the group load below the group capacity
> - * (so that we can implement power-savings policies etc). Thus we look
> - * for the minimum possible imbalance.
> + * reduce the max loaded cpu below the average load.
> + * The per-cpu avg_load values and the group capacity determine
> + * how much load to move to equalise the imbalance.
> */
> - max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
> -
> - /* How much load to actually move to equalise the imbalance */
> env->imbalance = min(
> - max_pull * busiest->group_capacity,
> + (busiest->avg_load - sds->avg_load) * busiest->group_capacity,
> (sds->avg_load - local->avg_load) * local->group_capacity
> ) / SCHED_CAPACITY_SCALE;
>
> --
> 1.9.3
>
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-28 18:16 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance riel
2014-07-29 9:04 ` Vincent Guittot
@ 2014-07-29 14:49 ` Peter Zijlstra
2014-07-29 14:53 ` Peter Zijlstra
1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 14:49 UTC (permalink / raw)
To: riel
Cc: linux-kernel, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
On Mon, Jul 28, 2014 at 02:16:27PM -0400, riel@redhat.com wrote:
> @@ -6221,16 +6221,16 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> */
> static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
> {
> - unsigned long max_pull, load_above_capacity = ~0UL;
> struct sg_lb_stats *local, *busiest;
>
> local = &sds->local_stat;
> busiest = &sds->busiest_stat;
>
> - if (busiest->group_imb) {
> + if (busiest->avg_load <= sds->avg_load) {
> /*
> - * In the group_imb case we cannot rely on group-wide averages
> - * to ensure cpu-load equilibrium, look at wider averages. XXX
> + * Busiest got picked because it is overloaded or imbalanced,
> + * but does not have an above-average load. Look at wider
> + * averages.
> */
> busiest->load_per_task =
> min(busiest->load_per_task, sds->avg_load);
I don't think that's right, this code is really for imbalance only,
although I'm now wondering why (again)..
So currently the only other case is overloaded (since, as you noticed,
we don't balance for !overloaded) and that explicitly doesn't use it. So
making the overloaded case use this doesn't make sense.
> @@ -6247,32 +6247,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> return fix_small_imbalance(env, sds);
> }
>
> - if (!busiest->group_imb) {
> - /*
> - * Don't want to pull so many tasks that a group would go idle.
> - * Except of course for the group_imb case, since then we might
> - * have to drop below capacity to reach cpu-load equilibrium.
> - */
> - load_above_capacity =
> - (busiest->sum_nr_running - busiest->group_capacity_factor);
> -
> - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
> - load_above_capacity /= busiest->group_capacity;
> - }
I think we want to retain that, esp. for the overloaded case. So that
wants to be:
if (busiest->sum_nr_running > busiest->group_capacity_factor)
Clearly it doesn't make sense for the !overload case, and we explicitly
want to avoid it in the imb case.
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 9:04 ` Vincent Guittot
@ 2014-07-29 14:53 ` Rik van Riel
2014-07-29 15:31 ` Vincent Guittot
2014-07-29 14:59 ` Peter Zijlstra
1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-07-29 14:53 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/29/2014 05:04 AM, Vincent Guittot wrote:
> On 28 July 2014 20:16, <riel@redhat.com> wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
>> There are several ways in which update_sd_pick_busiest can end
>> up picking an sd as "busiest" that has a below-average per-cpu
>> load.
>>
>> All of those could use the same correction that was previously
>> only applied when the selected group has a group imbalance.
>>
>> Additionally, the load balancing code will balance out the load
>> between domains that are below their maximum capacity. This
>> results in the load_above_capacity calculation underflowing,
>> creating a giant unsigned number, which is then removed by the
>> min() check below.
>
> The load_above_capacity can't underflow with the current version. The
> underflow that you mention above could occur with the change you are
> doing in patch 2, which can select a group that is neither overloaded
> nor imbalanced.
With SD_ASYM_PACKING the current code can already hit the underflow.
You are right though that it does not become common until the second
patch has been applied.
>> In situations where all the domains are overloaded, or where only
>> the busiest domain is overloaded, that code is also superfluous,
>> since the normal env->imbalance calculation will figure out how
>> much to move. Remove the load_above_capacity calculation.
>
> IMHO, we should not remove that part, which is used by
> prefer_sibling.
>
> Originally, we had two types of busiest group: overloaded or
> imbalanced. You add a new one, which only has an avg_load higher
> than the others, so you should handle this new case and keep the
> other ones unchanged.
The "overloaded" case will simply degenerate into the first case,
if we move enough load to make the domain no longer overloaded,
but still above average.
In the case where the value calculated by the "overloaded" calculation
is different from the above-average, the old code took the minimum of
the two as how much to move.
The new case you ask for would simply take the other part of that
difference, once a domain is no longer overloaded.
I cannot think of any case where keeping the "overloaded" code would
result in the code behaving differently over the long term.
What am I overlooking?
>> Signed-off-by: Rik van Riel <riel@redhat.com> ---
>> kernel/sched/fair.c | 33 ++++++++------------------------- 1 file
>> changed, 8 insertions(+), 25 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index
>> 45943b2..a28bb3b 100644 --- a/kernel/sched/fair.c +++
>> b/kernel/sched/fair.c @@ -6221,16 +6221,16 @@ void
>> fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>> */ static inline void calculate_imbalance(struct lb_env *env,
>> struct sd_lb_stats *sds) { - unsigned long max_pull,
>> load_above_capacity = ~0UL; struct sg_lb_stats *local, *busiest;
>>
>> local = &sds->local_stat; busiest = &sds->busiest_stat;
>>
>> - if (busiest->group_imb) { + if (busiest->avg_load
>> <= sds->avg_load) {
>
> busiest->avg_load <= sds->avg_load is already handled in the
> fix_small_imbalance function, you should probably handle that here
OK, I will move that code into fix_small_imbalance()
--
All rights reversed
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 14:49 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance Peter Zijlstra
@ 2014-07-29 14:53 ` Peter Zijlstra
2014-07-29 15:26 ` Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 14:53 UTC (permalink / raw)
To: riel
Cc: linux-kernel, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
On Tue, Jul 29, 2014 at 04:49:52PM +0200, Peter Zijlstra wrote:
> > @@ -6247,32 +6247,15 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > return fix_small_imbalance(env, sds);
> > }
> >
> > - if (!busiest->group_imb) {
> > - /*
> > - * Don't want to pull so many tasks that a group would go idle.
> > - * Except of course for the group_imb case, since then we might
> > - * have to drop below capacity to reach cpu-load equilibrium.
> > - */
> > - load_above_capacity =
> > - (busiest->sum_nr_running - busiest->group_capacity_factor);
> > -
> > - load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
> > - load_above_capacity /= busiest->group_capacity;
> > - }
>
> I think we want to retain that, esp. for the overloaded case. So that
> wants to be:
>
> if (busiest->sum_nr_running > busiest->group_capacity_factor)
>
> Clearly it doesn't make sense for the !overload case, and we explicitly
> want to avoid it in the imb case.
Ah, wait, I think I see why you want that gone. I was only expecting a
correction fix wrt changing pick_busiest(), not also behaviour changes.
Lemme reconsider.
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 9:04 ` Vincent Guittot
2014-07-29 14:53 ` Rik van Riel
@ 2014-07-29 14:59 ` Peter Zijlstra
2014-07-29 15:15 ` Rik van Riel
2014-07-29 15:27 ` Peter Zijlstra
1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 14:59 UTC (permalink / raw)
To: Vincent Guittot
Cc: Rik van Riel, linux-kernel, Michael Neuling, Ingo Molnar, jhladky,
ktkhai, tim.c.chen, Nicolas Pitre
On Tue, Jul 29, 2014 at 11:04:50AM +0200, Vincent Guittot wrote:
> > In situations where all the domains are overloaded, or where only the
> > busiest domain is overloaded, that code is also superfluous, since
> > the normal env->imbalance calculation will figure out how much to move.
> > Remove the load_above_capacity calculation.
>
> IMHO, we should not remove that part which is used by prefer_sibling
>
> Originally, we had 2 type of busiest group: overloaded or imbalanced.
> You add a new one which has only a avg_load higher than other so you
> should handle this new case and keep the other ones unchanged
Right, so we want that code for overloaded -> overloaded migrations so
as not to cause idle cpus in an attempt to balance things. Idle cpus are
worse than imbalance.
But in case of overloaded/imb -> !overloaded migrations we can allow it,
and in fact want to allow it in order to balance idle cpus.
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 14:59 ` Peter Zijlstra
@ 2014-07-29 15:15 ` Rik van Riel
2014-07-29 15:49 ` Peter Zijlstra
2014-07-29 15:27 ` Peter Zijlstra
1 sibling, 1 reply; 21+ messages in thread
From: Rik van Riel @ 2014-07-29 15:15 UTC (permalink / raw)
To: Peter Zijlstra, Vincent Guittot
Cc: linux-kernel, Michael Neuling, Ingo Molnar, jhladky, ktkhai,
tim.c.chen, Nicolas Pitre
On 07/29/2014 10:59 AM, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 11:04:50AM +0200, Vincent Guittot wrote:
>>> In situations where all the domains are overloaded, or where
>>> only the busiest domain is overloaded, that code is also
>>> superfluous, since the normal env->imbalance calculation will
>>> figure out how much to move. Remove the load_above_capacity
>>> calculation.
>>
>> IMHO, we should not remove that part which is used by
>> prefer_sibling
>>
>> Originally, we had 2 type of busiest group: overloaded or
>> imbalanced. You add a new one which has only a avg_load higher
>> than other so you should handle this new case and keep the other
>> ones unchanged
>
> Right, so we want that code for overloaded -> overloaded migrations
> so as not to cause idle cpus in an attempt to balance things.
> Idle cpus are worse than imbalance.
>
> But in case of overloaded/imb -> !overloaded migrations we can
> allow it, and in fact want to allow it in order to balance idle
> cpus.
In case the destination is over the average load, or the source is under
the average load, fix_small_imbalance() determines env->imbalance.
The "load_above_capacity" calculation is only reached when busiest is
busier than average, and the destination is under the average load.
In that case, env->imbalance ends up as the minimum of busiest - avg
and avg - target.
Is there any case where limiting it further to "load - capacity" from
the busiest domain makes a difference?
--
All rights reversed
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 14:53 ` Peter Zijlstra
@ 2014-07-29 15:26 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Make calculate_imbalance() independent tip-bot for Peter Zijlstra
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 15:26 UTC (permalink / raw)
To: riel
Cc: linux-kernel, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
On Tue, Jul 29, 2014 at 04:53:08PM +0200, Peter Zijlstra wrote:
> Ah, wait, I think I see why you want that gone. I was only expecting a
> correction fix wrt changing pick_busiest(), not also behaviour changes.
---
Subject: sched,fair: Make calculate_imbalance() independent
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Jul 29 17:00:21 CEST 2014
Rik noticed that calculate_imbalance() relies on
update_sd_pick_busiest() to guarantee that busiest->sum_nr_running >
busiest->group_capacity_factor.
Break this implicit assumption (with the intent of not providing it
anymore) by having calculate_imbalance() verify it and not rely on
others.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6248,7 +6248,7 @@ static inline void calculate_imbalance(s
return fix_small_imbalance(env, sds);
}
- if (!busiest->group_imb) {
+ if (busiest->sum_nr_running > busiest->group_capacity_factor) {
/*
* Don't want to pull so many tasks that a group would go idle.
* Except of course for the group_imb case, since then we might
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 14:59 ` Peter Zijlstra
2014-07-29 15:15 ` Rik van Riel
@ 2014-07-29 15:27 ` Peter Zijlstra
2014-07-30 9:32 ` Vincent Guittot
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Allow calculate_imbalance() to move idle cpus tip-bot for Peter Zijlstra
1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 15:27 UTC (permalink / raw)
To: Vincent Guittot
Cc: Rik van Riel, linux-kernel, Michael Neuling, Ingo Molnar, jhladky,
ktkhai, tim.c.chen, Nicolas Pitre
On Tue, Jul 29, 2014 at 04:59:10PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 11:04:50AM +0200, Vincent Guittot wrote:
> > > In situations where all the domains are overloaded, or where only the
> > > busiest domain is overloaded, that code is also superfluous, since
> > > the normal env->imbalance calculation will figure out how much to move.
> > > Remove the load_above_capacity calculation.
> >
> > IMHO, we should not remove that part which is used by prefer_sibling
> >
> > Originally, we had 2 type of busiest group: overloaded or imbalanced.
> > You add a new one which has only a avg_load higher than other so you
> > should handle this new case and keep the other ones unchanged
>
> Right, so we want that code for overloaded -> overloaded migrations so
> as not to cause idle cpus in an attempt to balance things. Idle cpus are
> worse than imbalance.
>
> But in case of overloaded/imb -> !overloaded migrations we can allow it,
> and in fact want to allow it in order to balance idle cpus.
Which would be patch 3/2
---
Subject: sched,fair: Allow calculate_imbalance() to move idle cpus
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Jul 29 17:15:11 CEST 2014
Allow calculate_imbalance() to 'create' idle cpus in the busiest group
if there are idle cpus in the local group.
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-7k95k4i2tjv78iivstggiude@git.kernel.org
---
kernel/sched/fair.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6273,12 +6273,11 @@ static inline void calculate_imbalance(s
return fix_small_imbalance(env, sds);
}
- if (busiest->group_type == group_overloaded) {
- /*
- * Don't want to pull so many tasks that a group would go idle.
- * Except of course for the group_imb case, since then we might
- * have to drop below capacity to reach cpu-load equilibrium.
- */
+ /*
+ * If there aren't any idle cpus, avoid creating some.
+ */
+ if (busiest->group_type == group_overloaded &&
+ local->group_type == group_overloaded) {
load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity_factor);
* Re: [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd
2014-07-28 18:16 ` [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd riel
@ 2014-07-29 15:27 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Make update_sd_pick_busiest() return 'true' " tip-bot for Rik van Riel
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 15:27 UTC (permalink / raw)
To: riel
Cc: linux-kernel, vincent.guittot, mikey, mingo, jhladky, ktkhai,
tim.c.chen, nicolas.pitre
Made that:
---
Subject: sched,fair: Make update_sd_pick_busiest return true on a busier sd
From: Rik van Riel <riel@redhat.com>
Date: Mon, 28 Jul 2014 14:16:28 -0400
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true whenever a busier sd than any seen so far is encountered.
Groups are ranked in the order overloaded > imbalanced > other,
with higher ranked groups getting priority even when their load
is lower. This is necessary due to the possibility of unequal
capacities and cpumasks between domains within a sched group.
Behaviour for SD_ASYM_PACKING does not seem to match the comment,
but I have no hardware to test that so I have left the behaviour
of that code unchanged.
Enum for group classification suggested by Peter Zijlstra.
Cc: ktkhai@parallels.com
Cc: tim.c.chen@linux.intel.com
Cc: nicolas.pitre@linaro.org
Cc: vincent.guittot@linaro.org
Cc: mingo@kernel.org
Cc: jhladky@redhat.com
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
[peterz: replaced sg_lb_stats::group_imb with the new enum group_type
in an attempt to avoid endless recalculation]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1406571388-3227-3-git-send-email-riel@redhat.com
---
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 12 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5559,6 +5559,13 @@ static unsigned long task_h_load(struct
#endif
/********** Helpers for find_busiest_group ************************/
+
+enum group_type {
+ group_other = 0,
+ group_imbalanced,
+ group_overloaded,
+};
+
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -5572,7 +5579,7 @@ struct sg_lb_stats {
unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
+ enum group_type group_type;
int group_has_free_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -5610,6 +5617,8 @@ static inline void init_sd_lb_stats(stru
.total_capacity = 0UL,
.busiest_stat = {
.avg_load = 0UL,
+ .sum_nr_running = 0,
+ .group_type = group_other,
},
};
}
@@ -5891,6 +5900,18 @@ static inline int sg_capacity_factor(str
return capacity_factor;
}
+static enum group_type
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ return group_overloaded;
+
+ if (sg_imbalanced(group))
+ return group_imbalanced;
+
+ return group_other;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -5942,9 +5963,8 @@ static inline void update_sg_lb_stats(st
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight;
-
- sgs->group_imb = sg_imbalanced(group);
sgs->group_capacity_factor = sg_capacity_factor(env, group);
+ sgs->group_type = group_classify(group, sgs);
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
@@ -5968,13 +5988,19 @@ static bool update_sd_pick_busiest(struc
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->busiest_stat.avg_load)
- return false;
+ struct sg_lb_stats *busiest = &sds->busiest_stat;
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_type > busiest->group_type)
return true;
- if (sgs->group_imb)
+ if (sgs->group_type < busiest->group_type)
+ return false;
+
+ if (sgs->avg_load <= busiest->avg_load)
+ return false;
+
+ /* This is the busiest node in its class. */
+ if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
/*
@@ -5982,8 +6008,7 @@ static bool update_sd_pick_busiest(struc
* numbered CPUs in the group, therefore mark all groups
* higher than ourself as busy.
*/
- if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
- env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
if (!sds->busiest)
return true;
@@ -6228,7 +6253,7 @@ static inline void calculate_imbalance(s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (busiest->group_imb) {
+ if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
* to ensure cpu-load equilibrium, look at wider averages. XXX
@@ -6248,7 +6273,7 @@ static inline void calculate_imbalance(s
return fix_small_imbalance(env, sds);
}
- if (busiest->sum_nr_running > busiest->group_capacity_factor) {
+ if (busiest->group_type == group_overloaded) {
/*
* Don't want to pull so many tasks that a group would go idle.
* Except of course for the group_imb case, since then we might
@@ -6337,7 +6362,7 @@ static struct sched_group *find_busiest_
* work because they assume all things are equal, which typically
* isn't true due to cpus_allowed constraints and the like.
*/
- if (busiest->group_imb)
+ if (busiest->group_type == group_imbalanced)
goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
^ permalink raw reply [flat|nested] 21+ messages in thread
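[Editorial note: the new selection order in update_sd_pick_busiest() above can be sketched as a standalone comparison. This is a toy model with illustrative names, not the kernel function; the SD_ASYM_PACKING path is omitted: group class (overloaded > imbalanced > other) is compared first, and avg_load only breaks ties within the same class.]

```c
#include <assert.h>
#include <stdbool.h>

enum group_type { group_other = 0, group_imbalanced, group_overloaded };

struct stats {
	enum group_type type;
	unsigned long avg_load;
};

/* Return true when the candidate group should replace the current
 * busiest: higher class always wins, load decides within a class. */
static bool pick_busiest(const struct stats *cand,
			 const struct stats *busiest)
{
	if (cand->type > busiest->type)
		return true;
	if (cand->type < busiest->type)
		return false;
	return cand->avg_load > busiest->avg_load;
}
```

Note how an overloaded group beats a merely busy one even at lower avg_load, which is exactly the "priority even when their load is lower" behaviour described in the changelog.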
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 14:53 ` Rik van Riel
@ 2014-07-29 15:31 ` Vincent Guittot
2014-07-29 15:39 ` Rik van Riel
0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2014-07-29 15:31 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 29 July 2014 16:53, Rik van Riel <riel@redhat.com> wrote:
>
> On 07/29/2014 05:04 AM, Vincent Guittot wrote:
>> On 28 July 2014 20:16, <riel@redhat.com> wrote:
>>> From: Rik van Riel <riel@redhat.com>
>>>
>>> There are several ways in which update_sd_pick_busiest can end
>>> up picking an sd as "busiest" that has a below-average per-cpu
>>> load.
>>>
>>> All of those could use the same correction that was previously
>>> only applied when the selected group has a group imbalance.
>>>
>>> Additionally, the load balancing code will balance out the load
>>> between domains that are below their maximum capacity. This
>>> results in the load_above_capacity calculation underflowing,
>>> creating a giant unsigned number, which is then removed by the
>>> min() check below.
>>
>> The load_above_capacity can't underflow with the current version. The
>> underflow that you mention above could occur with the change you are
>> doing in patch 2, which can select a group that is neither overloaded
>> nor imbalanced.
>
> With SD_ASYM_PACKING the current code can already hit the underflow.
I don't think so, because AFAICT SD_ASYM_PACKING is used at the sibling
level, where group_capacity_factor will be at most 1
and sum_nr_running is > 1 in calculate_imbalance
>
> You are right though that it does not become common until the second
> patch has been applied.
>
>>> In situations where all the domains are overloaded, or where only
>>> the busiest domain is overloaded, that code is also superfluous,
>>> since the normal env->imbalance calculation will figure out how
>>> much to move. Remove the load_above_capacity calculation.
>>
>> IMHO, we should not remove that part, which is used by
>> prefer_sibling.
>>
>> Originally, we had 2 types of busiest group: overloaded or
>> imbalanced. You add a new one which only has an avg_load higher than
>> the others, so you should handle this new case and keep the other
>> ones unchanged.
>
> The "overloaded" case will simply degenerate into the first case,
> if we move enough load to make the domain no longer overloaded,
> but still above average.
>
> In the case where the value calculated by the "overloaded" calculation
> is different from the above-average, the old code took the minimum of
> the two as how much to move.
>
> The new case you ask for would simply take the other part of that
> difference, once a domain is no longer overloaded.
>
> I cannot think of any case where keeping the "overloaded" code would
> result in the code behaving differently over the long term.
>
> What am I overlooking?
IIUC, load_above_capacity is there to prevent the busiest group from
becoming idle, and you remove that protection.
Vincent
>
>>> Signed-off-by: Rik van Riel <riel@redhat.com>
>>> ---
>>>  kernel/sched/fair.c | 33 ++++++++-------------------------
>>>  1 file changed, 8 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 45943b2..a28bb3b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6221,16 +6221,16 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>>>   */
>>>  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
>>>  {
>>> -	unsigned long max_pull, load_above_capacity = ~0UL;
>>>  	struct sg_lb_stats *local, *busiest;
>>>
>>>  	local = &sds->local_stat;
>>>  	busiest = &sds->busiest_stat;
>>>
>>> -	if (busiest->group_imb) {
>>> +	if (busiest->avg_load <= sds->avg_load) {
>>
>> busiest->avg_load <= sds->avg_load is already handled in the
>> fix_small_imbalance function, you should probably handle that here
>
> OK, I will move that code into fix_small_imbalance()
>
> --
> All rights reversed
^ permalink raw reply [flat|nested] 21+ messages in thread
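[Editorial note: the underflow discussed above is easy to reproduce in isolation. Both fields are unsigned in the kernel, so when the busiest group is not overloaded (sum_nr_running <= group_capacity_factor) the subtraction wraps to a huge value, which the later min() against the average-based imbalance silently discards. Hypothetical standalone helper, not kernel code:]

```c
#include <assert.h>

/* Unsigned subtraction as in the load_above_capacity calculation:
 * wraps around instead of going negative. */
static unsigned long load_above_capacity(unsigned int sum_nr_running,
					 unsigned int group_capacity_factor)
{
	return (unsigned long)sum_nr_running - group_capacity_factor;
}
```

For an overloaded group (e.g. 6 tasks, capacity factor 4) this yields a sensible 2; for a non-overloaded group (2 tasks, factor 4) it wraps to near ULONG_MAX, relying on the min() below to throw the bogus value away.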
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 15:31 ` Vincent Guittot
@ 2014-07-29 15:39 ` Rik van Riel
0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2014-07-29 15:39 UTC (permalink / raw)
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/29/2014 11:31 AM, Vincent Guittot wrote:
> On 29 July 2014 16:53, Rik van Riel <riel@redhat.com> wrote:
>> I cannot think of any case where keeping the "overloaded" code
>> would result in the code behaving differently over the long
>> term.
>
>> What am I overlooking?
>
> IIUC, load_above_capacity is there to prevent the busiest group from
> becoming idle, and you remove that protection.
Having the busiest group become partially idle is perfectly fine,
as long as there is also spare capacity in the destination group.
Spreading out the work load over more CPU sockets, with more CPU
cache and more total available memory bandwidth is often a good
thing.
--
All rights reversed
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 15:15 ` Rik van Riel
@ 2014-07-29 15:49 ` Peter Zijlstra
2014-07-29 17:04 ` Rik van Riel
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-29 15:49 UTC (permalink / raw)
To: Rik van Riel
Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On Tue, Jul 29, 2014 at 11:15:49AM -0400, Rik van Riel wrote:
> > Right, so we want that code for overloaded -> overloaded migrations
> > so as not to cause idle cpus in an attempt to balance things.
> > Idle cpus are worse than imbalance.
> >
> > But in case of overloaded/imb -> !overloaded migrations we can
> > allow it, and in fact want to allow it in order to balance idle
> > cpus.
>
> In case the destination is over the average load, or the source is under
> the average load, fix_small_imbalance() determines env->imbalance.
>
> The "load_above_capacity" calculation is only reached when busiest is
> busier than average, and the destination is under the average load.
> In that case, env->imbalance ends up as the minimum of busiest - avg
> and avg - target.
>
> Is there any case where limiting it further to "load - capacity" from
> the busiest domain makes a difference?
sadly yes; suppose 8 cpus in 2 groups and 9 tasks, 8 tasks of weight 10,
1 of 1024. The local group will have 5 tasks of 10, the busiest will
have the remaining 4.
The sd avg is 138, local avg is 12, busiest avg is 263.
This gives: busiest-avg = 122, avg - local = 110
So an imbalance of 110.
Without limiting it further, we would migrate all three weight-10
tasks over to local and run 3 cpus idle.
Now, running all eight weight-10 tasks on a single cpu and the one
weight-1024 task on another, keeping 6 cpus idle, is the 'fairest'
solution, but that's not the only goal; we also try to be
work-conserving, IOW keep as many cpus busy as possible.
^ permalink raw reply [flat|nested] 21+ messages in thread
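[Editorial note: Peter's example above can be recomputed with an idealized per-cpu capacity of 1024; his 122/110 figures come from real capacity values, so the exact numbers here differ slightly. Two groups of 4 cpus; local runs five weight-10 tasks (load 50), busiest runs three weight-10 tasks plus one of weight 1024 (load 1054). A sketch, not kernel code:]

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT 10

struct grp {
	unsigned long load;	/* sum of task weights */
	unsigned long capacity;	/* 1024 per cpu, idealized */
};

/* Per-capacity average load, as in the kernel's avg_load metric. */
static unsigned long avg_load(const struct grp *g)
{
	return (g->load << SCHED_CAPACITY_SHIFT) / g->capacity;
}
```

With these numbers: local avg 12, busiest avg 263, sd avg 138, so min(263 - 138, 138 - 12) = 125, an imbalance large enough to pull every small task out of the busiest group and leave cpus there idle, which is exactly the over-migration the capacity clamp prevents.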
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 15:49 ` Peter Zijlstra
@ 2014-07-29 17:04 ` Rik van Riel
0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2014-07-29 17:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar,
jhladky, ktkhai, tim.c.chen, Nicolas Pitre
On 07/29/2014 11:49 AM, Peter Zijlstra wrote:
> On Tue, Jul 29, 2014 at 11:15:49AM -0400, Rik van Riel wrote:
>
>>> Right, so we want that code for overloaded -> overloaded
>>> migrations so as not to cause idle cpus in an attempt to
>>> balance things. Idle cpus are worse than imbalance.
>>>
>>> But in case of overloaded/imb -> !overloaded migrations we can
>>> allow it, and in fact want to allow it in order to balance
>>> idle cpus.
>>
>> In case the destination is over the average load, or the source
>> is under the average load, fix_small_imbalance() determines
>> env->imbalance.
>>
>> The "load_above_capacity" calculation is only reached when
>> busiest is busier than average, and the destination is under the
>> average load. In that case, env->imbalance ends up as the minimum
>> of busiest - avg and avg - target.
>>
>> Is there any case where limiting it further to "load - capacity"
>> from the busiest domain makes a difference?
>
> sadly yes; suppose 8 cpus in 2 groups and 9 tasks, 8 tasks of
> weight 10, 1 of 1024. The local group will have 5 tasks of 10, the
> busiest will have the remaining 4.
Good point!
Looks like your patches address all issues raised, and the code
looks nicer than before.
--
All rights reversed
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-29 15:27 ` Peter Zijlstra
@ 2014-07-30 9:32 ` Vincent Guittot
2014-07-30 10:13 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Allow calculate_imbalance() to move idle cpus tip-bot for Peter Zijlstra
1 sibling, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2014-07-30 9:32 UTC (permalink / raw)
To: Peter Zijlstra, Rik van Riel
Cc: linux-kernel, Michael Neuling, Ingo Molnar, jhladky, ktkhai,
tim.c.chen, Nicolas Pitre
On 29 July 2014 17:27, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 29, 2014 at 04:59:10PM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 29, 2014 at 11:04:50AM +0200, Vincent Guittot wrote:
>> > > In situations where all the domains are overloaded, or where only the
>> > > busiest domain is overloaded, that code is also superfluous, since
>> > > the normal env->imbalance calculation will figure out how much to move.
>> > > Remove the load_above_capacity calculation.
>> >
>> > IMHO, we should not remove that part, which is used by prefer_sibling.
>> >
>> > Originally, we had 2 types of busiest group: overloaded or imbalanced.
>> > You add a new one which only has an avg_load higher than the others, so
>> > you should handle this new case and keep the other ones unchanged.
>>
>> Right, so we want that code for overloaded -> overloaded migrations so
>> as not to cause idle cpus in an attempt to balance things. Idle cpus are
>> worse than imbalance.
>>
>> But in case of overloaded/imb -> !overloaded migrations we can allow it,
>> and in fact want to allow it in order to balance idle cpus.
>
> Which would be patch 3/2
FWIW, you can add my ack for the new version of the patchset which
includes this patch
>
> ---
> Subject: sched,fair: Allow calculate_imbalance() to move idle cpus
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Jul 29 17:15:11 CEST 2014
>
> Allow calculate_imbalance() to 'create' idle cpus in the busiest group
> if there are idle cpus in the local group.
>
> Suggested-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/n/tip-7k95k4i2tjv78iivstggiude@git.kernel.org
> ---
> kernel/sched/fair.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6273,12 +6273,11 @@ static inline void calculate_imbalance(s
> return fix_small_imbalance(env, sds);
> }
>
> - if (busiest->group_type == group_overloaded) {
> - /*
> - * Don't want to pull so many tasks that a group would go idle.
> - * Except of course for the group_imb case, since then we might
> - * have to drop below capacity to reach cpu-load equilibrium.
> - */
> + /*
> + * If there aren't any idle cpus, avoid creating some.
> + */
> + if (busiest->group_type == group_overloaded &&
> + local->group_type == group_overloaded) {
> load_above_capacity =
> (busiest->sum_nr_running - busiest->group_capacity_factor);
>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH 1/2] sched: fix and clean up calculate_imbalance
2014-07-30 9:32 ` Vincent Guittot
@ 2014-07-30 10:13 ` Peter Zijlstra
0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2014-07-30 10:13 UTC (permalink / raw)
To: Vincent Guittot
Cc: Rik van Riel, linux-kernel, Michael Neuling, Ingo Molnar, jhladky,
ktkhai, tim.c.chen, Nicolas Pitre
On Wed, Jul 30, 2014 at 11:32:36AM +0200, Vincent Guittot wrote:
>
> FWIW, you can add my ack for the new version of the patchset which
> includes this patch
>
Thanks, done!
^ permalink raw reply [flat|nested] 21+ messages in thread
* [tip:sched/core] sched/fair: Make calculate_imbalance() independent
2014-07-29 15:26 ` Peter Zijlstra
@ 2014-08-12 14:52 ` tip-bot for Peter Zijlstra
0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-08-12 14:52 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, vincent.guittot,
tglx
Commit-ID: 743cb1ff191f00fee653212bdbcee1e56086d6ce
Gitweb: http://git.kernel.org/tip/743cb1ff191f00fee653212bdbcee1e56086d6ce
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 29 Jul 2014 17:00:21 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 12 Aug 2014 12:48:18 +0200
sched/fair: Make calculate_imbalance() independent
Rik noticed that calculate_imbalance() relies on
update_sd_pick_busiest() to guarantee that busiest->sum_nr_running >
busiest->group_capacity_factor.
Break this implicit assumption (with the intent of not providing it
anymore) by having calculate_imbalance() verify it and not rely on
others.
Reported-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20140729152631.GW12054@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa3c86..e9477e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6248,7 +6248,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return fix_small_imbalance(env, sds);
}
- if (!busiest->group_imb) {
+ if (busiest->sum_nr_running > busiest->group_capacity_factor) {
/*
* Don't want to pull so many tasks that a group would go idle.
* Except of course for the group_imb case, since then we might
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [tip:sched/core] sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd
2014-07-29 15:27 ` Peter Zijlstra
@ 2014-08-12 14:52 ` tip-bot for Rik van Riel
0 siblings, 0 replies; 21+ messages in thread
From: tip-bot for Rik van Riel @ 2014-08-12 14:52 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, vincent.guittot,
tglx, mikey
Commit-ID: caeb178c60f4f93f1b45c0bc056b5cf6d217b67f
Gitweb: http://git.kernel.org/tip/caeb178c60f4f93f1b45c0bc056b5cf6d217b67f
Author: Rik van Riel <riel@redhat.com>
AuthorDate: Mon, 28 Jul 2014 14:16:28 -0400
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 12 Aug 2014 12:48:19 +0200
sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd
Currently update_sd_pick_busiest only identifies the busiest sd
that is either overloaded, or has a group imbalance. When no
sd is imbalanced or overloaded, the load balancer fails to find
the busiest domain.
This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true when the busiest sd yet is encountered.
Groups are ranked in the order overloaded > imbalanced > other,
with higher ranked groups getting priority even when their load
is lower. This is necessary due to the possibility of unequal
capacities and cpumasks between domains within a sched group.
Behaviour for SD_ASYM_PACKING does not seem to match the comment,
but I have no hardware to test that so I have left the behaviour
of that code unchanged.
Enum for group classification suggested by Peter Zijlstra.
Signed-off-by: Rik van Riel <riel@redhat.com>
[peterz: replaced sg_lb_stats::group_imb with the new enum group_type
in an attempt to avoid endless recalculation]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Cc: ktkhai@parallels.com
Cc: tim.c.chen@linux.intel.com
Cc: nicolas.pitre@linaro.org
Cc: jhladky@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140729152743.GI3935@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++------------
1 file changed, 37 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9477e6..9437725 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5559,6 +5559,13 @@ static unsigned long task_h_load(struct task_struct *p)
#endif
/********** Helpers for find_busiest_group ************************/
+
+enum group_type {
+ group_other = 0,
+ group_imbalanced,
+ group_overloaded,
+};
+
/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
@@ -5572,7 +5579,7 @@ struct sg_lb_stats {
unsigned int group_capacity_factor;
unsigned int idle_cpus;
unsigned int group_weight;
- int group_imb; /* Is there an imbalance in the group ? */
+ enum group_type group_type;
int group_has_free_capacity;
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -5610,6 +5617,8 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.total_capacity = 0UL,
.busiest_stat = {
.avg_load = 0UL,
+ .sum_nr_running = 0,
+ .group_type = group_other,
},
};
}
@@ -5891,6 +5900,18 @@ static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *gro
return capacity_factor;
}
+static enum group_type
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ return group_overloaded;
+
+ if (sg_imbalanced(group))
+ return group_imbalanced;
+
+ return group_other;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -5942,9 +5963,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
sgs->group_weight = group->group_weight;
-
- sgs->group_imb = sg_imbalanced(group);
sgs->group_capacity_factor = sg_capacity_factor(env, group);
+ sgs->group_type = group_classify(group, sgs);
if (sgs->group_capacity_factor > sgs->sum_nr_running)
sgs->group_has_free_capacity = 1;
@@ -5968,13 +5988,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->busiest_stat.avg_load)
- return false;
+ struct sg_lb_stats *busiest = &sds->busiest_stat;
- if (sgs->sum_nr_running > sgs->group_capacity_factor)
+ if (sgs->group_type > busiest->group_type)
return true;
- if (sgs->group_imb)
+ if (sgs->group_type < busiest->group_type)
+ return false;
+
+ if (sgs->avg_load <= busiest->avg_load)
+ return false;
+
+ /* This is the busiest node in its class. */
+ if (!(env->sd->flags & SD_ASYM_PACKING))
return true;
/*
@@ -5982,8 +6008,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* numbered CPUs in the group, therefore mark all groups
* higher than ourself as busy.
*/
- if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
- env->dst_cpu < group_first_cpu(sg)) {
+ if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
if (!sds->busiest)
return true;
@@ -6228,7 +6253,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
local = &sds->local_stat;
busiest = &sds->busiest_stat;
- if (busiest->group_imb) {
+ if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
* to ensure cpu-load equilibrium, look at wider averages. XXX
@@ -6248,7 +6273,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return fix_small_imbalance(env, sds);
}
- if (busiest->sum_nr_running > busiest->group_capacity_factor) {
+ if (busiest->group_type == group_overloaded) {
/*
* Don't want to pull so many tasks that a group would go idle.
* Except of course for the group_imb case, since then we might
@@ -6337,7 +6362,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
* work because they assume all things are equal, which typically
* isn't true due to cpus_allowed constraints and the like.
*/
- if (busiest->group_imb)
+ if (busiest->group_type == group_imbalanced)
goto force_balance;
/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [tip:sched/core] sched/fair: Allow calculate_imbalance() to move idle cpus
2014-07-29 15:27 ` Peter Zijlstra
2014-07-30 9:32 ` Vincent Guittot
@ 2014-08-12 14:52 ` tip-bot for Peter Zijlstra
1 sibling, 0 replies; 21+ messages in thread
From: tip-bot for Peter Zijlstra @ 2014-08-12 14:52 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, torvalds, peterz, riel, vincent.guittot,
tglx
Commit-ID: 9a5d9ba6a3631d55c358fe1bdbaa162a97471a05
Gitweb: http://git.kernel.org/tip/9a5d9ba6a3631d55c358fe1bdbaa162a97471a05
Author: Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 29 Jul 2014 17:15:11 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 12 Aug 2014 12:48:20 +0200
sched/fair: Allow calculate_imbalance() to move idle cpus
Allow calculate_imbalance() to 'create' idle cpus in the busiest group
if there are idle cpus in the local group.
Suggested-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140729152705.GX12054@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/fair.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9437725..df1ed17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6273,12 +6273,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
return fix_small_imbalance(env, sds);
}
- if (busiest->group_type == group_overloaded) {
- /*
- * Don't want to pull so many tasks that a group would go idle.
- * Except of course for the group_imb case, since then we might
- * have to drop below capacity to reach cpu-load equilibrium.
- */
+ /*
+ * If there aren't any idle cpus, avoid creating some.
+ */
+ if (busiest->group_type == group_overloaded &&
+ local->group_type == group_overloaded) {
load_above_capacity =
(busiest->sum_nr_running - busiest->group_capacity_factor);
^ permalink raw reply related [flat|nested] 21+ messages in thread
end of thread, other threads: [~2014-08-12 14:55 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-28 18:16 [PATCH 0/2] load balancing fixes riel
2014-07-28 18:16 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance riel
2014-07-29 9:04 ` Vincent Guittot
2014-07-29 14:53 ` Rik van Riel
2014-07-29 15:31 ` Vincent Guittot
2014-07-29 15:39 ` Rik van Riel
2014-07-29 14:59 ` Peter Zijlstra
2014-07-29 15:15 ` Rik van Riel
2014-07-29 15:49 ` Peter Zijlstra
2014-07-29 17:04 ` Rik van Riel
2014-07-29 15:27 ` Peter Zijlstra
2014-07-30 9:32 ` Vincent Guittot
2014-07-30 10:13 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Allow calculate_imbalance() to move idle cpus tip-bot for Peter Zijlstra
2014-07-29 14:49 ` [PATCH 1/2] sched: fix and clean up calculate_imbalance Peter Zijlstra
2014-07-29 14:53 ` Peter Zijlstra
2014-07-29 15:26 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Make calculate_imbalance() independent tip-bot for Peter Zijlstra
2014-07-28 18:16 ` [PATCH 2/2] sched: make update_sd_pick_busiest return true on a busier sd riel
2014-07-29 15:27 ` Peter Zijlstra
2014-08-12 14:52 ` [tip:sched/core] sched/fair: Make update_sd_pick_busiest() return 'true' " tip-bot for Rik van Riel
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox