[PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Rik van Riel <riel@redhat.com>
Date: 2014-07-22 18:45 UTC
To: linux-kernel
Cc: peterz, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre

Currently update_sd_pick_busiest only returns true when an sd
is overloaded, or for SD_ASYM_PACKING when a domain is busier
than average and a higher numbered domain than the target.

This breaks load balancing between domains that are not overloaded,
in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
return true when the busiest sd yet is encountered.

On a 4 node system, this seems to result in the load balancer finally
putting 1 thread of a 4 thread test run of "perf bench numa mem" on
each node, where before the load was generally not spread across all
nodes.

Behaviour for SD_ASYM_PACKING does not seem to match the comment,
in that groups with below average load are ignored, but I have no
hardware to test that, so I have left the behaviour of that code
unchanged.

Cc: mikey@neuling.org
Cc: peterz@infradead.org
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/fair.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..ff4ddba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5942,16 +5942,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
	 * numbered CPUs in the group, therefore mark all groups
	 * higher than ourself as busy.
	 */
-	if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
-	    env->dst_cpu < group_first_cpu(sg)) {
-		if (!sds->busiest)
-			return true;
+	if (env->sd->flags & SD_ASYM_PACKING) {
+		if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
+			if (!sds->busiest)
+				return true;

-		if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
-			return true;
+			if (group_first_cpu(sds->busiest) > group_first_cpu(sg))
+				return true;
+		}
+
+		return false;
	}

-	return false;
+	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
+	return true;
 }

 #ifdef CONFIG_NUMA_BALANCING
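For readers following along, the behavioural change in the !SD_ASYM_PACKING
case can be modelled with a small user-space toy. Everything below (the
struct and the toy_* function names) is invented for illustration and is
not the kernel code itself; the real function also considers group
imbalance and other state omitted here:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the fields update_sd_pick_busiest() compares. */
struct toy_sg_stats {
	unsigned long avg_load;	/* candidate group's average load */
	bool overloaded;	/* more runnable tasks than capacity */
};

/*
 * Before the patch (!SD_ASYM_PACKING): after the avg_load check, only
 * an overloaded group could be picked; a busier but not overloaded
 * group fell through to "return false".
 */
bool toy_pick_busiest_old(const struct toy_sg_stats *sgs,
			  unsigned long busiest_avg_load)
{
	if (sgs->avg_load <= busiest_avg_load)
		return false;
	if (sgs->overloaded)
		return true;
	return false;	/* busier than the current busiest, yet ignored */
}

/* After the patch: a busier group is always picked. */
bool toy_pick_busiest_new(const struct toy_sg_stats *sgs,
			  unsigned long busiest_avg_load)
{
	if (sgs->avg_load <= busiest_avg_load)
		return false;
	/* See above: sgs->avg_load > busiest_avg_load */
	return true;
}
```

With an avg_load of 120 against a current busiest of 100 and no overload,
the old model refuses the group while the new one picks it, which is the
behavioural change the patch description is about.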
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Vincent Guittot
Date: 2014-07-23 7:41 UTC
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
> Currently update_sd_pick_busiest only returns true when an sd
> is overloaded, or for SD_ASYM_PACKING when a domain is busier
> than average and a higher numbered domain than the target.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> [quoted patch snipped]
>
> -	return false;
> +	/* See above: sgs->avg_load > sds->busiest_stat.avg_load */
> +	return true;

Hi Rik,

I can see one issue with a default return set to true. You increase
the number of times where we will not effectively migrate a task,
because we don't ensure that we will take the overloaded group if
there is one. We can be in a situation where a group is overloaded,
but load_balance will select a not overloaded group with an average
load higher than the sched_domain average value, just because it is
checked after.

Regarding your issue with "perf bench numa mem" not being spread
across all nodes: the SD_PREFER_SIBLING flag (of the DIE level)
should do the job by reducing the capacity of the "not local DIE"
group at NUMA level to 1 task during the load balance computation.
So you should have 1 task per sched_group at NUMA level.

Vincent
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Rik van Riel <riel@redhat.com>
Date: 2014-07-25 13:33 UTC
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 07/23/2014 03:41 AM, Vincent Guittot wrote:
> On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
>> [quoted patch snipped]
>
> Hi Rik,
>
> I can see one issue with a default return set to true. You increase
> the number of times where we will not effectively migrate a task,
> because we don't ensure that we will take the overloaded group if
> there is one. We can be in a situation where a group is overloaded,
> but load_balance will select a not overloaded group with an average
> load higher than the sched_domain average value, just because it is
> checked after.

Look at the first line of update_sd_pick_busiest():

static bool update_sd_pick_busiest(struct lb_env *env,
				   struct sd_lb_stats *sds,
				   struct sched_group *sg,
				   struct sg_lb_stats *sgs)
{
	if (sgs->avg_load <= sds->busiest_stat.avg_load)
		return false;

If the group load is less than the busiest, we have already returned
false, and will not get to the code that I changed.

> Regarding your issue with "perf bench numa mem" not being spread
> across all nodes: the SD_PREFER_SIBLING flag (of the DIE level)
> should do the job by reducing the capacity of the "not local DIE"
> group at NUMA level to 1 task during the load balance computation.
> So you should have 1 task per sched_group at NUMA level.

That did not actually happen in my tests. I almost always ended up
having only 0 tasks on one node.
--
All rights reversed
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Vincent Guittot
Date: 2014-07-25 14:29 UTC
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 25 July 2014 15:33, Rik van Riel <riel@redhat.com> wrote:
> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>> On 22 July 2014 20:45, Rik van Riel <riel@redhat.com> wrote:
>>> [quoted patch snipped]
>>
>> I can see one issue with a default return set to true. [...]
>
> Look at the first line of update_sd_pick_busiest():
>
> static bool update_sd_pick_busiest(struct lb_env *env,
>				    struct sd_lb_stats *sds,
>				    struct sched_group *sg,
>				    struct sg_lb_stats *sgs)
> {
>	if (sgs->avg_load <= sds->busiest_stat.avg_load)
>		return false;
>
> If the group load is less than the busiest, we have already returned
> false, and will not get to the code that I changed.
My point was that if a sched_group A with 1 task has got a higher
avg_load than a sched_group B with 2 tasks, we will select
sched_group A whereas we should select the other group.

Furthermore, update_sd_lb_stats will always return a busiest group,
even an idle one. This will increase the number of failed load
balances and the time spent in them.

Vincent

>> Regarding your issue with "perf bench numa mem" not being spread
>> across all nodes: the SD_PREFER_SIBLING flag (of the DIE level)
>> should do the job by reducing the capacity of the "not local DIE"
>> group at NUMA level to 1 task during the load balance computation.
>> So you should have 1 task per sched_group at NUMA level.
>
> That did not actually happen in my tests. I almost always ended up
> having only 0 tasks on one node.
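Vincent's scenario can be made concrete with invented numbers (the
helpers below are illustrative only; the real average-load computation
lives in update_sg_lb_stats, and 1024 stands in for the kernel's
SCHED_CAPACITY_SCALE):

```c
#include <assert.h>

/* Toy version of the per-group average load: load scaled by capacity. */
unsigned long toy_avg_load(unsigned long group_load,
			   unsigned long group_capacity)
{
	/* mirrors group_load * SCHED_CAPACITY_SCALE / group_capacity */
	return group_load * 1024 / group_capacity;
}

/* A plain avg_load comparison, as in the patched default path. */
char toy_pick(unsigned long avg_load_a, unsigned long avg_load_b)
{
	return avg_load_a > avg_load_b ? 'A' : 'B';
}
```

If group A runs one heavy task (load 3000) and group B two lighter ones
(total load 2000), both with capacity 1024, the bare comparison picks A,
even though moving A's only task would merely relocate the load rather
than balance it. This is the failure mode Vincent describes.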
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Rik van Riel <riel@redhat.com>
Date: 2014-07-25 14:46 UTC
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 07/25/2014 10:29 AM, Vincent Guittot wrote:
> On 25 July 2014 15:33, Rik van Riel <riel@redhat.com> wrote:
>> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>>> [quoted patch and earlier discussion snipped]
>> Look at the first line of update_sd_pick_busiest():
>>
>> static bool update_sd_pick_busiest(struct lb_env *env,
>>				     struct sd_lb_stats *sds,
>>				     struct sched_group *sg,
>>				     struct sg_lb_stats *sgs)
>> {
>>	if (sgs->avg_load <= sds->busiest_stat.avg_load)
>>		return false;
>>
>> If the group load is less than the busiest, we have already returned
>> false, and will not get to the code that I changed.
>
> My point was that if a sched_group A with 1 task has got a higher
> avg_load than a sched_group B with 2 tasks, we will select
> sched_group A whereas we should select the other group.

The code already does that, with or without my patch. If it runs into
group A first, that "return false" above will be hit for group B.

> Furthermore, update_sd_lb_stats will always return a busiest group,
> even an idle one. This will increase the number of failed load
> balances and the time spent in them.

If the busiest group found is idle, surely find_busiest_group will
see that and goto out_balanced? There are several safety checks in
find_busiest_group to make sure NULL is returned when the imbalance
found is too small to bother doing anything about.

--
All rights reversed
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Rik van Riel <riel@redhat.com>
Date: 2014-07-25 14:02 UTC
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 07/23/2014 03:41 AM, Vincent Guittot wrote:
> Regarding your issue with "perf bench numa mem" not being spread
> across all nodes: the SD_PREFER_SIBLING flag (of the DIE level)
> should do the job by reducing the capacity of the "not local DIE"
> group at NUMA level to 1 task during the load balance computation.
> So you should have 1 task per sched_group at NUMA level.

Looking at the code some more, it is clear why this does not happen.
If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will never be set.

On a related note, that part of the load balancing code probably
needs to be rewritten to deal with unequal group_capacity_factors
anyway.

Say that one group has a group_capacity_factor twice that of another
group. The group with the smaller group_capacity_factor is overloaded
by a factor 1.3, while the larger group is loaded by a factor 0.8.
This means the larger group has a higher load than the first group,
and the current code in update_sd_pick_busiest will not select the
overloaded group as the busiest one, due to not scaling load with
the capacity...

static bool update_sd_pick_busiest(struct lb_env *env,
				   struct sd_lb_stats *sds,
				   struct sched_group *sg,
				   struct sg_lb_stats *sgs)
{
	if (sgs->avg_load <= sds->busiest_stat.avg_load)
		return false;

I believe we may need to factor the group_capacity_factor into this
calculation, in order to properly identify which group is busiest.

However, if we do that, we may need to get rid of the
SD_PREFER_SIBLING hack that forces group_capacity_factor to 1 on
domains that have SD_PREFER_SIBLING set.

I suspect that should be ok though, if we make sure
update_sd_pick_busiest does the right thing...

--
All rights reversed
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Peter Zijlstra
Date: 2014-07-25 14:15 UTC
To: Rik van Riel
Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On Fri, Jul 25, 2014 at 10:02:43AM -0400, Rik van Riel wrote:
> On a related note, that part of the load balancing code probably
> needs to be rewritten to deal with unequal group_capacity_factors
> anyway.
>
> Say that one group has a group_capacity_factor twice that of
> another group.
>
> The group with the smaller group_capacity_factor is overloaded
> by a factor 1.3, while the larger group is loaded by a factor 0.8.
> This means the larger group has a higher load than the first group,
> and the current code in update_sd_pick_busiest will not select the
> overloaded group as the busiest one, due to not scaling load with
> the capacity...
>
> I believe we may need to factor the group_capacity_factor into this
> calculation, in order to properly identify which group is busiest.

(Sorry, going through this backwards; I'll get to the actual patch
in a bit.)

Note how in update_sg_lb_stats(), where we compute sgs->avg_load,
we do divide by sgs->group_capacity. (Also, curse this renaming of
stuff.)

The group_capacity_factor is something ugly and Vincent was going to
poke at that.
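A back-of-the-envelope check of Peter's point (constants invented for
the example; 1024 stands in for the kernel's SCHED_CAPACITY_SCALE)
shows why the division already handles the unequal-capacity scenario
Rik describes:

```c
#include <assert.h>

/* Toy model of the division Peter points at in update_sg_lb_stats(). */
unsigned long toy_avg_load(unsigned long group_load,
			   unsigned long group_capacity)
{
	/* capacity-weighted average: load * scale / capacity */
	return group_load * 1024 / group_capacity;
}
```

A group of capacity 1024 overloaded by 1.3x (load 1331) reports an
avg_load of 1331, while a twice-as-big group of capacity 2048 loaded at
0.8x (load 1638) reports only 819. The overloaded group still compares
as busier despite carrying less total load, so the avg_load comparison
is not blind to capacity.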
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Vincent Guittot
Date: 2014-07-25 15:02 UTC
To: Rik van Riel
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 25 July 2014 16:02, Rik van Riel <riel@redhat.com> wrote:
> On 07/23/2014 03:41 AM, Vincent Guittot wrote:
>> Regarding your issue with "perf bench numa mem" not being spread
>> across all nodes: the SD_PREFER_SIBLING flag (of the DIE level)
>> should do the job by reducing the capacity of the "not local DIE"
>> group at NUMA level to 1 task during the load balance computation.
>> So you should have 1 task per sched_group at NUMA level.
>
> Looking at the code some more, it is clear why this does not happen.
> If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will never be set.

I don't have a lot of experience with NUMA systems and how their
sched_domain topology is described, but IIUC you don't have any
sched_domain levels other than the NUMA ones? Otherwise the flag
should be present in one of the non-NUMA levels (SMT, MC or DIE).

> On a related note, that part of the load balancing code probably
> needs to be rewritten to deal with unequal group_capacity_factors
> anyway.
>
> [capacity factor example snipped]

AFAICT, sgs->avg_load is weighted by the capacity in
update_sg_lb_stats.

> I believe we may need to factor the group_capacity_factor into this
> calculation, in order to properly identify which group is busiest.
>
> However, if we do that, we may need to get rid of the
> SD_PREFER_SIBLING hack that forces group_capacity_factor to 1 on
> domains that have SD_PREFER_SIBLING set.

I'm working on a patchset that gets rid of capacity_factor (as
mentioned by Peter) and directly uses capacity instead. I should
send the v4 next week.

Vincent
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Rik van Riel <riel@redhat.com>
Date: 2014-07-25 15:13 UTC
To: Vincent Guittot
Cc: linux-kernel, Peter Zijlstra, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre

On 07/25/2014 11:02 AM, Vincent Guittot wrote:
> On 25 July 2014 16:02, Rik van Riel <riel@redhat.com> wrote:
>> Looking at the code some more, it is clear why this does not
>> happen. If sd->flags & SD_NUMA, then SD_PREFER_SIBLING will never
>> be set.
>
> I don't have a lot of experience with NUMA systems and how their
> sched_domain topology is described, but IIUC you don't have any
> sched_domain levels other than the NUMA ones? Otherwise the flag
> should be present in one of the non-NUMA levels (SMT, MC or DIE).

The system I am testing on has 3 or 4 sched_domain levels: one for
each HT sibling(?), one for each core, one for each node/socket, and
one parent domain for the whole system.

SD_PREFER_SIBLING should be set at the HT sibling level and at the
core level. However, it is not set at the levels above that.

That means the SD_PREFER_SIBLING flag does its thing within each CPU
core and between cores on a socket, but not between NUMA nodes...

>> On a related note, that part of the load balancing code probably
>> needs to be rewritten to deal with unequal group_capacity_factors
>> anyway.
>
> AFAICT, sgs->avg_load is weighted by the capacity in
> update_sg_lb_stats

Indeed, I dug into that code after sending the email, and found that
piece of the code just before I read Peter's email pointing it out
to me.

> I'm working on a patchset that gets rid of capacity_factor (as
> mentioned by Peter) and directly uses capacity instead. I should
> send the v4 next week.

I am looking forward to anything that will make this code easier to
follow :)

--
All rights reversed
Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd

From: Peter Zijlstra
Date: 2014-07-25 15:27 UTC
To: Rik van Riel
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre

On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote:
> Currently update_sd_pick_busiest only returns true when an sd
> is overloaded, or for SD_ASYM_PACKING when a domain is busier
> than average and a higher numbered domain than the target.
>
> This breaks load balancing between domains that are not overloaded,
> in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest
> return true when the busiest sd yet is encountered.
>
> On a 4 node system, this seems to result in the load balancer finally
> putting 1 thread of a 4 thread test run of "perf bench numa mem" on
> each node, where before the load was generally not spread across all
> nodes.

So for !ASYM the code is effectively:

	return sgs->avg_load > sds->busiest_stat.avg_load;

I'd like to at least add a clause that prioritizes overloaded groups
over !overloaded groups.

Also, like we found earlier, calculate_imbalance() relies on the
sum_nr_running > group_capacity_factor thing, which you've just
'wrecked', so we'd need an update to that part too.

> Behaviour for SD_ASYM_PACKING does not seem to match the comment,
> in that groups with below average load are ignored, but I have no
> hardware to test that, so I have left the behaviour of that code
> unchanged.

Mikey, does that stuff work as expected?
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 15:27 ` Peter Zijlstra @ 2014-07-25 15:45 ` Rik van Riel 2014-07-25 16:05 ` Peter Zijlstra 2014-07-27 23:57 ` [PATCH] " Michael Neuling 1 sibling, 1 reply; 19+ messages in thread From: Rik van Riel @ 2014-07-25 15:45 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 07/25/2014 11:27 AM, Peter Zijlstra wrote: > On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote: >> Currently update_sd_pick_busiest only returns true when an sd is >> overloaded, or for SD_ASYM_PACKING when a domain is busier than >> average and a higher numbered domain than the target. >> >> This breaks load balancing between domains that are not >> overloaded, in the !SD_ASYM_PACKING case. This patch makes >> update_sd_pick_busiest return true when the busiest sd yet is >> encountered. >> >> On a 4 node system, this seems to result in the load balancer >> finally putting 1 thread of a 4 thread test run of "perf bench >> numa mem" on each node, where before the load was generally not >> spread across all nodes. > > So for !ASYM the code is effectively: > > return sgs->avg_load > sds->busiest_stat.avg_load; > > I'd like to at least add a clause that makes overloaded groups > prioritized over !overloaded groups. > > Also, like we found earlier, calculate_imbalance() relies on the > sum_nr_running > group_capacity_factor thing, which you've just > 'wrecked', so we'd need an update to that part too. 
I guess that would mean update_sd_pick_busiest would look like this
for the !ASYM case:

1) remembering whether busiest is overloaded

2) if (sds->busiest_stat.overloaded && !sgs->overloaded)
       return false;

3) if (sgs->avg_load > sds->busiest_stat.avg_load)
       return true;

>> Behaviour for SD_ASYM_PACKING does not seem to match the
>> comment, in that groups with below average load average are
>> ignored, but I have no hardware to test that so I have left the
>> behaviour of that code unchanged.
>
> Mikey, does that stuff work as expected?

I suspect it does not, due to the checks above the SD_ASYM_PACKING
code occasionally overriding the SD_ASYM_PACKING code.

Also, the ASYM code may rely on CPU numbers not being interleaved
between groups; the "env->dst_cpu < group_first_cpu(sg)" check would
probably fail to pull all load onto group 0 if CPU numbers were
distributed like this:

group 0: 0 2 4 6
group 1: 1 3 5 7

- --
All rights reversed
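The three-step !ASYM sequence sketched above can be written out as a tiny standalone C program. The struct and the `overloaded` flag are hypothetical simplifications (step 1's remembered state), not the kernel's actual `sg_lb_stats`:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical minimal stats: "overloaded" is the state step 1)
 * proposes remembering alongside the current busiest group. */
struct sg_stats {
	unsigned long avg_load;
	bool overloaded;	/* sum_nr_running > group_capacity_factor */
};

/* Steps 2) and 3) of the sketch above, for the !ASYM case. */
static bool pick_busiest(const struct sg_stats *busiest,
			 const struct sg_stats *sgs)
{
	/* 2) never displace an overloaded group with a non-overloaded one */
	if (busiest->overloaded && !sgs->overloaded)
		return false;

	/* 3) otherwise decide on weighted average load */
	return sgs->avg_load > busiest->avg_load;
}
```

With an overloaded busiest group at avg_load 900, a heavier but non-overloaded candidate at 1200 no longer displaces it; between two non-overloaded groups, plain load comparison decides.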
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd
  2014-07-25 15:45 ` Rik van Riel
@ 2014-07-25 16:05 ` Peter Zijlstra
  2014-07-25 16:22 ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2014-07-25 16:05 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre

On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote:
> > So for !ASYM the code is effectively:
> >
> > return sgs->avg_load > sds->busiest_stat.avg_load;
> >
> > I'd like to at least add a clause that makes overloaded groups
> > prioritized over !overloaded groups.
> >
> > Also, like we found earlier, calculate_imbalance() relies on the
> > sum_nr_running > group_capacity_factor thing, which you've just
> > 'wrecked', so we'd need an update to that part too.
>
> I guess that would mean update_sd_pick_busiest would look like
> this for the !ASYM case:
>
> 1) remembering whether busiest is overloaded
>
> 2) if (sds->busiest_stat.overloaded && !sgs->overloaded)
>        return false;
>
> 3) if (sgs->avg_load > sds->busiest_stat.avg_load)
>        return true;

Right. So the situation I was worried about is where we have say 8
+nice tasks in one group and 2 -nice in another. So the group with 2
tasks can actually be 'heavier' but not fully utilized. In that case
we want to pull tasks from the lighter but overloaded group.
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 16:05 ` Peter Zijlstra @ 2014-07-25 16:22 ` Rik van Riel 2014-07-25 17:57 ` Vincent Guittot 0 siblings, 1 reply; 19+ messages in thread From: Rik van Riel @ 2014-07-25 16:22 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, mikey, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre, Vincent Guittot -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 07/25/2014 12:05 PM, Peter Zijlstra wrote: > On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote: >>> So for !ASYM the code is effectively: >>> >>> return sgs->avg_load > sds->busiest_stat.avg_load; >>> >>> I'd like to at least add a clause that makes overloaded groups >>> prioritized over !overloaded groups. >>> >>> Also, like we found earlier, calculate_imbalance() relies on >>> the sum_nr_running > group_capacity_factor thing, which you've >>> just 'wrecked', so we'd need an update to that part too. >> >> I guess that would mean update_sd_pick_busiest would look like >> this for the !ASYM case: >> >> 1) remembering whether busiest is overloaded >> >> 2) if (sgs->busiest_stat.overloaded && !sgs->overloaded) return >> false; >> >> 3) if (sgs->avg_load > sds->busiest_stat.avg_load) return true; > > Right. So the situation I was worried about is where we have say 8 > +nice tasks in one group and 2 -nice in another. So the group with > 2 tasks can actually be 'heavier' but not fully utilized. In that > case we want to pull tasks from the lighter but overloaded group. I can certainly create a new patch that makes update_sd_pick_busiest do just that. Vincent, any objections to me fixing update_sd_pick_busiest now, or will that conflict with your code cleanups? 
- --
All rights reversed
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 16:22 ` Rik van Riel @ 2014-07-25 17:57 ` Vincent Guittot 2014-07-25 19:32 ` [PATCH v2] " Rik van Riel 0 siblings, 1 reply; 19+ messages in thread From: Vincent Guittot @ 2014-07-25 17:57 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre On 25 July 2014 18:22, Rik van Riel <riel@redhat.com> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 07/25/2014 12:05 PM, Peter Zijlstra wrote: >> On Fri, Jul 25, 2014 at 11:45:02AM -0400, Rik van Riel wrote: >>>> So for !ASYM the code is effectively: >>>> >>>> return sgs->avg_load > sds->busiest_stat.avg_load; >>>> >>>> I'd like to at least add a clause that makes overloaded groups >>>> prioritized over !overloaded groups. >>>> >>>> Also, like we found earlier, calculate_imbalance() relies on >>>> the sum_nr_running > group_capacity_factor thing, which you've >>>> just 'wrecked', so we'd need an update to that part too. >>> >>> I guess that would mean update_sd_pick_busiest would look like >>> this for the !ASYM case: >>> >>> 1) remembering whether busiest is overloaded >>> >>> 2) if (sgs->busiest_stat.overloaded && !sgs->overloaded) return >>> false; >>> >>> 3) if (sgs->avg_load > sds->busiest_stat.avg_load) return true; >> >> Right. So the situation I was worried about is where we have say 8 >> +nice tasks in one group and 2 -nice in another. So the group with >> 2 tasks can actually be 'heavier' but not fully utilized. In that >> case we want to pull tasks from the lighter but overloaded group. > > I can certainly create a new patch that makes update_sd_pick_busiest > do just that. > > Vincent, any objections to me fixing update_sd_pick_busiest now, or > will that conflict with your code cleanups? 
It will probably conflict with my patchset, which also modifies
update_sd_pick_busiest, but we should be able to manage the conflict.
So feel free to send a patch that fixes the case described above, and
we will see how to handle the conflict when it occurs.

Vincent

>
> - --
> All rights reversed
* [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 17:57 ` Vincent Guittot @ 2014-07-25 19:32 ` Rik van Riel 2014-07-28 8:23 ` Vincent Guittot 2014-07-28 14:30 ` Peter Zijlstra 0 siblings, 2 replies; 19+ messages in thread From: Rik van Riel @ 2014-07-25 19:32 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre Subject: sched: make update_sd_pick_busiest return true on a busier sd Currently update_sd_pick_busiest only identifies the busiest sd that is either overloaded, or has a group imbalance. When no sd is imbalanced or overloaded, the load balancer fails to find the busiest domain. This breaks load balancing between domains that are not overloaded, in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest return true when the busiest sd yet is encountered. Behaviour for SD_ASYM_PACKING does not seem to match the comment, but I have no hardware to test that so I have left the behaviour of that code unchanged. It is unclear what to do with the group_imb condition. Should group_imb override a busier load? If so, should we fix calculate_imbalance to return a sensible number when the "busiest" node found has a below average load? We probably need to fix calculate_imbalance anyway, to deal with an overloaded group that happens to have a below average load... 
Cc: mikey@neuling.org Cc: peterz@infradead.org Signed-off-by: Rik van Riel <riel@redhat.com> --- kernel/sched/fair.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 45943b2..c96044f 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5949,6 +5949,11 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_has_free_capacity = 1; } +static bool group_overloaded(struct sg_lb_stats *sgs) +{ + return sgs->sum_nr_running > sgs->group_capacity_factor; +} + /** * update_sd_pick_busiest - return 1 on busiest group * @env: The load balancing environment. @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, * @sgs: sched_group statistics * * Determine if @sg is a busier group than the previously selected - * busiest group. + * busiest group. * * Return: %true if @sg is a busier group than the previously selected * busiest group. %false otherwise. @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, struct sched_group *sg, struct sg_lb_stats *sgs) { + if (group_overloaded(sgs) && !group_overloaded(&sds->busiest_stat)) + return true; + if (sgs->avg_load <= sds->busiest_stat.avg_load) return false; - if (sgs->sum_nr_running > sgs->group_capacity_factor) + if (sgs->group_imb) return true; - if (sgs->group_imb) + /* This is the busiest node. */ + if (!(env->sd->flags & SD_ASYM_PACKING)) return true; /* @@ -5981,8 +5990,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * numbered CPUs in the group, therefore mark all groups * higher than ourself as busy. */ - if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running && - env->dst_cpu < group_first_cpu(sg)) { + if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) { if (!sds->busiest) return true; ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 19:32 ` [PATCH v2] " Rik van Riel @ 2014-07-28 8:23 ` Vincent Guittot 2014-07-28 15:04 ` Rik van Riel 2014-07-28 14:30 ` Peter Zijlstra 1 sibling, 1 reply; 19+ messages in thread From: Vincent Guittot @ 2014-07-28 8:23 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre On 25 July 2014 21:32, Rik van Riel <riel@redhat.com> wrote: > Subject: sched: make update_sd_pick_busiest return true on a busier sd > > Currently update_sd_pick_busiest only identifies the busiest sd > that is either overloaded, or has a group imbalance. When no > sd is imbalanced or overloaded, the load balancer fails to find > the busiest domain. > > This breaks load balancing between domains that are not overloaded, > in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest > return true when the busiest sd yet is encountered. > > Behaviour for SD_ASYM_PACKING does not seem to match the comment, > but I have no hardware to test that so I have left the behaviour > of that code unchanged. > > It is unclear what to do with the group_imb condition. > Should group_imb override a busier load? If so, should we fix IMHO, group_imb should have a lower priority compared to overloaded group because it generates active migration whereas the use of overloaded group could solve the imbalance with normal migration Then, AFAICT, we already have a special way to compute imbalance when group_imb is set > calculate_imbalance to return a sensible number when the "busiest" > node found has a below average load? We probably need to fix > calculate_imbalance anyway, to deal with an overloaded group that > happens to have a below average load... 
> > Cc: mikey@neuling.org > Cc: peterz@infradead.org > Signed-off-by: Rik van Riel <riel@redhat.com> > --- > kernel/sched/fair.c | 18 +++++++++++++----- > 1 file changed, 13 insertions(+), 5 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 45943b2..c96044f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -5949,6 +5949,11 @@ static inline void update_sg_lb_stats(struct lb_env *env, > sgs->group_has_free_capacity = 1; > } > > +static bool group_overloaded(struct sg_lb_stats *sgs) > +{ > + return sgs->sum_nr_running > sgs->group_capacity_factor; > +} > + > /** > * update_sd_pick_busiest - return 1 on busiest group > * @env: The load balancing environment. > @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, > * @sgs: sched_group statistics > * > * Determine if @sg is a busier group than the previously selected > - * busiest group. > + * busiest group. > * > * Return: %true if @sg is a busier group than the previously selected > * busiest group. %false otherwise. > @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, > struct sched_group *sg, > struct sg_lb_stats *sgs) > { > + if (group_overloaded(sgs) && !group_overloaded(&sds->busiest_stat)) The 1st time you run update_sd_pick_busiest, group_capacity_factor and sum_nr_running of sds->busiest_stat are uninitialized. 
> + return true; +

IIUC your new test sequence, you haven't solved the following use case:

group A has 3 tasks and is overloaded
group B has 2 tasks and is not overloaded, but its avg_load is higher
than group A's (either because of nice priority or because of average
runnable time)

The test of group A will return true because it is overloaded and the
busiest_stat is still empty. But the test of group B will also return
true because of !(env->sd->flags & SD_ASYM_PACKING).

So in the end you are selecting group B, which is not overloaded.

>         if (sgs->avg_load <= sds->busiest_stat.avg_load)
>                 return false;
>
> -       if (sgs->sum_nr_running > sgs->group_capacity_factor)
> +       if (sgs->group_imb)
>                 return true;
>
> -       if (sgs->group_imb)
> +       /* This is the busiest node. */
> +       if (!(env->sd->flags & SD_ASYM_PACKING))
>                 return true;
>
>         /*
> @@ -5981,8 +5990,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>          * numbered CPUs in the group, therefore mark all groups
>          * higher than ourself as busy.
>          */
> -       if ((env->sd->flags & SD_ASYM_PACKING) && sgs->sum_nr_running &&
> -           env->dst_cpu < group_first_cpu(sg)) {
> +       if (sgs->sum_nr_running && env->dst_cpu < group_first_cpu(sg)) {
>                 if (!sds->busiest)
>                         return true;
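Vincent's two-step trace can be reproduced with a standalone sketch of the v2 decision sequence (hypothetical struct and numbers; the capacities are chosen so that A is overloaded while B is merely heavier):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified stats: only the fields the v2 checks read. */
struct sg_stats {
	unsigned long avg_load;
	unsigned int sum_nr_running;
	unsigned long group_capacity_factor;
	bool group_imb;
};

static bool group_overloaded(const struct sg_stats *sgs)
{
	return sgs->sum_nr_running > sgs->group_capacity_factor;
}

/* The v2 sequence for the !SD_ASYM_PACKING case, in patch order. */
static bool v2_pick_busiest(const struct sg_stats *busiest,
			    const struct sg_stats *sgs)
{
	if (group_overloaded(sgs) && !group_overloaded(busiest))
		return true;
	if (sgs->avg_load <= busiest->avg_load)
		return false;
	if (sgs->group_imb)
		return true;
	return true;	/* !SD_ASYM_PACKING: busiest group so far */
}
```

Group A first replaces the (empty) busiest via the overloaded check, but group B then also returns true on the plain load comparison, so the non-overloaded group ends up selected — exactly the problem described above.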
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-28 8:23 ` Vincent Guittot @ 2014-07-28 15:04 ` Rik van Riel 0 siblings, 0 replies; 19+ messages in thread From: Rik van Riel @ 2014-07-28 15:04 UTC (permalink / raw) To: Vincent Guittot Cc: Peter Zijlstra, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 07/28/2014 04:23 AM, Vincent Guittot wrote: > On 25 July 2014 21:32, Rik van Riel <riel@redhat.com> wrote: >> Subject: sched: make update_sd_pick_busiest return true on a >> busier sd >> >> Currently update_sd_pick_busiest only identifies the busiest sd >> that is either overloaded, or has a group imbalance. When no sd >> is imbalanced or overloaded, the load balancer fails to find the >> busiest domain. >> >> This breaks load balancing between domains that are not >> overloaded, in the !SD_ASYM_PACKING case. This patch makes >> update_sd_pick_busiest return true when the busiest sd yet is >> encountered. >> >> Behaviour for SD_ASYM_PACKING does not seem to match the >> comment, but I have no hardware to test that so I have left the >> behaviour of that code unchanged. >> >> It is unclear what to do with the group_imb condition. Should >> group_imb override a busier load? If so, should we fix > > IMHO, group_imb should have a lower priority compared to > overloaded group because it generates active migration whereas the > use of overloaded group could solve the imbalance with normal > migration Then, AFAICT, we already have a special way to compute > imbalance when group_imb is set > >> calculate_imbalance to return a sensible number when the >> "busiest" node found has a below average load? We probably need >> to fix calculate_imbalance anyway, to deal with an overloaded >> group that happens to have a below average load... 
>> >> Cc: mikey@neuling.org Cc: peterz@infradead.org Signed-off-by: Rik >> van Riel <riel@redhat.com> --- kernel/sched/fair.c | 18 >> +++++++++++++----- 1 file changed, 13 insertions(+), 5 >> deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index >> 45943b2..c96044f 100644 --- a/kernel/sched/fair.c +++ >> b/kernel/sched/fair.c @@ -5949,6 +5949,11 @@ static inline void >> update_sg_lb_stats(struct lb_env *env, >> sgs->group_has_free_capacity = 1; } >> >> +static bool group_overloaded(struct sg_lb_stats *sgs) +{ + >> return sgs->sum_nr_running > sgs->group_capacity_factor; +} + >> /** * update_sd_pick_busiest - return 1 on busiest group * @env: >> The load balancing environment. @@ -5957,7 +5962,7 @@ static >> inline void update_sg_lb_stats(struct lb_env *env, * @sgs: >> sched_group statistics * * Determine if @sg is a busier group >> than the previously selected - * busiest group. + * busiest >> group. * * Return: %true if @sg is a busier group than the >> previously selected * busiest group. %false otherwise. @@ >> -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct >> lb_env *env, struct sched_group *sg, struct sg_lb_stats *sgs) { + >> if (group_overloaded(sgs) && >> !group_overloaded(&sds->busiest_stat)) > > The 1st time you run update_sd_pick_busiest, group_capacity_factor > and sum_nr_running of sds->busiest_stat are uninitialized. Good point, init_sd_lb_stats only zeroes out a few fields. I will fix this in the next version. >> + return true; + > > IIUC your new test sequence, you haven't solved the following use > case: Fixed in the next version, with Peter's idea. 
- --
All rights reversed
* Re: [PATCH v2] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 19:32 ` [PATCH v2] " Rik van Riel 2014-07-28 8:23 ` Vincent Guittot @ 2014-07-28 14:30 ` Peter Zijlstra 1 sibling, 0 replies; 19+ messages in thread From: Peter Zijlstra @ 2014-07-28 14:30 UTC (permalink / raw) To: Rik van Riel Cc: Vincent Guittot, linux-kernel, Michael Neuling, Ingo Molnar, Paul Turner, jhladky, ktkhai, tim.c.chen, Nicolas Pitre [-- Attachment #1: Type: text/plain, Size: 2975 bytes --] On Fri, Jul 25, 2014 at 03:32:10PM -0400, Rik van Riel wrote: > Subject: sched: make update_sd_pick_busiest return true on a busier sd > > Currently update_sd_pick_busiest only identifies the busiest sd > that is either overloaded, or has a group imbalance. When no > sd is imbalanced or overloaded, the load balancer fails to find > the busiest domain. > > This breaks load balancing between domains that are not overloaded, > in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest > return true when the busiest sd yet is encountered. > > Behaviour for SD_ASYM_PACKING does not seem to match the comment, > but I have no hardware to test that so I have left the behaviour > of that code unchanged. > > It is unclear what to do with the group_imb condition. > Should group_imb override a busier load? If so, should we fix > calculate_imbalance to return a sensible number when the "busiest" > node found has a below average load? We probably need to fix > calculate_imbalance anyway, to deal with an overloaded group that > happens to have a below average load... I think we want overloaded > group_imb > other. So prefer overloaded groups, imbalanced groups if no overloaded and anything else if no overloaded and imbalanced thingies. If you look at the comment near sg_imbalanced(), in that case we want to move tasks from the first group to the second, even though the second group would be the heaviest. 
enum group_type { group_other = 0, group_imbalanced, group_overloaded, }; static enum group_type group_classify(struct sg_lb_stats *sgs) { if (sgs->sum_nr_running > sgs->group_capacity_factor) return group_overloaded; if (sgs->group_imb) return group_imbalanced; return group_other; } > /** > * update_sd_pick_busiest - return 1 on busiest group > * @env: The load balancing environment. > @@ -5957,7 +5962,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, > * @sgs: sched_group statistics > * > * Determine if @sg is a busier group than the previously selected > - * busiest group. > + * busiest group. We really need that extra trailing whitespace, yes? ;-) > * Return: %true if @sg is a busier group than the previously selected > * busiest group. %false otherwise. > @@ -5967,13 +5972,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, > struct sched_group *sg, > struct sg_lb_stats *sgs) > { if (group_classify(sgs) < group_classify(&sds->busiest_stats)) return false; > if (sgs->avg_load <= sds->busiest_stat.avg_load) > return false; > > + /* This is the busiest node. */ > + if (!(env->sd->flags & SD_ASYM_PACKING)) > return true; > > /* We could replace sg_lb_stats::group_imb with the above and avoid the endless recomputation I suppose. Also, we still need a little change to calculate_imbalance() where we assume sum_nr_running > group_capacity_factor. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 19+ messages in thread
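Peter's classification can be folded into the pick roughly as follows. This is a sketch, not the final kernel code; the "strictly higher class wins outright" clause goes beyond the snippet above and is included as an assumption to cover Vincent's overloaded-but-lighter example:

```c
#include <assert.h>
#include <stdbool.h>

enum group_type {
	group_other = 0,
	group_imbalanced,
	group_overloaded,
};

/* Hypothetical simplified stats; field names follow sg_lb_stats. */
struct sg_stats {
	unsigned long avg_load;
	unsigned int sum_nr_running;
	unsigned long group_capacity_factor;
	bool group_imb;
};

static enum group_type group_classify(const struct sg_stats *sgs)
{
	if (sgs->sum_nr_running > sgs->group_capacity_factor)
		return group_overloaded;
	if (sgs->group_imb)
		return group_imbalanced;
	return group_other;
}

/*
 * A strictly higher class wins outright, a strictly lower class
 * loses outright, and avg_load only breaks ties within a class:
 * overloaded > group_imb > other, as discussed above.
 */
static bool pick_busiest(const struct sg_stats *busiest,
			 const struct sg_stats *sgs)
{
	if (group_classify(sgs) > group_classify(busiest))
		return true;
	if (group_classify(sgs) < group_classify(busiest))
		return false;
	return sgs->avg_load > busiest->avg_load;
}
```

With this ordering, an overloaded group beats a heavier but non-overloaded one in either evaluation order, while two groups of the same class are still compared by average load.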
* Re: [PATCH] sched: make update_sd_pick_busiest return true on a busier sd 2014-07-25 15:27 ` Peter Zijlstra 2014-07-25 15:45 ` Rik van Riel @ 2014-07-27 23:57 ` Michael Neuling 1 sibling, 0 replies; 19+ messages in thread From: Michael Neuling @ 2014-07-27 23:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Rik van Riel, linux-kernel, mingo, pjt, jhladky, ktkhai, tim.c.chen, nicolas.pitre On Fri, 2014-07-25 at 17:27 +0200, Peter Zijlstra wrote: > On Tue, Jul 22, 2014 at 02:45:59PM -0400, Rik van Riel wrote: > > Currently update_sd_pick_busiest only returns true when an sd > > is overloaded, or for SD_ASYM_PACKING when a domain is busier > > than average and a higher numbered domain than the target. > > > > This breaks load balancing between domains that are not overloaded, > > in the !SD_ASYM_PACKING case. This patch makes update_sd_pick_busiest > > return true when the busiest sd yet is encountered. > > > > On a 4 node system, this seems to result in the load balancer finally > > putting 1 thread of a 4 thread test run of "perf bench numa mem" on > > each node, where before the load was generally not spread across all > > nodes. > > So for !ASYM the code is effectively: > > return sgs->avg_load > sds->busiest_stat.avg_load; > > I'd like to at least add a clause that makes overloaded groups > prioritized over !overloaded groups. > > Also, like we found earlier, calculate_imbalance() relies on the > sum_nr_running > group_capacity_factor thing, which you've just > 'wrecked', so we'd need an update to that part too. > > > Behaviour for SD_ASYM_PACKING does not seem to match the comment, > > in that groups with below average load average are ignored, but I > > have no hardware to test that so I have left the behaviour of that > > code unchanged. > > Mikey, does that stuff work as expected? Sorry for the slow response. v1 and v2 both pass testing on POWER7. So FWIW... 
Acked-By: Michael Neuling <mikey@neuling.org>

Mikey
end of thread, other threads:[~2014-07-28 15:05 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-07-22 18:45 [PATCH] sched: make update_sd_pick_busiest return true on a busier sd Rik van Riel 2014-07-23 7:41 ` Vincent Guittot 2014-07-25 13:33 ` Rik van Riel 2014-07-25 14:29 ` Vincent Guittot 2014-07-25 14:46 ` Rik van Riel 2014-07-25 14:02 ` Rik van Riel 2014-07-25 14:15 ` Peter Zijlstra 2014-07-25 15:02 ` Vincent Guittot 2014-07-25 15:13 ` Rik van Riel 2014-07-25 15:27 ` Peter Zijlstra 2014-07-25 15:45 ` Rik van Riel 2014-07-25 16:05 ` Peter Zijlstra 2014-07-25 16:22 ` Rik van Riel 2014-07-25 17:57 ` Vincent Guittot 2014-07-25 19:32 ` [PATCH v2] " Rik van Riel 2014-07-28 8:23 ` Vincent Guittot 2014-07-28 15:04 ` Rik van Riel 2014-07-28 14:30 ` Peter Zijlstra 2014-07-27 23:57 ` [PATCH] " Michael Neuling