* CFS group scheduler fairness broken starting from 2.6.29-rc1
@ 2009-07-23 7:57 Bharata B Rao
2009-07-23 22:17 ` Ken Chen
2009-07-27 12:09 ` Peter Zijlstra
0 siblings, 2 replies; 7+ messages in thread
From: Bharata B Rao @ 2009-07-23 7:57 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, Srivatsa Vaddagiri,
Ken Chen, Balbir Singh
Hi,
Group scheduler fainess is broken since 2.6.29-rc1. git bisect led me
to this commit:
commit ec4e0e2fe018992d980910db901637c814575914
Author: Ken Chen <kenchen@google.com>
Date: Tue Nov 18 22:41:57 2008 -0800
sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
Impact: make load-balancing more consistent
In the update_shares() path leading to tg_shares_up(), the calculation of
per-cpu cfs_rq shares is rather erratic even under moderate task wake up
rate. The problem is that the per-cpu tg->cfs_rq load weight used in the
sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
are collected at different time. Under moderate system load, we've seen
quite a bit of variation on the cfs_rq->shares and ultimately wildly
affects sched_entity's load weight.
This patch caches the result of initial per-cpu load weight when doing the
sum calculation, and then pass it down to update_group_shares_cpu() for
redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq
shares across all CPUs. It also simplifies the rounding and zero load
weight check.
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
======================================================================
% CPU time division b/n groups
Group 2.6.29-rc1 2.6.29-rc1 w/o the above patch
======================================================================
a with 8 tasks 44 31
b with 5 tasks 32 34
c with 3 tasks 22 34
======================================================================
All groups had equal shares.
Regards,
Bharata.
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1 2009-07-23 7:57 CFS group scheduler fairness broken starting from 2.6.29-rc1 Bharata B Rao @ 2009-07-23 22:17 ` Ken Chen 2009-07-24 4:30 ` Bharata B Rao 2009-07-27 12:09 ` Peter Zijlstra 1 sibling, 1 reply; 7+ messages in thread From: Ken Chen @ 2009-07-23 22:17 UTC (permalink / raw) To: bharata Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Dhaval Giani, Srivatsa Vaddagiri, Balbir Singh On Thu, Jul 23, 2009 at 12:57 AM, Bharata BRao<bharata@linux.vnet.ibm.com> wrote:> Hi,>> Group scheduler fainess is broken since 2.6.29-rc1. git bisect led me> to this commit:>> commit ec4e0e2fe018992d980910db901637c814575914> Author: Ken Chen <kenchen@google.com>> Date: Tue Nov 18 22:41:57 2008 -0800>> sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares>> Impact: make load-balancing more consistent> ....>> ======================================================================> % CPU time division b/n groups> Group 2.6.29-rc1 2.6.29-rc1 w/o the above patch> ======================================================================> a with 8 tasks 44 31> b with 5 tasks 32 34> c with 3 tasks 22 34> ======================================================================> All groups had equal shares. What value did you use for each task_group's share? For very largevalue of tg->shares, it could be that all of the boost went to one CPUand subsequently causes load-balancer to shuffle tasks around. Do yousee any unexpected task migration? - Kenÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1 2009-07-23 22:17 ` Ken Chen @ 2009-07-24 4:30 ` Bharata B Rao 0 siblings, 0 replies; 7+ messages in thread From: Bharata B Rao @ 2009-07-24 4:30 UTC (permalink / raw) To: Ken Chen Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Dhaval Giani, Srivatsa Vaddagiri, Balbir Singh On Thu, Jul 23, 2009 at 03:17:18PM -0700, Ken Chen wrote: > On Thu, Jul 23, 2009 at 12:57 AM, Bharata B > Rao<bharata@linux.vnet.ibm.com> wrote: > > Hi, > > > > Group scheduler fainess is broken since 2.6.29-rc1. git bisect led me > > to this commit: > > > > commit ec4e0e2fe018992d980910db901637c814575914 > > Author: Ken Chen <kenchen@google.com> > > Date: Tue Nov 18 22:41:57 2008 -0800 > > > > sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares > > > > Impact: make load-balancing more consistent > > .... > > > > ====================================================================== > > % CPU time division b/n groups > > Group 2.6.29-rc1 2.6.29-rc1 w/o the above patch > > ====================================================================== > > a with 8 tasks 44 31 > > b with 5 tasks 32 34 > > c with 3 tasks 22 34 > > ====================================================================== > > All groups had equal shares. > > What value did you use for each task_group's share? For very large > value of tg->shares, it could be that all of the boost went to one CPU > and subsequently causes load-balancer to shuffle tasks around. Do you > see any unexpected task migration? Used default 1024 for each group. Without your patch, each of the tasks see around 165 migrations during a 60s run, but with your patch, they see 125 migrations (as per se.nr_migrations). I am using a 8CPU machine here. Regards, Bharata. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1 2009-07-23 7:57 CFS group scheduler fairness broken starting from 2.6.29-rc1 Bharata B Rao 2009-07-23 22:17 ` Ken Chen @ 2009-07-27 12:09 ` Peter Zijlstra 2009-07-28 4:14 ` Bharata B Rao 2009-08-02 13:12 ` [tip:sched/core] sched: Fix cgroup smp fairness tip-bot for Peter Zijlstra 1 sibling, 2 replies; 7+ messages in thread From: Peter Zijlstra @ 2009-07-27 12:09 UTC (permalink / raw) To: bharata Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri, Ken Chen, Balbir Singh On Thu, 2009-07-23 at 13:27 +0530, Bharata B Rao wrote: > Hi, > > Group scheduler fainess is broken since 2.6.29-rc1. git bisect led me > to this commit: > > commit ec4e0e2fe018992d980910db901637c814575914 > Author: Ken Chen <kenchen@google.com> > Date: Tue Nov 18 22:41:57 2008 -0800 > > sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares > > Impact: make load-balancing more consistent > > In the update_shares() path leading to tg_shares_up(), the calculation of > per-cpu cfs_rq shares is rather erratic even under moderate task wake up > rate. The problem is that the per-cpu tg->cfs_rq load weight used in the > sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares > are collected at different time. Under moderate system load, we've seen > quite a bit of variation on the cfs_rq->shares and ultimately wildly > affects sched_entity's load weight. > > This patch caches the result of initial per-cpu load weight when doing the > sum calculation, and then pass it down to update_group_shares_cpu() for > redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq > shares across all CPUs. It also simplifies the rounding and zero load > weight check. > > Signed-off-by: Ken Chen <kenchen@google.com> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > Signed-off-by: Ingo Molnar <mingo@elte.hu> Right, I think I spotted the bug. Before this patch we would assign a non-0 share to empty cpu groups in order to avoid starvation cases. But we could not account that non-0 share into the shares sum of the sd on the next run. With this patch however we do. Which will create a skew which will only be corrected on the top level domain when we reach there. - tg->cfs_rq[cpu]->shares = boost ? 0 : shares; Is the logic that went missing. /me goes frob a patch together. How does the below work? Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 28 ++++++++++++++++++++-------- 1 file changed, 20 insertions(+), 8 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1523,13 +1523,18 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long sd_shares, unsigned long sd_rq_weight) { - unsigned long shares; unsigned long rq_weight; + unsigned long shares; + int boost = 0; if (!tg->se[cpu]) return; rq_weight = tg->cfs_rq[cpu]->rq_weight; + if (!rq_weight) { + boost = 1; + rq_weight = NICE_0_LOAD; + } /* * \Sum shares * rq_weight @@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_grou unsigned long flags; spin_lock_irqsave(&rq->lock, flags); - tg->cfs_rq[cpu]->shares = shares; - + tg->cfs_rq[cpu]->shares = boost ? 0 : shares; __set_se_shares(tg->se[cpu], shares); spin_unlock_irqrestore(&rq->lock, flags); } @@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_grou */ static int tg_shares_up(struct task_group *tg, void *data) { - unsigned long weight, rq_weight = 0; + unsigned long weight, rq_weight = 0, eff_weight = 0; unsigned long shares = 0; struct sched_domain *sd = data; int i; @@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_grou * run here it will not get delayed by group starvation. */ weight = tg->cfs_rq[i]->load.weight; + tg->cfs_rq[i]->rq_weight = weight; + rq_weight += weight; + if (!weight) weight = NICE_0_LOAD; - tg->cfs_rq[i]->rq_weight = weight; - rq_weight += weight; + eff_weight += weight; shares += tg->cfs_rq[i]->shares; } @@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_grou if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE)) shares = tg->shares; - for_each_cpu(i, sched_domain_span(sd)) - update_group_shares_cpu(tg, i, shares, rq_weight); + for_each_cpu(i, sched_domain_span(sd)) { + unsigned long sd_rq_weight = rq_weight; + + if (!tg->cfs_rq[i]->rq_weight) + sd_rq_weight = eff_weight; + + update_group_shares_cpu(tg, i, shares, sd_rq_weight); + } return 0; } ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1 2009-07-27 12:09 ` Peter Zijlstra @ 2009-07-28 4:14 ` Bharata B Rao 2009-07-28 7:28 ` Peter Zijlstra 2009-08-02 13:12 ` [tip:sched/core] sched: Fix cgroup smp fairness tip-bot for Peter Zijlstra 1 sibling, 1 reply; 7+ messages in thread From: Bharata B Rao @ 2009-07-28 4:14 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri, Ken Chen, Balbir Singh On Mon, Jul 27, 2009 at 02:09:17PM +0200, Peter Zijlstra wrote: > On Thu, 2009-07-23 at 13:27 +0530, Bharata B Rao wrote: > > Hi, > > > > Group scheduler fainess is broken since 2.6.29-rc1. git bisect led me > > to this commit: > > > > commit ec4e0e2fe018992d980910db901637c814575914 > > Author: Ken Chen <kenchen@google.com> > > Date: Tue Nov 18 22:41:57 2008 -0800 > > > > sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares > > > > Impact: make load-balancing more consistent > > > > In the update_shares() path leading to tg_shares_up(), the calculation of > > per-cpu cfs_rq shares is rather erratic even under moderate task wake up > > rate. The problem is that the per-cpu tg->cfs_rq load weight used in the > > sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares > > are collected at different time. Under moderate system load, we've seen > > quite a bit of variation on the cfs_rq->shares and ultimately wildly > > affects sched_entity's load weight. > > > > This patch caches the result of initial per-cpu load weight when doing the > > sum calculation, and then pass it down to update_group_shares_cpu() for > > redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq > > shares across all CPUs. It also simplifies the rounding and zero load > > weight check. > > > > Signed-off-by: Ken Chen <kenchen@google.com> > > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > Signed-off-by: Ingo Molnar <mingo@elte.hu> > > Right, I think I spotted the bug. > > Before this patch we would assign a non-0 share to empty cpu groups in > order to avoid starvation cases. But we could not account that non-0 > share into the shares sum of the sd on the next run. > > With this patch however we do. Which will create a skew which will only > be corrected on the top level domain when we reach there. > > - tg->cfs_rq[cpu]->shares = boost ? 0 : shares; > > Is the logic that went missing. > > /me goes frob a patch together. > > How does the below work? Restores the fairness values to that of 2.6.28. IOW, works fine. Regards, Bharata. > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > kernel/sched.c | 28 ++++++++++++++++++++-------- > 1 file changed, 20 insertions(+), 8 deletions(-) > > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -1523,13 +1523,18 @@ static void > update_group_shares_cpu(struct task_group *tg, int cpu, > unsigned long sd_shares, unsigned long sd_rq_weight) > { > - unsigned long shares; > unsigned long rq_weight; > + unsigned long shares; > + int boost = 0; > > if (!tg->se[cpu]) > return; > > rq_weight = tg->cfs_rq[cpu]->rq_weight; > + if (!rq_weight) { > + boost = 1; > + rq_weight = NICE_0_LOAD; > + } > > /* > * \Sum shares * rq_weight > @@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_grou > unsigned long flags; > > spin_lock_irqsave(&rq->lock, flags); > - tg->cfs_rq[cpu]->shares = shares; > - > + tg->cfs_rq[cpu]->shares = boost ? 0 : shares; > __set_se_shares(tg->se[cpu], shares); > spin_unlock_irqrestore(&rq->lock, flags); > } > @@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_grou > */ > static int tg_shares_up(struct task_group *tg, void *data) > { > - unsigned long weight, rq_weight = 0; > + unsigned long weight, rq_weight = 0, eff_weight = 0; > unsigned long shares = 0; > struct sched_domain *sd = data; > int i; > @@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_grou > * run here it will not get delayed by group starvation. > */ > weight = tg->cfs_rq[i]->load.weight; > + tg->cfs_rq[i]->rq_weight = weight; > + rq_weight += weight; > + > if (!weight) > weight = NICE_0_LOAD; > > - tg->cfs_rq[i]->rq_weight = weight; > - rq_weight += weight; > + eff_weight += weight; > shares += tg->cfs_rq[i]->shares; > } > > @@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_grou > if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE)) > shares = tg->shares; > > - for_each_cpu(i, sched_domain_span(sd)) > - update_group_shares_cpu(tg, i, shares, rq_weight); > + for_each_cpu(i, sched_domain_span(sd)) { > + unsigned long sd_rq_weight = rq_weight; > + > + if (!tg->cfs_rq[i]->rq_weight) > + sd_rq_weight = eff_weight; > + > + update_group_shares_cpu(tg, i, shares, sd_rq_weight); > + } > > return 0; > } > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1 2009-07-28 4:14 ` Bharata B Rao @ 2009-07-28 7:28 ` Peter Zijlstra 0 siblings, 0 replies; 7+ messages in thread From: Peter Zijlstra @ 2009-07-28 7:28 UTC (permalink / raw) To: bharata Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri, Ken Chen, Balbir Singh On Tue, 2009-07-28 at 09:44 +0530, Bharata B Rao wrote: > > How does the below work? > > Restores the fairness values to that of 2.6.28. IOW, works fine. That's pretty good for a patch I merely compile tested ;-) I'll merge this, unless someone (Ken?) can convince me there's something wrong with it ;-) ^ permalink raw reply [flat|nested] 7+ messages in thread
* [tip:sched/core] sched: Fix cgroup smp fairness 2009-07-27 12:09 ` Peter Zijlstra 2009-07-28 4:14 ` Bharata B Rao @ 2009-08-02 13:12 ` tip-bot for Peter Zijlstra 1 sibling, 0 replies; 7+ messages in thread From: tip-bot for Peter Zijlstra @ 2009-08-02 13:12 UTC (permalink / raw) To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo Commit-ID: a5004278f0525dcb9aa43703ef77bf371ea837cd Gitweb: http://git.kernel.org/tip/a5004278f0525dcb9aa43703ef77bf371ea837cd Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Mon, 27 Jul 2009 14:04:49 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Sun, 2 Aug 2009 14:26:06 +0200 sched: Fix cgroup smp fairness Commit ec4e0e2fe018992d980910db901637c814575914 ("fix inconsistency when redistribute per-cpu tg->cfs_rq shares") broke cgroup smp fairness. In order to avoid starvation of newly placed tasks, we never quite set the share of an empty cpu group-task to 0, but instead we set it as if there's a single NICE-0 task present. If however we actually set this in cfs_rq[cpu]->shares, that means the total shares for that group will be slightly inflated every time we balance, causing the observed unfairness. Fix this by setting cfs_rq[cpu]->shares to 0 but actually setting the effective weight of the related se to the inflated number. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1248696557.6987.1615.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 28 ++++++++++++++++++++-------- 1 files changed, 20 insertions(+), 8 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index ce1056e..26976cd 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1523,13 +1523,18 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long sd_shares, unsigned long sd_rq_weight) { - unsigned long shares; unsigned long rq_weight; + unsigned long shares; + int boost = 0; if (!tg->se[cpu]) return; rq_weight = tg->cfs_rq[cpu]->rq_weight; + if (!rq_weight) { + boost = 1; + rq_weight = NICE_0_LOAD; + } /* * \Sum shares * rq_weight @@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu, unsigned long flags; spin_lock_irqsave(&rq->lock, flags); - tg->cfs_rq[cpu]->shares = shares; - + tg->cfs_rq[cpu]->shares = boost ? 0 : shares; __set_se_shares(tg->se[cpu], shares); spin_unlock_irqrestore(&rq->lock, flags); } @@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu, */ static int tg_shares_up(struct task_group *tg, void *data) { - unsigned long weight, rq_weight = 0; + unsigned long weight, rq_weight = 0, eff_weight = 0; unsigned long shares = 0; struct sched_domain *sd = data; int i; @@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_group *tg, void *data) * run here it will not get delayed by group starvation. */ weight = tg->cfs_rq[i]->load.weight; + tg->cfs_rq[i]->rq_weight = weight; + rq_weight += weight; + if (!weight) weight = NICE_0_LOAD; - tg->cfs_rq[i]->rq_weight = weight; - rq_weight += weight; + eff_weight += weight; shares += tg->cfs_rq[i]->shares; } @@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_group *tg, void *data) if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE)) shares = tg->shares; - for_each_cpu(i, sched_domain_span(sd)) - update_group_shares_cpu(tg, i, shares, rq_weight); + for_each_cpu(i, sched_domain_span(sd)) { + unsigned long sd_rq_weight = rq_weight; + + if (!tg->cfs_rq[i]->rq_weight) + sd_rq_weight = eff_weight; + + update_group_shares_cpu(tg, i, shares, sd_rq_weight); + } return 0; } ^ permalink raw reply related [flat|nested] 7+ messages in thread
end of thread, other threads:[~2009-08-02 13:13 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-07-23 7:57 CFS group scheduler fairness broken starting from 2.6.29-rc1 Bharata B Rao 2009-07-23 22:17 ` Ken Chen 2009-07-24 4:30 ` Bharata B Rao 2009-07-27 12:09 ` Peter Zijlstra 2009-07-28 4:14 ` Bharata B Rao 2009-07-28 7:28 ` Peter Zijlstra 2009-08-02 13:12 ` [tip:sched/core] sched: Fix cgroup smp fairness tip-bot for Peter Zijlstra
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox