* CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Bharata B Rao @ 2009-07-23 7:57 UTC
To: linux-kernel
Cc: Ingo Molnar, Peter Zijlstra, Dhaval Giani, Srivatsa Vaddagiri,
Ken Chen, Balbir Singh
Hi,
Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
to this commit:
commit ec4e0e2fe018992d980910db901637c814575914
Author: Ken Chen <kenchen@google.com>
Date: Tue Nov 18 22:41:57 2008 -0800
sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
Impact: make load-balancing more consistent
In the update_shares() path leading to tg_shares_up(), the calculation of
per-cpu cfs_rq shares is rather erratic even under a moderate task wake-up
rate. The problem is that the per-cpu tg->cfs_rq load weight used in the
sd_rq_weight aggregation and the actual redistribution of the cfs_rq->shares
are collected at different times. Under moderate system load, we've seen
quite a bit of variation in the cfs_rq->shares, which ultimately wildly
affects the sched_entity's load weight.

This patch caches the result of the initial per-cpu load weight when doing the
sum calculation, and then passes it down to update_group_shares_cpu() for
redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq
shares across all CPUs. It also simplifies the rounding and zero-load-weight
check.
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
======================================================================
                % CPU time division between groups
Group             2.6.29-rc1    2.6.29-rc1 w/o the above patch
======================================================================
a with 8 tasks        44                      31
b with 5 tasks        32                      34
c with 3 tasks        22                      34
======================================================================
All groups had equal shares.
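In case anyone wants to reproduce this, a sketch of how such a setup can
be driven through the cgroup filesystem from C (the /cgroup mount point
and group layout here are illustrative assumptions, not details from the
measured runs):

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/* assumes the cpu controller is mounted:
 *   mount -t cgroup -o cpu none /cgroup
 */
static int make_group(const char *name, unsigned long shares)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/cgroup/%s", name);
	if (mkdir(path, 0755) && errno != EEXIST)
		return -1;

	snprintf(path, sizeof(path), "/cgroup/%s/cpu.shares", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lu\n", shares);
	return fclose(f);
}

int main(void)
{
	/* equal (default) shares for all three groups */
	make_group("a", 1024);
	make_group("b", 1024);
	make_group("c", 1024);
	/* cpu hogs are then attached by writing their pids to
	 * /cgroup/<group>/tasks before sampling cpu time */
	return 0;
}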
Regards,
Bharata.
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Ken Chen @ 2009-07-23 22:17 UTC
To: bharata
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Dhaval Giani,
Srivatsa Vaddagiri, Balbir Singh
On Thu, Jul 23, 2009 at 12:57 AM, Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> Hi,
>
> Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
> to this commit:
>
> commit ec4e0e2fe018992d980910db901637c814575914
> Author: Ken Chen <kenchen@google.com>
> Date: Tue Nov 18 22:41:57 2008 -0800
>
> sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
>
> Impact: make load-balancing more consistent
> ....
>
> ======================================================================
>                 % CPU time division between groups
> Group             2.6.29-rc1    2.6.29-rc1 w/o the above patch
> ======================================================================
> a with 8 tasks        44                      31
> b with 5 tasks        32                      34
> c with 3 tasks        22                      34
> ======================================================================
> All groups had equal shares.
What value did you use for each task_group's share? For a very large
value of tg->shares, it could be that all of the boost went to one CPU
and subsequently causes the load-balancer to shuffle tasks around. Do you
see any unexpected task migration?
- Ken
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Bharata B Rao @ 2009-07-24 4:30 UTC
To: Ken Chen
Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Dhaval Giani,
Srivatsa Vaddagiri, Balbir Singh
On Thu, Jul 23, 2009 at 03:17:18PM -0700, Ken Chen wrote:
> On Thu, Jul 23, 2009 at 12:57 AM, Bharata B Rao <bharata@linux.vnet.ibm.com> wrote:
> > Hi,
> >
> > Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
> > to this commit:
> >
> > commit ec4e0e2fe018992d980910db901637c814575914
> > Author: Ken Chen <kenchen@google.com>
> > Date: Tue Nov 18 22:41:57 2008 -0800
> >
> > sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
> >
> > Impact: make load-balancing more consistent
> > ....
> >
> > ======================================================================
> >                 % CPU time division between groups
> > Group             2.6.29-rc1    2.6.29-rc1 w/o the above patch
> > ======================================================================
> > a with 8 tasks        44                      31
> > b with 5 tasks        32                      34
> > c with 3 tasks        22                      34
> > ======================================================================
> > All groups had equal shares.
>
> What value did you use for each task_group's share? For a very large
> value of tg->shares, it could be that all of the boost went to one CPU
> and subsequently causes the load-balancer to shuffle tasks around. Do you
> see any unexpected task migration?
I used the default 1024 for each group.

Without your patch, each of the tasks sees around 165 migrations during
a 60s run, but with your patch they see around 125 migrations (as per
se.nr_migrations). I am using an 8-CPU machine here.
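For reference, a minimal sketch of how that counter can be sampled
(assuming a kernel with CONFIG_SCHED_DEBUG, which exposes
se.nr_migrations in /proc/<pid>/sched):

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	if (argc < 2)
		return 1;

	/* usage: ./nr_migrations <pid> */
	snprintf(path, sizeof(path), "/proc/%s/sched", argv[1]);
	f = fopen(path, "r");
	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f))
		if (strstr(line, "se.nr_migrations"))
			fputs(line, stdout);

	fclose(f);
	return 0;
}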
Regards,
Bharata.
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Peter Zijlstra @ 2009-07-27 12:09 UTC
To: bharata
Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri,
Ken Chen, Balbir Singh
On Thu, 2009-07-23 at 13:27 +0530, Bharata B Rao wrote:
> Hi,
>
> Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
> to this commit:
>
> commit ec4e0e2fe018992d980910db901637c814575914
> Author: Ken Chen <kenchen@google.com>
> Date: Tue Nov 18 22:41:57 2008 -0800
>
> sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
>
> Impact: make load-balancing more consistent
>
> In the update_shares() path leading to tg_shares_up(), the calculation of
> per-cpu cfs_rq shares is rather erratic even under a moderate task wake-up
> rate. The problem is that the per-cpu tg->cfs_rq load weight used in the
> sd_rq_weight aggregation and the actual redistribution of the cfs_rq->shares
> are collected at different times. Under moderate system load, we've seen
> quite a bit of variation in the cfs_rq->shares, which ultimately wildly
> affects the sched_entity's load weight.
>
> This patch caches the result of the initial per-cpu load weight when doing the
> sum calculation, and then passes it down to update_group_shares_cpu() for
> redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq
> shares across all CPUs. It also simplifies the rounding and zero-load-weight
> check.
>
> Signed-off-by: Ken Chen <kenchen@google.com>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Right, I think I spotted the bug.
Before this patch we would assign a non-0 share to empty cpu groups in
order to avoid starvation cases. But we could not account that non-0
share into the shares sum of the sd on the next run.
With this patch however we do, which creates a skew that will only be
corrected at the top-level domain once we get there.
- tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
Is the logic that went missing.
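A back-of-envelope illustration of the resulting skew (my notation,
assuming one nice-0 task per busy cpu): with $S$ = tg->shares, $L$ =
NICE_0_LOAD, $W$ = summed weight of the busy cpus, and $k$ idle cpus
whose boost leaks into the sum, a busy cpu ends up with

$$ s_{busy} = \frac{S \cdot L}{W + kL} \quad\text{instead of}\quad \frac{S \cdot L}{W}. $$

For group c in the report (3 busy cpus, k = 5, S = L, W = 3L) that is
128 per busy cpu (an effective group weight of 384 out of 1024) instead
of ~341. The effective weights work out to 1024:640:384 for a:b:c, the
same shape as the measured 44:32:22 split.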
/me goes frob a patch together.
How does the below work?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 28 ++++++++++++++++++++--------
1 file changed, 20 insertions(+), 8 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1523,13 +1523,18 @@ static void
update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares, unsigned long sd_rq_weight)
{
- unsigned long shares;
unsigned long rq_weight;
+ unsigned long shares;
+ int boost = 0;
if (!tg->se[cpu])
return;
rq_weight = tg->cfs_rq[cpu]->rq_weight;
+ if (!rq_weight) {
+ boost = 1;
+ rq_weight = NICE_0_LOAD;
+ }
/*
* \Sum shares * rq_weight
@@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_grou
unsigned long flags;
spin_lock_irqsave(&rq->lock, flags);
- tg->cfs_rq[cpu]->shares = shares;
-
+ tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
__set_se_shares(tg->se[cpu], shares);
spin_unlock_irqrestore(&rq->lock, flags);
}
@@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_grou
*/
static int tg_shares_up(struct task_group *tg, void *data)
{
- unsigned long weight, rq_weight = 0;
+ unsigned long weight, rq_weight = 0, eff_weight = 0;
unsigned long shares = 0;
struct sched_domain *sd = data;
int i;
@@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_grou
* run here it will not get delayed by group starvation.
*/
weight = tg->cfs_rq[i]->load.weight;
+ tg->cfs_rq[i]->rq_weight = weight;
+ rq_weight += weight;
+
if (!weight)
weight = NICE_0_LOAD;
- tg->cfs_rq[i]->rq_weight = weight;
- rq_weight += weight;
+ eff_weight += weight;
shares += tg->cfs_rq[i]->shares;
}
@@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_grou
if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
shares = tg->shares;
- for_each_cpu(i, sched_domain_span(sd))
- update_group_shares_cpu(tg, i, shares, rq_weight);
+ for_each_cpu(i, sched_domain_span(sd)) {
+ unsigned long sd_rq_weight = rq_weight;
+
+ if (!tg->cfs_rq[i]->rq_weight)
+ sd_rq_weight = eff_weight;
+
+ update_group_shares_cpu(tg, i, shares, sd_rq_weight);
+ }
return 0;
}
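To see the same numbers concretely, here is a minimal userspace sketch
(mine, not kernel code) of the redistribution arithmetic, reduced to
shares = sd_shares * rq_weight / sd_rq_weight and ignoring the clamping
and update threshold of the real code; it models group "c" from the
report on an 8-cpu box:

#include <stdio.h>

#define NICE_0_LOAD	1024UL
#define NR_CPUS		8

int main(void)
{
	unsigned long tg_shares = 1024;	/* default group share */
	/* one nice-0 task on each of three cpus, five cpus idle */
	unsigned long rq_weight[NR_CPUS] = { 1024, 1024, 1024, 0, 0, 0, 0, 0 };
	unsigned long sum = 0, eff_sum = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++) {
		sum += rq_weight[i];	/* busy weight only */
		eff_sum += rq_weight[i] ? rq_weight[i] : NICE_0_LOAD;
	}

	for (i = 0; i < NR_CPUS; i++) {
		int boost = !rq_weight[i];
		unsigned long w = boost ? NICE_0_LOAD : rq_weight[i];

		/* pre-fix: everything weighed against the boosted sum,
		 * and the result fed back into cfs_rq->shares */
		unsigned long broken = tg_shares * w / eff_sum;

		/* with the fix: busy cpus weigh against the real sum;
		 * idle cpus keep a boosted se weight but record 0 */
		unsigned long se = tg_shares * w / (boost ? eff_sum : sum);
		unsigned long cfs = boost ? 0 : se;

		printf("cpu%d: broken=%4lu fixed: se=%4lu cfs_rq=%4lu\n",
		       i, broken, se, cfs);
	}
	return 0;
}

The busy cpus go from 128 each (with the rest of the group's weight
parked on idle cpus) to ~341 each, i.e. the full 1024 lands where tasks
actually run, which matches the fairness numbers coming back.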
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Bharata B Rao @ 2009-07-28 4:14 UTC
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri,
Ken Chen, Balbir Singh
On Mon, Jul 27, 2009 at 02:09:17PM +0200, Peter Zijlstra wrote:
> On Thu, 2009-07-23 at 13:27 +0530, Bharata B Rao wrote:
> > Hi,
> >
> > Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
> > to this commit:
> >
> > commit ec4e0e2fe018992d980910db901637c814575914
> > Author: Ken Chen <kenchen@google.com>
> > Date: Tue Nov 18 22:41:57 2008 -0800
> >
> > sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
> >
> > Impact: make load-balancing more consistent
> >
> > In the update_shares() path leading to tg_shares_up(), the calculation of
> > per-cpu cfs_rq shares is rather erratic even under a moderate task wake-up
> > rate. The problem is that the per-cpu tg->cfs_rq load weight used in the
> > sd_rq_weight aggregation and the actual redistribution of the cfs_rq->shares
> > are collected at different times. Under moderate system load, we've seen
> > quite a bit of variation in the cfs_rq->shares, which ultimately wildly
> > affects the sched_entity's load weight.
> >
> > This patch caches the result of the initial per-cpu load weight when doing the
> > sum calculation, and then passes it down to update_group_shares_cpu() for
> > redistributing per-cpu cfs_rq shares. This allows consistent total cfs_rq
> > shares across all CPUs. It also simplifies the rounding and zero-load-weight
> > check.
> >
> > Signed-off-by: Ken Chen <kenchen@google.com>
> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
>
> Right, I think I spotted the bug.
>
> Before this patch we would assign a non-0 share to empty cpu groups in
> order to avoid starvation cases. But we could not account that non-0
> share into the shares sum of the sd on the next run.
>
> With this patch however we do, which creates a skew that will only be
> corrected at the top-level domain once we get there.
>
> - tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
>
> Is the logic that went missing.
>
> /me goes frob a patch together.
>
> How does the below work?
Restores the fairness values to those of 2.6.28. IOW, works fine.
Regards,
Bharata.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> kernel/sched.c | 28 ++++++++++++++++++++--------
> 1 file changed, 20 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -1523,13 +1523,18 @@ static void
> update_group_shares_cpu(struct task_group *tg, int cpu,
> unsigned long sd_shares, unsigned long sd_rq_weight)
> {
> - unsigned long shares;
> unsigned long rq_weight;
> + unsigned long shares;
> + int boost = 0;
>
> if (!tg->se[cpu])
> return;
>
> rq_weight = tg->cfs_rq[cpu]->rq_weight;
> + if (!rq_weight) {
> + boost = 1;
> + rq_weight = NICE_0_LOAD;
> + }
>
> /*
> * \Sum shares * rq_weight
> @@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_grou
> unsigned long flags;
>
> spin_lock_irqsave(&rq->lock, flags);
> - tg->cfs_rq[cpu]->shares = shares;
> -
> + tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
> __set_se_shares(tg->se[cpu], shares);
> spin_unlock_irqrestore(&rq->lock, flags);
> }
> @@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_grou
> */
> static int tg_shares_up(struct task_group *tg, void *data)
> {
> - unsigned long weight, rq_weight = 0;
> + unsigned long weight, rq_weight = 0, eff_weight = 0;
> unsigned long shares = 0;
> struct sched_domain *sd = data;
> int i;
> @@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_grou
> * run here it will not get delayed by group starvation.
> */
> weight = tg->cfs_rq[i]->load.weight;
> + tg->cfs_rq[i]->rq_weight = weight;
> + rq_weight += weight;
> +
> if (!weight)
> weight = NICE_0_LOAD;
>
> - tg->cfs_rq[i]->rq_weight = weight;
> - rq_weight += weight;
> + eff_weight += weight;
> shares += tg->cfs_rq[i]->shares;
> }
>
> @@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_grou
> if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
> shares = tg->shares;
>
> - for_each_cpu(i, sched_domain_span(sd))
> - update_group_shares_cpu(tg, i, shares, rq_weight);
> + for_each_cpu(i, sched_domain_span(sd)) {
> + unsigned long sd_rq_weight = rq_weight;
> +
> + if (!tg->cfs_rq[i]->rq_weight)
> + sd_rq_weight = eff_weight;
> +
> + update_group_shares_cpu(tg, i, shares, sd_rq_weight);
> + }
>
> return 0;
> }
>
* Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
From: Peter Zijlstra @ 2009-07-28 7:28 UTC
To: bharata
Cc: linux-kernel, Ingo Molnar, Dhaval Giani, Srivatsa Vaddagiri,
Ken Chen, Balbir Singh
On Tue, 2009-07-28 at 09:44 +0530, Bharata B Rao wrote:
> > How does the below work?
>
> Restores the fairness values to that of 2.6.28. IOW, works fine.
That's pretty good for a patch I merely compile tested ;-)
I'll merge this, unless someone (Ken?) can convince me there's something
wrong with it ;-)
* [tip:sched/core] sched: Fix cgroup smp fairness
From: tip-bot for Peter Zijlstra @ 2009-08-02 13:12 UTC
To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo
Commit-ID: a5004278f0525dcb9aa43703ef77bf371ea837cd
Gitweb: http://git.kernel.org/tip/a5004278f0525dcb9aa43703ef77bf371ea837cd
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 27 Jul 2009 14:04:49 +0200
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Sun, 2 Aug 2009 14:26:06 +0200
sched: Fix cgroup smp fairness
Commit ec4e0e2fe018992d980910db901637c814575914 ("fix
inconsistency when redistribute per-cpu tg->cfs_rq shares")
broke cgroup smp fairness.
In order to avoid starvation of newly placed tasks, we never
quite set the share of an empty cpu group-task to 0, but
instead we set it as if there's a single NICE-0 task present.
If however we actually set this in cfs_rq[cpu]->shares, that
means the total shares for that group will be slightly inflated
every time we balance, causing the observed unfairness.
Fix this by setting cfs_rq[cpu]->shares to 0 but actually
setting the effective weight of the related se to the inflated
number.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1248696557.6987.1615.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 28 ++++++++++++++++++++--------
1 files changed, 20 insertions(+), 8 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index ce1056e..26976cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1523,13 +1523,18 @@ static void
update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long sd_shares, unsigned long sd_rq_weight)
{
- unsigned long shares;
unsigned long rq_weight;
+ unsigned long shares;
+ int boost = 0;
if (!tg->se[cpu])
return;
rq_weight = tg->cfs_rq[cpu]->rq_weight;
+ if (!rq_weight) {
+ boost = 1;
+ rq_weight = NICE_0_LOAD;
+ }
/*
* \Sum shares * rq_weight
@@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
unsigned long flags;
spin_lock_irqsave(&rq->lock, flags);
- tg->cfs_rq[cpu]->shares = shares;
-
+ tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
__set_se_shares(tg->se[cpu], shares);
spin_unlock_irqrestore(&rq->lock, flags);
}
@@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_group *tg, int cpu,
*/
static int tg_shares_up(struct task_group *tg, void *data)
{
- unsigned long weight, rq_weight = 0;
+ unsigned long weight, rq_weight = 0, eff_weight = 0;
unsigned long shares = 0;
struct sched_domain *sd = data;
int i;
@@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_group *tg, void *data)
* run here it will not get delayed by group starvation.
*/
weight = tg->cfs_rq[i]->load.weight;
+ tg->cfs_rq[i]->rq_weight = weight;
+ rq_weight += weight;
+
if (!weight)
weight = NICE_0_LOAD;
- tg->cfs_rq[i]->rq_weight = weight;
- rq_weight += weight;
+ eff_weight += weight;
shares += tg->cfs_rq[i]->shares;
}
@@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
shares = tg->shares;
- for_each_cpu(i, sched_domain_span(sd))
- update_group_shares_cpu(tg, i, shares, rq_weight);
+ for_each_cpu(i, sched_domain_span(sd)) {
+ unsigned long sd_rq_weight = rq_weight;
+
+ if (!tg->cfs_rq[i]->rq_weight)
+ sd_rq_weight = eff_weight;
+
+ update_group_shares_cpu(tg, i, shares, sd_rq_weight);
+ }
return 0;
}