Re: CFS group scheduler fairness broken starting from 2.6.29-rc1

From: Peter Zijlstra <a.p.zijlstra@chello.nl>
To: bharata@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
	Dhaval Giani <dhaval@linux.vnet.ibm.com>,
	Srivatsa Vaddagiri <vatsa@in.ibm.com>,
	Ken Chen <kenchen@google.com>,
	Balbir Singh <balbir@linux.vnet.ibm.com>
Subject: Re: CFS group scheduler fairness broken starting from 2.6.29-rc1
Date: Mon, 27 Jul 2009 14:09:17 +0200
Message-ID: <1248696557.6987.1615.camel@twins>
In-Reply-To: <20090723075735.GA18878@in.ibm.com>

On Thu, 2009-07-23 at 13:27 +0530, Bharata B Rao wrote:
> Hi,
> 
> Group scheduler fairness is broken since 2.6.29-rc1. git bisect led me
> to this commit:
> 
> commit ec4e0e2fe018992d980910db901637c814575914
> Author: Ken Chen <kenchen@google.com>
> Date:   Tue Nov 18 22:41:57 2008 -0800
> 
>     sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares
>     
>     Impact: make load-balancing more consistent
>     
>     In the update_shares() path leading to tg_shares_up(), the calculation of
>     per-cpu cfs_rq shares is rather erratic even under moderate task wake up
>     rate.  The problem is that the per-cpu tg->cfs_rq load weight used in the
>     sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
>     are collected at different time.  Under moderate system load, we've seen
>     quite a bit of variation on the cfs_rq->shares and ultimately wildly
>     affects sched_entity's load weight.
>     
>     This patch caches the result of initial per-cpu load weight when doing the
>     sum calculation, and then pass it down to update_group_shares_cpu() for
>     redistributing per-cpu cfs_rq shares.  This allows consistent total cfs_rq
>     shares across all CPUs. It also simplifies the rounding and zero load
>     weight check.
>     
>     Signed-off-by: Ken Chen <kenchen@google.com>
>     Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>     Signed-off-by: Ingo Molnar <mingo@elte.hu>

Right, I think I spotted the bug.

Before this patch we would assign a non-0 share to empty cpu groups in
order to avoid starvation cases, but we would not account that non-0
share into the shares sum of the sd on the next run.

With this patch, however, we do, which creates a skew that is only
corrected at the top-level domain, once we reach it.

-               tg->cfs_rq[cpu]->shares = boost ? 0 : shares;

Is the logic that went missing.

/me goes frob a patch together.

How does the below work?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1523,13 +1523,18 @@ static void
 update_group_shares_cpu(struct task_group *tg, int cpu,
 			unsigned long sd_shares, unsigned long sd_rq_weight)
 {
-	unsigned long shares;
 	unsigned long rq_weight;
+	unsigned long shares;
+	int boost = 0;
 
 	if (!tg->se[cpu])
 		return;
 
 	rq_weight = tg->cfs_rq[cpu]->rq_weight;
+	if (!rq_weight) {
+		boost = 1;
+		rq_weight = NICE_0_LOAD;
+	}
 
 	/*
 	 *           \Sum shares * rq_weight
@@ -1546,8 +1551,7 @@ update_group_shares_cpu(struct task_grou
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		tg->cfs_rq[cpu]->shares = shares;
-
+		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
 	}
@@ -1560,7 +1564,7 @@ update_group_shares_cpu(struct task_grou
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0;
+	unsigned long weight, rq_weight = 0, eff_weight = 0;
 	unsigned long shares = 0;
 	struct sched_domain *sd = data;
 	int i;
@@ -1572,11 +1576,13 @@ static int tg_shares_up(struct task_grou
 		 * run here it will not get delayed by group starvation.
 		 */
 		weight = tg->cfs_rq[i]->load.weight;
+		tg->cfs_rq[i]->rq_weight = weight;
+		rq_weight += weight;
+
 		if (!weight)
 			weight = NICE_0_LOAD;
 
-		tg->cfs_rq[i]->rq_weight = weight;
-		rq_weight += weight;
+		eff_weight += weight;
 		shares += tg->cfs_rq[i]->shares;
 	}
 
@@ -1586,8 +1592,14 @@ static int tg_shares_up(struct task_grou
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
-	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight);
+	for_each_cpu(i, sched_domain_span(sd)) {
+		unsigned long sd_rq_weight = rq_weight;
+
+		if (!tg->cfs_rq[i]->rq_weight)
+			sd_rq_weight = eff_weight;
+
+		update_group_shares_cpu(tg, i, shares, sd_rq_weight);
+	}
 
 	return 0;
 }

Thread overview: 7+ messages
2009-07-23  7:57 CFS group scheduler fairness broken starting from 2.6.29-rc1 Bharata B Rao
2009-07-23 22:17 ` Ken Chen
2009-07-24  4:30   ` Bharata B Rao
2009-07-27 12:09 ` Peter Zijlstra [this message]
2009-07-28  4:14   ` Bharata B Rao
2009-07-28  7:28     ` Peter Zijlstra
2009-08-02 13:12   ` [tip:sched/core] sched: Fix cgroup smp fairness tip-bot for Peter Zijlstra