[patch] sched: fix uneven per-cpu task

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch] sched: fix uneven per-cpu task_group share distribution
@ 2008-12-16  7:37 Ken Chen
  2008-12-17  8:20 ` Ken Chen
  2008-12-17  8:38 ` Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Ken Chen @ 2008-12-16  7:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux Kernel Mailing List

While testing CFS scheduler on linux-2.6-tip tree
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

We found that task which is pinned to a CPU could be starved relative to its
allocated fair share.

The per-cpu sched_enetity load share calculation in tg_shares_up /
update_group_shares_cpu distributes total task_group's share among all CPUs
for a given SD domain, this would dilute task_group's per-cpu share because
it distributes share to CPU that even has no load.  The trapped share is now
un-consumable and it leads to fair share starvation on the runnable CPU.
Peter was right that it is still required for the low level function to make
distinction between a boosted share that don't have any load and actual tg
share that should be distributed among CPUs in which the tg is running.

Patch to add that boost and we think the scheduler should only boost one
times of tg shares over all empty CPU that don't have any load for the
specific task_group in order to bound maximum temporary boost that a given
task_group can have.


Signed-off-by: Ken Chen <kenchen@google.com>


diff --git a/kernel/sched.c b/kernel/sched.c
index 1d673a9..7198a26 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1465,24 +1465,34 @@ static void __set_se_shares(
  * Calculate and set the cpu's group shares.
  */
 static void
-update_group_shares_cpu(struct task_group *tg, int cpu,
+update_group_shares_cpu(struct task_group *tg, int cpu, int empty,
 			unsigned long sd_shares, unsigned long sd_rq_weight)
 {
-	unsigned long shares;
+	unsigned long shares, raw_shares;
 	unsigned long rq_weight;

 	if (!tg->se[cpu])
 		return;

 	rq_weight = tg->cfs_rq[cpu]->rq_weight;
-
-	/*
-	 *           \Sum shares * rq_weight
-	 * shares =  -----------------------
-	 *               \Sum rq_weight
-	 *
-	 */
-	shares = (sd_shares * rq_weight) / sd_rq_weight;
+	if (rq_weight) {
+		/*
+		 *           \Sum shares * rq_weight
+		 * shares =  -----------------------
+		 *               \Sum rq_weight
+		 *
+		 */
+		raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
+		shares = raw_shares;
+	} else {
+		/*
+		 * If there are currently no tasks on the cpu pretend there
+		 * is one of average load so that when a new task gets to
+		 * run here it will not get delayed by group starvation.
+		 */
+		raw_shares = 0;
+		shares = sd_shares / empty;
+	}
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);

 	if (abs(shares - tg->se[cpu]->load.weight) >
@@ -1491,7 +1501,7 @@ update_group_shares_cpu(
 		unsigned long flags;

 		spin_lock_irqsave(&rq->lock, flags);
-		tg->cfs_rq[cpu]->shares = shares;
+		tg->cfs_rq[cpu]->shares = raw_shares;

 		__set_se_shares(tg->se[cpu], shares);
 		spin_unlock_irqrestore(&rq->lock, flags);
@@ -1508,17 +1518,12 @@ static int tg_shares_up(
 	unsigned long weight, rq_weight = 0;
 	unsigned long shares = 0;
 	struct sched_domain *sd = data;
-	int i;
+	int i, empty = 0;

 	for_each_cpu(i, sched_domain_span(sd)) {
-		/*
-		 * If there are currently no tasks on the cpu pretend there
-		 * is one of average load so that when a new task gets to
-		 * run here it will not get delayed by group starvation.
-		 */
 		weight = tg->cfs_rq[i]->load.weight;
 		if (!weight)
-			weight = NICE_0_LOAD;
+			empty++;

 		tg->cfs_rq[i]->rq_weight = weight;
 		rq_weight += weight;
@@ -1532,7 +1537,7 @@ static int tg_shares_up(
 		shares = tg->shares;

 	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight);
+		update_group_shares_cpu(tg, i, empty, shares, rq_weight);

 	return 0;
 }

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [patch] sched: fix uneven per-cpu task_group share distribution
  2008-12-16  7:37 [patch] sched: fix uneven per-cpu task_group share distribution Ken Chen
@ 2008-12-17  8:20 ` Ken Chen
  2008-12-17  8:34   ` Peter Zijlstra
  2008-12-17  8:38 ` Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Ken Chen @ 2008-12-17  8:20 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra; +Cc: Linux Kernel Mailing List

On Mon, Dec 15, 2008 at 11:37 PM, Ken Chen <kenchen@google.com> wrote:
> While testing CFS scheduler on linux-2.6-tip tree
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
>
> We found that task which is pinned to a CPU could be starved relative to its
> allocated fair share.

I just want to poke you two to see if you have time to look over this
patch.  There is a real bug in linux-2.6-tip tree.

If I create a CFS cgroup under directory .../parent/cgdir, put one
task under 'parent' with CFS weight 1024, and one task under 'cgdir'
(cgdir has group weight of 1024 also), pin both tasks onto same CPU.
The CPU cycle allocation on these two tasks will be N:1 where N is
number of CPUs in the system.  The expected allocation should be 1:1.

- Ken

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] sched: fix uneven per-cpu task_group share distribution
  2008-12-17  8:20 ` Ken Chen
@ 2008-12-17  8:34   ` Peter Zijlstra
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Zijlstra @ 2008-12-17  8:34 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Wed, 2008-12-17 at 00:20 -0800, Ken Chen wrote:
> On Mon, Dec 15, 2008 at 11:37 PM, Ken Chen <kenchen@google.com> wrote:
> > While testing CFS scheduler on linux-2.6-tip tree
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
> >
> > We found that task which is pinned to a CPU could be starved relative to its
> > allocated fair share.
> 
> I just want to poke you two to see if you have time to look over this
> patch.  There is a real bug in linux-2.6-tip tree.
> 
> If I create a CFS cgroup under directory .../parent/cgdir, put one
> task under 'parent' with CFS weight 1024, and one task under 'cgdir'
> (cgdir has group weight of 1024 also), pin both tasks onto same CPU.
> The CPU cycle allocation on these two tasks will be N:1 where N is
> number of CPUs in the system.  The expected allocation should be 1:1.

right - it was on the todo list, let me stare at it a little..


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] sched: fix uneven per-cpu task_group share distribution
  2008-12-16  7:37 [patch] sched: fix uneven per-cpu task_group share distribution Ken Chen
  2008-12-17  8:20 ` Ken Chen
@ 2008-12-17  8:38 ` Peter Zijlstra
  2008-12-18 13:42   ` Ingo Molnar
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2008-12-17  8:38 UTC (permalink / raw)
  To: Ken Chen; +Cc: Ingo Molnar, Linux Kernel Mailing List

On Mon, 2008-12-15 at 23:37 -0800, Ken Chen wrote:
> While testing CFS scheduler on linux-2.6-tip tree
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
> 
> We found that task which is pinned to a CPU could be starved relative to its
> allocated fair share.
> 
> The per-cpu sched_enetity load share calculation in tg_shares_up /
> update_group_shares_cpu distributes total task_group's share among all CPUs
> for a given SD domain, this would dilute task_group's per-cpu share because
> it distributes share to CPU that even has no load.  The trapped share is now
> un-consumable and it leads to fair share starvation on the runnable CPU.
> Peter was right that it is still required for the low level function to make
> distinction between a boosted share that don't have any load and actual tg
> share that should be distributed among CPUs in which the tg is running.
> 
> Patch to add that boost and we think the scheduler should only boost one
> times of tg shares over all empty CPU that don't have any load for the
> specific task_group in order to bound maximum temporary boost that a given
> task_group can have.

Yeah I was worried about you removing that boost, but since you seemed
to be doing serious testing on the thing.. A well, this sounds good.

> Signed-off-by: Ken Chen <kenchen@google.com>

OK, patch looks good too, thanks!

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1d673a9..7198a26 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1465,24 +1465,34 @@ static void __set_se_shares(
>   * Calculate and set the cpu's group shares.
>   */
>  static void
> -update_group_shares_cpu(struct task_group *tg, int cpu,
> +update_group_shares_cpu(struct task_group *tg, int cpu, int empty,
>  			unsigned long sd_shares, unsigned long sd_rq_weight)
>  {
> -	unsigned long shares;
> +	unsigned long shares, raw_shares;
>  	unsigned long rq_weight;
> 
>  	if (!tg->se[cpu])
>  		return;
> 
>  	rq_weight = tg->cfs_rq[cpu]->rq_weight;
> -
> -	/*
> -	 *           \Sum shares * rq_weight
> -	 * shares =  -----------------------
> -	 *               \Sum rq_weight
> -	 *
> -	 */
> -	shares = (sd_shares * rq_weight) / sd_rq_weight;
> +	if (rq_weight) {
> +		/*
> +		 *           \Sum shares * rq_weight
> +		 * shares =  -----------------------
> +		 *               \Sum rq_weight
> +		 *
> +		 */
> +		raw_shares = (sd_shares * rq_weight) / sd_rq_weight;
> +		shares = raw_shares;
> +	} else {
> +		/*
> +		 * If there are currently no tasks on the cpu pretend there
> +		 * is one of average load so that when a new task gets to
> +		 * run here it will not get delayed by group starvation.
> +		 */
> +		raw_shares = 0;
> +		shares = sd_shares / empty;
> +	}
>  	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
> 
>  	if (abs(shares - tg->se[cpu]->load.weight) >
> @@ -1491,7 +1501,7 @@ update_group_shares_cpu(
>  		unsigned long flags;
> 
>  		spin_lock_irqsave(&rq->lock, flags);
> -		tg->cfs_rq[cpu]->shares = shares;
> +		tg->cfs_rq[cpu]->shares = raw_shares;
> 
>  		__set_se_shares(tg->se[cpu], shares);
>  		spin_unlock_irqrestore(&rq->lock, flags);
> @@ -1508,17 +1518,12 @@ static int tg_shares_up(
>  	unsigned long weight, rq_weight = 0;
>  	unsigned long shares = 0;
>  	struct sched_domain *sd = data;
> -	int i;
> +	int i, empty = 0;
> 
>  	for_each_cpu(i, sched_domain_span(sd)) {
> -		/*
> -		 * If there are currently no tasks on the cpu pretend there
> -		 * is one of average load so that when a new task gets to
> -		 * run here it will not get delayed by group starvation.
> -		 */
>  		weight = tg->cfs_rq[i]->load.weight;
>  		if (!weight)
> -			weight = NICE_0_LOAD;
> +			empty++;
> 
>  		tg->cfs_rq[i]->rq_weight = weight;
>  		rq_weight += weight;
> @@ -1532,7 +1537,7 @@ static int tg_shares_up(
>  		shares = tg->shares;
> 
>  	for_each_cpu(i, sched_domain_span(sd))
> -		update_group_shares_cpu(tg, i, shares, rq_weight);
> +		update_group_shares_cpu(tg, i, empty, shares, rq_weight);
> 
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [patch] sched: fix uneven per-cpu task_group share distribution
  2008-12-17  8:38 ` Peter Zijlstra
@ 2008-12-18 13:42   ` Ingo Molnar
  0 siblings, 0 replies; 5+ messages in thread
From: Ingo Molnar @ 2008-12-18 13:42 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ken Chen, Linux Kernel Mailing List


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Mon, 2008-12-15 at 23:37 -0800, Ken Chen wrote:
> > While testing CFS scheduler on linux-2.6-tip tree
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
> > 
> > We found that task which is pinned to a CPU could be starved relative to its
> > allocated fair share.
> > 
> > The per-cpu sched_enetity load share calculation in tg_shares_up /
> > update_group_shares_cpu distributes total task_group's share among all CPUs
> > for a given SD domain, this would dilute task_group's per-cpu share because
> > it distributes share to CPU that even has no load.  The trapped share is now
> > un-consumable and it leads to fair share starvation on the runnable CPU.
> > Peter was right that it is still required for the low level function to make
> > distinction between a boosted share that don't have any load and actual tg
> > share that should be distributed among CPUs in which the tg is running.
> > 
> > Patch to add that boost and we think the scheduler should only boost one
> > times of tg shares over all empty CPU that don't have any load for the
> > specific task_group in order to bound maximum temporary boost that a given
> > task_group can have.
> 
> Yeah I was worried about you removing that boost, but since you seemed
> to be doing serious testing on the thing.. A well, this sounds good.
> 
> > Signed-off-by: Ken Chen <kenchen@google.com>
> 
> OK, patch looks good too, thanks!
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

applied to tip/cpus4096. [not in the tip/sched/core branch, because 
cpumask_t changes conflict with Ken's fix so this is the proper ordering.]

	Ingo

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-12-18 13:42 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-16  7:37 [patch] sched: fix uneven per-cpu task_group share distribution Ken Chen
2008-12-17  8:20 ` Ken Chen
2008-12-17  8:34   ` Peter Zijlstra
2008-12-17  8:38 ` Peter Zijlstra
2008-12-18 13:42   ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox