All of lore.kernel.org
 help / color / mirror / Atom feed
From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
To: "Chris Friesen" <cfriesen@nortel.com>
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu,
	a.p.zijlstra@chello.nl, pj@sgi.com,
	Balbir Singh <balbir@in.ibm.com>,
	aneesh.kumar@linux.vnet.ibm.com, dhaval@linux.vnet.ibm.com
Subject: Re: fair group scheduler not so fair?
Date: Thu, 29 May 2008 22:16:07 +0530	[thread overview]
Message-ID: <20080529164607.GC12836@linux.vnet.ibm.com> (raw)
In-Reply-To: <483DA5E7.5050600@nortel.com>

On Wed, May 28, 2008 at 12:35:19PM -0600, Chris Friesen wrote:
> Looking much better, but still some fairness issues with more complex 
> setups.
>
> pid 2477 in A, others in B
> 2477	99.5%
> 2478	49.9%
> 2479	49.9%
>
> move 2478 to A
> 2479	99.9%
> 2477	49.9%
> 2478	49.9%
>
> So far so good.  I then created C, and moved 2478 to it.  A 3-second "top" 
> gave almost a 15% error from the desired behaviour for one group:
>
> 2479	76.2%
> 2477	72.2%
> 2478	51.0%
>
>
> A 10-sec average was better, but we still see errors of 6%:
> 2478	72.8%
> 2477	64.0%
> 2479	63.2%

Found couple of issues:

	1. A minor bug in load_balance_fair() in calculation of moved_load:

		moved_load /= busiest_cfs_rq->load.weight + 1;

	   In place of busiest_cfs_rq->load.weight, the load before
	   moving tasks needs to be used. Fix in the updated patch below.

	2. We walk task groups sequentially in load_balance_fair() w/o
	   necessarily looking for the best group. This results in
	   load_balance_fair() in picking non optimal group/tasks
	   to pull. I have a hack below (strict = 1/0) to rectify this
	   problem, but we need a better algo to pick the best group
	   from which to pull tasks.

	3. sd->imbalance_pct (default = 125) specifies how much imbalance we 
	   tolerate. Lower the value, better will be the fairness. To check 
	   this, I changed default to 105, which is giving me better
	   results.

With the updated patch and imbalance_pct = 105, here's how my 60-sec avg looks 
like:

 4353 root      20   0  1384  232  176 R 67.0  0.0   2:47.23  1 hoga
 4354 root      20   0  1384  228  176 R 66.5  0.0   2:44.65  1 hogb
 4355 root      20   0  1384  228  176 R 66.3  0.0   2:28.18  0 hogb

Error is < 1%

> I then set up a scenario with 3 tasks in A, 2 in B, and 1 in C.  A 
> 10-second "top" gave errors of up to 6.5%:
> 2500	60.1%
> 2491	37.5%
> 2492	37.4%
> 2489	25.0%
> 2488	19.9%
> 2490	19.9%
>
> a re-test gave errors of up to 8.1%:
>
> 2534	74.8%
> 2533	30.1%
> 2532	30.0%
> 2529	25.0%
> 2530	20.0%
> 2531	20.0%
>
> Another retest gave perfect results initially:
>
> 2559	66.5%
> 2560	33.4%
> 2561	33.3%
> 2564	22.3%
> 2562	22.2%
> 2563	22.1%
>
> but moving 2564 from group A to C and then back to A disturbed the perfect 
> division of time and resulted in almost the same utilization pattern as 
> above:
>
> 2559	74.9%
> 2560	30.0%
> 2561	29.6%
> 2564	25.3%
> 2562	20.0%
> 2563	20.0%

Again with the updated patch and changed imbalance_pct, here's what I see:

 4458 root      20   0  1384  232  176 R 66.3  0.0   2:11.04  0 hogc
 4457 root      20   0  1384  232  176 R 33.7  0.0   1:06.19  0 hogb
 4456 root      20   0  1384  232  176 R 33.4  0.0   1:06.59  0 hogb
 4455 root      20   0  1384  228  176 R 22.5  0.0   0:44.09  0 hoga
 4453 root      20   0  1384  232  176 R 22.3  0.0   0:44.10  1 hoga
 4454 root      20   0  1384  228  176 R 22.2  0.0   0:43.94  1 hoga

(Error < 1%)

In summary, can you do this before running your tests:

1. Apply updated patch below on top of 2.6.26-rc3 + Peter's patches
(http://programming.kicks-ass.net/kernel-patches/sched-smp-group-fixes/)

2. Setup test env as below:

	# mkdir /cgroup
	# mount -t cgroup -ocpu none /cgroup
	# cd /cgroup

	# #Move all tasks to 'sys' group and give it low shares
	# mkdir sys
	# cd sys
	# for i in `cat ../tasks`
	  do
		echo $i > tasks
	  done
	# echo 100 > cpu.shares

	# cd /proc/sys/kernel/sched_domain
	# for i in `find . -name imbalance_pct`; do echo 105 > $i; done


---
 init/Kconfig         |    2 +-
 kernel/sched.c       |   12 +++++++++---
 kernel/sched_debug.c |    3 ++-
 kernel/sched_fair.c  |   26 +++++++++++++++++---------
 4 files changed, 29 insertions(+), 14 deletions(-)

Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -349,7 +349,7 @@ config RT_GROUP_SCHED
 	  See Documentation/sched-rt-group.txt for more information.
 
 choice
-	depends on GROUP_SCHED
+	depends on GROUP_SCHED && (FAIR_GROUP_SCHED || RT_GROUP_SCHED)
 	prompt "Basis for grouping tasks"
 	default USER_SCHED
 
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -1398,7 +1398,7 @@ static unsigned long
 balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	      unsigned long max_load_move, struct sched_domain *sd,
 	      enum cpu_idle_type idle, int *all_pinned,
-	      int *this_best_prio, struct rq_iterator *iterator);
+	      int *this_best_prio, struct rq_iterator *iterator, int strict);
 
 static int
 iter_move_one_task(struct rq *this_rq, int this_cpu, struct rq *busiest,
@@ -1534,6 +1534,9 @@ tg_shares_up(struct task_group *tg, int 
 	unsigned long shares = 0;
 	int i;
 
+	if (!tg->parent)
+		return;
+
 	for_each_cpu_mask(i, sd->span) {
 		rq_weight += tg->cfs_rq[i]->load.weight;
 		shares += tg->cfs_rq[i]->shares;
@@ -2896,7 +2899,7 @@ static unsigned long
 balance_tasks(struct rq *this_rq, int this_cpu, struct rq *busiest,
 	      unsigned long max_load_move, struct sched_domain *sd,
 	      enum cpu_idle_type idle, int *all_pinned,
-	      int *this_best_prio, struct rq_iterator *iterator)
+	      int *this_best_prio, struct rq_iterator *iterator, int strict)
 {
 	int loops = 0, pulled = 0, pinned = 0, skip_for_load;
 	struct task_struct *p;
@@ -2919,7 +2922,10 @@ next:
 	 * skip a task if it will be the highest priority task (i.e. smallest
 	 * prio value) on its new queue regardless of its load weight
 	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
+	if (strict)
+		skip_for_load = p->se.load.weight >= rem_load_move;
+	else
+		skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
 							 SCHED_LOAD_SCALE_FUZZ;
 	if ((skip_for_load && p->prio >= *this_best_prio) ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -119,7 +119,7 @@ void print_cfs_rq(struct seq_file *m, in
 	struct sched_entity *last;
 	unsigned long flags;
 
-#if !defined(CONFIG_CGROUP_SCHED) || !defined(CONFIG_USER_SCHED)
+#ifndef CONFIG_CGROUP_SCHED
 	SEQ_printf(m, "\ncfs_rq[%d]:\n", cpu);
 #else
 	char path[128] = "";
@@ -170,6 +170,7 @@ void print_cfs_rq(struct seq_file *m, in
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+	SEQ_printf(m, "  .%-30s: %lu\n", "h_load", cfs_rq->h_load);
 #endif
 #endif
 }
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1386,9 +1386,6 @@ __load_balance_iterator(struct cfs_rq *c
 		next = next->next;
 	} while (next != &cfs_rq->tasks && !entity_is_task(se));
 
-	if (next == &cfs_rq->tasks)
-		return NULL;
-
 	cfs_rq->balance_iterator = next;
 
 	if (entity_is_task(se))
@@ -1415,7 +1412,7 @@ static unsigned long
 __load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		unsigned long max_load_move, struct sched_domain *sd,
 		enum cpu_idle_type idle, int *all_pinned, int *this_best_prio,
-		struct cfs_rq *cfs_rq)
+		struct cfs_rq *cfs_rq, int strict)
 {
 	struct rq_iterator cfs_rq_iterator;
 
@@ -1425,10 +1422,11 @@ __load_balance_fair(struct rq *this_rq, 
 
 	return balance_tasks(this_rq, this_cpu, busiest,
 			max_load_move, sd, idle, all_pinned,
-			this_best_prio, &cfs_rq_iterator);
+			this_best_prio, &cfs_rq_iterator, strict);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -1438,13 +1436,17 @@ load_balance_fair(struct rq *this_rq, in
 	long rem_load_move = max_load_move;
 	int busiest_cpu = cpu_of(busiest);
 	struct task_group *tg;
+	int strict = 1;
 
 	update_h_load(cpu_of(busiest));
 
 	rcu_read_lock();
+
+retry:
 	list_for_each_entry(tg, &task_groups, list) {
 		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
 		long rem_load, moved_load;
+		unsigned long busiest_cfs_rq_load;
 
 		/*
 		 * empty group
@@ -1452,25 +1454,31 @@ load_balance_fair(struct rq *this_rq, in
 		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = rem_load_move * busiest_cfs_rq->load.weight;
+		busiest_cfs_rq_load = busiest_cfs_rq->load.weight;
+		rem_load = rem_load_move * busiest_cfs_rq_load;
 		rem_load /= busiest_cfs_rq->h_load + 1;
 
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
-				tg->cfs_rq[busiest_cpu]);
+				tg->cfs_rq[busiest_cpu], strict);
 
 		if (!moved_load)
 			continue;
 
 		moved_load *= busiest_cfs_rq->h_load;
-		moved_load /= busiest_cfs_rq->load.weight + 1;
+		moved_load /= busiest_cfs_rq_load + 1;
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)
 			break;
 	}
+
+	if (rem_load_move && strict--)
+		goto retry;
+
 	rcu_read_unlock();
 
+
 	return max_load_move - rem_load_move;
 }
 #else
@@ -1482,7 +1490,7 @@ load_balance_fair(struct rq *this_rq, in
 {
 	return __load_balance_fair(this_rq, this_cpu, busiest,
 			max_load_move, sd, idle, all_pinned,
-			this_best_prio, &busiest->cfs);
+			this_best_prio, &busiest->cfs, 0);
 }
 #endif
 


-- 
Regards,
vatsa

  parent reply	other threads:[~2008-05-29 16:37 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-21 23:59 fair group scheduler not so fair? Chris Friesen
2008-05-22  6:56 ` Peter Zijlstra
2008-05-22 20:02   ` Chris Friesen
2008-05-22 20:07     ` Peter Zijlstra
2008-05-22 20:18       ` Li, Tong N
2008-05-22 21:13         ` Peter Zijlstra
2008-05-23  0:17           ` Chris Friesen
2008-05-23  7:44             ` Srivatsa Vaddagiri
2008-05-23  9:42         ` Srivatsa Vaddagiri
2008-05-23  9:39           ` Peter Zijlstra
2008-05-23 10:19             ` Srivatsa Vaddagiri
2008-05-23 10:16               ` Peter Zijlstra
2008-05-27 17:15 ` Srivatsa Vaddagiri
2008-05-27 18:13   ` Chris Friesen
2008-05-28 16:33     ` Srivatsa Vaddagiri
2008-05-28 18:35       ` Chris Friesen
2008-05-28 18:47         ` Dhaval Giani
2008-05-29  2:50         ` Srivatsa Vaddagiri
2008-05-29 16:46         ` Srivatsa Vaddagiri [this message]
2008-05-29 16:47           ` Srivatsa Vaddagiri
2008-05-29 21:30           ` Chris Friesen
2008-05-30  6:43             ` Dhaval Giani
2008-05-30 10:21               ` Srivatsa Vaddagiri
2008-05-30 11:36             ` Srivatsa Vaddagiri
2008-06-02 20:03               ` Chris Friesen
2008-05-27 17:28 ` Srivatsa Vaddagiri

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080529164607.GC12836@linux.vnet.ibm.com \
    --to=vatsa@linux.vnet.ibm.com \
    --cc=a.p.zijlstra@chello.nl \
    --cc=aneesh.kumar@linux.vnet.ibm.com \
    --cc=balbir@in.ibm.com \
    --cc=cfriesen@nortel.com \
    --cc=dhaval@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=pj@sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.