public inbox for linux-kernel@vger.kernel.org
* [PATCH 00/30] SMP-group balancer - take 3
@ 2008-06-27 11:41 Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 01/30] sched: clean up some unused variables Peter Zijlstra
                   ` (31 more replies)
  0 siblings, 32 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

Hi,

Another go at SMP fairness for group scheduling.

This code needs some serious testing...

However, on my system performance doesn't tank as much as it used to.
I've run the sysbench and volanomark benchmarks.

The machine is a Quad core (Intel Q9450) with 4GB of RAM.
Fedora9 - x86_64

sysbench-0.4.8 + postgresql-8.3.3
volanomark-2.5.0.9 + openjdk-1.6.0

I've used cgroup group scheduling.

cgroup:/ - means all tasks are in the root group
cgroup:/foo - means all tasks are in a subgroup

mkdir /cgroup/foo
for i in `cat /cgroup/tasks`; do
  echo $i > /cgroup/foo/tasks
done
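For completeness, the full sequence looks roughly like this (the /cgroup
mount point and mounting with the cpu controller are assumptions about my
local setup; moving some kernel threads may fail and can be ignored):

```shell
# mount the cgroup filesystem with the cpu controller, if not already mounted
mkdir -p /cgroup
mount -t cgroup -o cpu none /cgroup

# create a subgroup and move all tasks into it
mkdir /cgroup/foo
for i in `cat /cgroup/tasks`; do
  echo $i > /cgroup/foo/tasks
done
```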

The patches are against tip/auto-sched-next of a few days ago.

---

.25

[root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
  1:     transactions:                        50514  (841.90 per sec.)
  2:     transactions:                        98745  (1645.73 per sec.)
  4:     transactions:                        192682 (3211.31 per sec.)
  8:     transactions:                        192082 (3201.26 per sec.)
 16:     transactions:                        188891 (3147.95 per sec.)
 32:     transactions:                        182364 (3039.12 per sec.)
 64:     transactions:                        169412 (2822.94 per sec.)
128:     transactions:                        139505 (2323.95 per sec.)
256:     transactions:                        131516 (2188.98 per sec.)

[root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
Average throughput = 113350 messages per second
Average throughput = 112230 messages per second
Average throughput = 113125 messages per second


.26-rc

cgroup:/

[root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
  1:     transactions:                        50553  (842.54 per sec.)
  2:     transactions:                        98625  (1643.74 per sec.)
  4:     transactions:                        191351 (3189.12 per sec.)
  8:     transactions:                        193525 (3225.32 per sec.)
 16:     transactions:                        190516 (3175.10 per sec.)
 32:     transactions:                        186914 (3114.96 per sec.)
 64:     transactions:                        178940 (2981.78 per sec.)
128:     transactions:                        156430 (2606.00 per sec.)
256:     transactions:                        134929 (2246.63 per sec.)

[root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
Average throughput = 124089 messages per second
Average throughput = 121962 messages per second
Average throughput = 121223 messages per second


cgroup:/foo

[root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
  1:     transactions:                        50246  (837.43 per sec.)
  2:     transactions:                        97466  (1624.41 per sec.)
  4:     transactions:                        179609 (2993.43 per sec.)
  8:     transactions:                        190931 (3182.07 per sec.)
 16:     transactions:                        189882 (3164.50 per sec.)
 32:     transactions:                        184649 (3077.14 per sec.)
 64:     transactions:                        178200 (2969.46 per sec.)
128:     transactions:                        158835 (2646.14 per sec.)
256:     transactions:                        142100 (2366.51 per sec.)

[root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
Average throughput = 117789 messages per second
Average throughput = 118154 messages per second
Average throughput = 118945 messages per second


.26-rc-smp-group

cgroup:/

[root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
  1:     transactions:                        50137  (835.61 per sec.)
  2:     transactions:                        97406  (1623.41 per sec.)
  4:     transactions:                        170755 (2845.88 per sec.)
  8:     transactions:                        187406 (3123.35 per sec.)
 16:     transactions:                        186865 (3114.18 per sec.)
 32:     transactions:                        183559 (3059.03 per sec.)
 64:     transactions:                        176834 (2946.70 per sec.)
128:     transactions:                        158882 (2647.04 per sec.)
256:     transactions:                        145081 (2415.81 per sec.)

[root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
Average throughput = 121499 messages per second
Average throughput = 120181 messages per second
Average throughput = 119775 messages per second


cgroup:/foo

[root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
  1:     transactions:                        49564  (826.06 per sec.)
  2:     transactions:                        96642  (1610.67 per sec.)
  4:     transactions:                        183081 (3051.29 per sec.)
  8:     transactions:                        187553 (3125.79 per sec.)
 16:     transactions:                        185435 (3090.45 per sec.)
 32:     transactions:                        182314 (3038.25 per sec.)
 64:     transactions:                        174527 (2908.22 per sec.)
128:     transactions:                        159321 (2654.24 per sec.)
256:     transactions:                        140167 (2333.82 per sec.)

[root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
Average throughput = 130208 messages per second
Average throughput = 129086 messages per second
Average throughput = 129362 messages per second


-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/30] sched: clean up some unused variables
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 02/30] sched: revert the revert of: weight calculations Peter Zijlstra
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-rt-cleanup-unused-variable.patch --]
[-- Type: text/plain, Size: 1144 bytes --]

In file included from /mnt/build/linux-2.6/kernel/sched.c:1496:
/mnt/build/linux-2.6/kernel/sched_rt.c: In function '__enable_runtime':
/mnt/build/linux-2.6/kernel/sched_rt.c:339: warning: unused variable 'rd'
/mnt/build/linux-2.6/kernel/sched_rt.c: In function 'requeue_rt_entity':
/mnt/build/linux-2.6/kernel/sched_rt.c:692: warning: unused variable 'queue'

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched_rt.c |    2 --
 1 file changed, 2 deletions(-)

Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -336,7 +336,6 @@ static void disable_runtime(struct rq *r
 
 static void __enable_runtime(struct rq *rq)
 {
-	struct root_domain *rd = rq->rd;
 	struct rt_rq *rt_rq;
 
 	if (unlikely(!scheduler_running))
@@ -689,7 +688,6 @@ static
 void requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
 {
 	struct rt_prio_array *array = &rt_rq->active;
-	struct list_head *queue = array->queue + rt_se_prio(rt_se);
 
 	if (on_rt_rq(rt_se)) {
 		list_del_init(&rt_se->run_list);

-- 



* [PATCH 02/30] sched: revert the revert of: weight calculations
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 01/30] sched: clean up some unused variables Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-30 18:07   ` Balbir Singh
  2008-06-27 11:41 ` [PATCH 03/30] sched: fix calc_delta_asym() Peter Zijlstra
                   ` (29 subsequent siblings)
  31 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-revert-revert-sched-weight-calc.patch --]
[-- Type: text/plain, Size: 5702 bytes --]

Try again..

initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c          |    9 +---
 kernel/sched_fair.c     |  105 ++++++++++++++++++++++++++++++++----------------
 kernel/sched_features.h |    1 
 3 files changed, 76 insertions(+), 39 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1342,6 +1342,9 @@ static void __resched_task(struct task_s
  */
 #define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
 
+/*
+ * delta *= weight / lw
+ */
 static unsigned long
 calc_delta_mine(unsigned long delta_exec, unsigned long weight,
 		struct load_weight *lw)
@@ -1369,12 +1372,6 @@ calc_delta_mine(unsigned long delta_exec
 	return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
 }
 
-static inline unsigned long
-calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
-{
-	return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
-}
-
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -334,6 +334,34 @@ int sched_nr_latency_handler(struct ctl_
 #endif
 
 /*
+ * delta *= w / rw
+ */
+static inline unsigned long
+calc_delta_weight(unsigned long delta, struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		delta = calc_delta_mine(delta,
+				se->load.weight, &cfs_rq_of(se)->load);
+	}
+
+	return delta;
+}
+
+/*
+ * delta *= rw / w
+ */
+static inline unsigned long
+calc_delta_fair(unsigned long delta, struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		delta = calc_delta_mine(delta,
+				cfs_rq_of(se)->load.weight, &se->load);
+	}
+
+	return delta;
+}
+
+/*
  * The idea is to set a period in which each task runs once.
  *
  * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
@@ -362,47 +390,54 @@ static u64 __sched_period(unsigned long 
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 slice = __sched_period(cfs_rq->nr_running);
-
-	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
-
-		slice *= se->load.weight;
-		do_div(slice, cfs_rq->load.weight);
-	}
-
-
-	return slice;
+	return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
 }
 
 /*
  * We calculate the vruntime slice of a to be inserted task
  *
- * vs = s/w = p/rw
+ * vs = s*rw/w = p
  */
 static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	unsigned long nr_running = cfs_rq->nr_running;
-	unsigned long weight;
-	u64 vslice;
 
 	if (!se->on_rq)
 		nr_running++;
 
-	vslice = __sched_period(nr_running);
+	return __sched_period(nr_running);
+}
+
+/*
+ * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
+ * that it favours >=0 over <0.
+ *
+ *   -20         |
+ *               |
+ *     0 --------+-------
+ *             .'
+ *    19     .'
+ *
+ */
+static unsigned long
+calc_delta_asym(unsigned long delta, struct sched_entity *se)
+{
+	struct load_weight lw = {
+		.weight = NICE_0_LOAD,
+		.inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)
+	};
 
 	for_each_sched_entity(se) {
-		cfs_rq = cfs_rq_of(se);
+		struct load_weight *se_lw = &se->load;
 
-		weight = cfs_rq->load.weight;
-		if (!se->on_rq)
-			weight += se->load.weight;
+		if (se->load.weight < NICE_0_LOAD)
+			se_lw = &lw;
 
-		vslice *= NICE_0_LOAD;
-		do_div(vslice, weight);
+		delta = calc_delta_mine(delta,
+				cfs_rq_of(se)->load.weight, se_lw);
 	}
 
-	return vslice;
+	return delta;
 }
 
 /*
@@ -419,11 +454,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
 
 	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq, exec_clock, delta_exec);
-	delta_exec_weighted = delta_exec;
-	if (unlikely(curr->load.weight != NICE_0_LOAD)) {
-		delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
-							&curr->load);
-	}
+	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
 	curr->vruntime += delta_exec_weighted;
 }
 
@@ -609,8 +640,17 @@ place_entity(struct cfs_rq *cfs_rq, stru
 
 	if (!initial) {
 		/* sleeps upto a single latency don't count. */
-		if (sched_feat(NEW_FAIR_SLEEPERS))
-			vruntime -= sysctl_sched_latency;
+		if (sched_feat(NEW_FAIR_SLEEPERS)) {
+			unsigned long thresh = sysctl_sched_latency;
+
+			/*
+			 * convert the sleeper threshold into virtual time
+			 */
+			if (sched_feat(NORMALIZED_SLEEPER))
+				thresh = calc_delta_fair(thresh, se);
+
+			vruntime -= thresh;
+		}
 
 		/* ensure we never gain time by being placed backwards. */
 		vruntime = max_vruntime(se->vruntime, vruntime);
@@ -1111,11 +1151,10 @@ static unsigned long wakeup_gran(struct 
 	unsigned long gran = sysctl_sched_wakeup_granularity;
 
 	/*
-	 * More easily preempt - nice tasks, while not making
-	 * it harder for + nice tasks.
+	 * More easily preempt - nice tasks, while not making it harder for
+	 * + nice tasks.
 	 */
-	if (unlikely(se->load.weight > NICE_0_LOAD))
-		gran = calc_delta_fair(gran, &se->load);
+	gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
 
 	return gran;
 }
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -1,4 +1,5 @@
 SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
+SCHED_FEAT(NORMALIZED_SLEEPER, 1)
 SCHED_FEAT(WAKEUP_PREEMPT, 1)
 SCHED_FEAT(START_DEBIT, 1)
 SCHED_FEAT(AFFINE_WAKEUPS, 1)

-- 



* [PATCH 03/30] sched: fix calc_delta_asym()
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 01/30] sched: clean up some unused variables Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 02/30] sched: revert the revert of: weight calculations Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 04/30] sched: fix calc_delta_asym Peter Zijlstra
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-asyn_gran.patch --]
[-- Type: text/plain, Size: 2358 bytes --]

calc_delta_asym() is supposed to do the same as calc_delta_fair(), except
that it linearly shrinks the result for negative nice processes - this gives
them a smaller preemption threshold so that they are more easily preempted.

The problem is that for task groups se->load.weight is the per cpu share of
the actual task group weight; take that into account.

Also provide a debug switch to disable the asymmetry (which I still don't
like - but it does greatly benefit some workloads).

This would explain the interactivity issues reported against group scheduling.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c     |   28 +++++++++++++++++++++++++++-
 kernel/sched_features.h |    1 +
 2 files changed, 28 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -430,6 +430,29 @@ calc_delta_asym(unsigned long delta, str
 	for_each_sched_entity(se) {
 		struct load_weight *se_lw = &se->load;
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+		struct cfs_rq *cfs_rq = se->my_q;
+		struct task_group *tg = NULL;
+
+		if (cfs_rq)
+			tg = cfs_rq->tg;
+
+		if (tg && tg->shares < NICE_0_LOAD) {
+			/*
+			 * scale shares to what it would have been had
+			 * tg->weight been NICE_0_LOAD:
+			 *
+			 *   weight = 1024 * shares / tg->weight
+			 */
+			lw.weight *= se->load.weight;
+			lw.weight /= tg->shares;
+
+			lw.inv_weight = 0;
+
+			se_lw = &lw;
+		} else
+#endif
+
 		if (se->load.weight < NICE_0_LOAD)
 			se_lw = &lw;
 
@@ -1154,7 +1177,10 @@ static unsigned long wakeup_gran(struct 
 	 * More easily preempt - nice tasks, while not making it harder for
 	 * + nice tasks.
 	 */
-	gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+	if (sched_feat(ASYM_GRAN))
+		gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
+	else
+		gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
 
 	return gran;
 }
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -6,3 +6,4 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
 SCHED_FEAT(SYNC_WAKEUPS, 1)
 SCHED_FEAT(HRTICK, 1)
 SCHED_FEAT(DOUBLE_TICK, 0)
+SCHED_FEAT(ASYM_GRAN, 1)

-- 



* [PATCH 04/30] sched: fix calc_delta_asym
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 03/30] sched: fix calc_delta_asym() Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 05/30] sched: revert revert of: fair-group: SMP-nice for group scheduling Peter Zijlstra
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fiddle-asym.patch --]
[-- Type: text/plain, Size: 1206 bytes --]

Ok, so why are we in this mess: it used to be

  1/w

but now we've mixed rw into it, like:

  rw/w

rw being \Sum w suggests that when fiddling with w, we should also fiddle
rw, hmm?

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -432,6 +432,7 @@ calc_delta_asym(unsigned long delta, str
 
 	for_each_sched_entity(se) {
 		struct load_weight *se_lw = &se->load;
+		unsigned long rw = cfs_rq_of(se)->load.weight;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 		struct cfs_rq *cfs_rq = se->my_q;
@@ -453,14 +454,16 @@ calc_delta_asym(unsigned long delta, str
 			lw.inv_weight = 0;
 
 			se_lw = &lw;
+			rw += lw.weight - se->load.weight;
 		} else
 #endif
 
-		if (se->load.weight < NICE_0_LOAD)
+		if (se->load.weight < NICE_0_LOAD) {
 			se_lw = &lw;
+			rw += NICE_0_LOAD - se->load.weight;
+		}
 
-		delta = calc_delta_mine(delta,
-				cfs_rq_of(se)->load.weight, se_lw);
+		delta = calc_delta_mine(delta, rw, se_lw);
 	}
 
 	return delta;

-- 



* [PATCH 05/30] sched: revert revert of: fair-group: SMP-nice for group scheduling
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 04/30] sched: fix calc_delta_asym Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 06/30] sched: sched_clock_cpu() based cpu_clock() Peter Zijlstra
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-revert-revert-smp-group-balancer.patch --]
[-- Type: text/plain, Size: 22165 bytes --]

Try again..

Initial commit: 18d95a2832c1392a2d63227a7a6d433cb9f2037e
Revert: 6363ca57c76b7b83639ca8c83fc285fa26a7880e

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 
 kernel/sched.c        |  430 ++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched_debug.c  |    5 
 kernel/sched_fair.c   |  124 +++++++++-----
 kernel/sched_rt.c     |    4 
 5 files changed, 489 insertions(+), 75 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -765,6 +765,7 @@ struct sched_domain {
 	struct sched_domain *child;	/* bottom domain must be null terminated */
 	struct sched_group *groups;	/* the balancing groups of the domain */
 	cpumask_t span;			/* span of all CPUs in this domain */
+	int first_cpu;			/* cache of the first cpu in this domain */
 	unsigned long min_interval;	/* Minimum balance interval ms */
 	unsigned long max_interval;	/* Maximum balance interval ms */
 	unsigned int busy_factor;	/* less balancing by factor if busy */
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -403,6 +403,43 @@ struct cfs_rq {
 	 */
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
+
+#ifdef CONFIG_SMP
+	unsigned long task_weight;
+	unsigned long shares;
+	/*
+	 * We need space to build a sched_domain wide view of the full task
+	 * group tree, in order to avoid depending on dynamic memory allocation
+	 * during the load balancing we place this in the per cpu task group
+	 * hierarchy. This limits the load balancing to one instance per cpu,
+	 * but more should not be needed anyway.
+	 */
+	struct aggregate_struct {
+		/*
+		 *   load = weight(cpus) * f(tg)
+		 *
+		 * Where f(tg) is the recursive weight fraction assigned to
+		 * this group.
+		 */
+		unsigned long load;
+
+		/*
+		 * part of the group weight distributed to this span.
+		 */
+		unsigned long shares;
+
+		/*
+		 * The sum of all runqueue weights within this span.
+		 */
+		unsigned long rq_weight;
+
+		/*
+		 * Weight contributed by tasks; this is the part we can
+		 * influence by moving tasks around.
+		 */
+		unsigned long task_weight;
+	} aggregate;
+#endif
 #endif
 };
 
@@ -1484,6 +1521,326 @@ static unsigned long source_load(int cpu
 static unsigned long target_load(int cpu, int type);
 static unsigned long cpu_avg_load_per_task(int cpu);
 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+/*
+ * Group load balancing.
+ *
+ * We calculate a few balance domain wide aggregate numbers; load and weight.
+ * Given the pictures below, and assuming each item has equal weight:
+ *
+ *         root          1 - thread
+ *         / | \         A - group
+ *        A  1  B
+ *       /|\   / \
+ *      C 2 D 3   4
+ *      |   |
+ *      5   6
+ *
+ * load:
+ *    A and B get 1/3-rd of the total load. C and D get 1/3-rd of A's 1/3-rd,
+ *    which equals 1/9-th of the total load.
+ *
+ * shares:
+ *    The weight of this group on the selected cpus.
+ *
+ * rq_weight:
+ *    Direct sum of all the cpus' rq weights, e.g. A would get 3 while
+ *    B would get 2.
+ *
+ * task_weight:
+ *    Part of the rq_weight contributed by tasks; all groups except B would
+ *    get 1, B gets 2.
+ */
+
+static inline struct aggregate_struct *
+aggregate(struct task_group *tg, struct sched_domain *sd)
+{
+	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+}
+
+typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+
+/*
+ * Iterate the full tree, calling @down when first entering a node and @up when
+ * leaving it for the final time.
+ */
+static
+void aggregate_walk_tree(aggregate_func down, aggregate_func up,
+			 struct sched_domain *sd)
+{
+	struct task_group *parent, *child;
+
+	rcu_read_lock();
+	parent = &root_task_group;
+down:
+	(*down)(parent, sd);
+	list_for_each_entry_rcu(child, &parent->children, siblings) {
+		parent = child;
+		goto down;
+
+up:
+		continue;
+	}
+	(*up)(parent, sd);
+
+	child = parent;
+	parent = parent->parent;
+	if (parent)
+		goto up;
+	rcu_read_unlock();
+}
+
+/*
+ * Calculate the aggregate runqueue weight.
+ */
+static
+void aggregate_group_weight(struct task_group *tg, struct sched_domain *sd)
+{
+	unsigned long rq_weight = 0;
+	unsigned long task_weight = 0;
+	int i;
+
+	for_each_cpu_mask(i, sd->span) {
+		rq_weight += tg->cfs_rq[i]->load.weight;
+		task_weight += tg->cfs_rq[i]->task_weight;
+	}
+
+	aggregate(tg, sd)->rq_weight = rq_weight;
+	aggregate(tg, sd)->task_weight = task_weight;
+}
+
+/*
+ * Compute the weight of this group on the given cpus.
+ */
+static
+void aggregate_group_shares(struct task_group *tg, struct sched_domain *sd)
+{
+	unsigned long shares = 0;
+	int i;
+
+	for_each_cpu_mask(i, sd->span)
+		shares += tg->cfs_rq[i]->shares;
+
+	if ((!shares && aggregate(tg, sd)->rq_weight) || shares > tg->shares)
+		shares = tg->shares;
+
+	aggregate(tg, sd)->shares = shares;
+}
+
+/*
+ * Compute the load fraction assigned to this group, relies on the aggregate
+ * weight and this group's parent's load, i.e. top-down.
+ */
+static
+void aggregate_group_load(struct task_group *tg, struct sched_domain *sd)
+{
+	unsigned long load;
+
+	if (!tg->parent) {
+		int i;
+
+		load = 0;
+		for_each_cpu_mask(i, sd->span)
+			load += cpu_rq(i)->load.weight;
+
+	} else {
+		load = aggregate(tg->parent, sd)->load;
+
+		/*
+		 * shares is our weight in the parent's rq so
+		 * shares/parent->rq_weight gives our fraction of the load
+		 */
+		load *= aggregate(tg, sd)->shares;
+		load /= aggregate(tg->parent, sd)->rq_weight + 1;
+	}
+
+	aggregate(tg, sd)->load = load;
+}
+
+static void __set_se_shares(struct sched_entity *se, unsigned long shares);
+
+/*
+ * Calculate and set the cpu's group shares.
+ */
+static void
+__update_group_shares_cpu(struct task_group *tg, struct sched_domain *sd,
+			  int tcpu)
+{
+	int boost = 0;
+	unsigned long shares;
+	unsigned long rq_weight;
+
+	if (!tg->se[tcpu])
+		return;
+
+	rq_weight = tg->cfs_rq[tcpu]->load.weight;
+
+	/*
+	 * If there are currently no tasks on the cpu pretend there is one of
+	 * average load so that when a new task gets to run here it will not
+	 * get delayed by group starvation.
+	 */
+	if (!rq_weight) {
+		boost = 1;
+		rq_weight = NICE_0_LOAD;
+	}
+
+	/*
+	 *           \Sum shares * rq_weight
+	 * shares =  -----------------------
+	 *               \Sum rq_weight
+	 *
+	 */
+	shares = aggregate(tg, sd)->shares * rq_weight;
+	shares /= aggregate(tg, sd)->rq_weight + 1;
+
+	/*
+	 * record the actual number of shares, not the boosted amount.
+	 */
+	tg->cfs_rq[tcpu]->shares = boost ? 0 : shares;
+
+	if (shares < MIN_SHARES)
+		shares = MIN_SHARES;
+	else if (shares > MAX_SHARES)
+		shares = MAX_SHARES;
+
+	__set_se_shares(tg->se[tcpu], shares);
+}
+
+/*
+ * Re-adjust the weights on the cpu the task came from and on the cpu the
+ * task went to.
+ */
+static void
+__move_group_shares(struct task_group *tg, struct sched_domain *sd,
+		    int scpu, int dcpu)
+{
+	unsigned long shares;
+
+	shares = tg->cfs_rq[scpu]->shares + tg->cfs_rq[dcpu]->shares;
+
+	__update_group_shares_cpu(tg, sd, scpu);
+	__update_group_shares_cpu(tg, sd, dcpu);
+
+	/*
+	 * ensure we never lose shares due to rounding errors in the
+	 * above redistribution.
+	 */
+	shares -= tg->cfs_rq[scpu]->shares + tg->cfs_rq[dcpu]->shares;
+	if (shares)
+		tg->cfs_rq[dcpu]->shares += shares;
+}
+
+/*
+ * Because changing a group's shares changes the weight of the super-group
+ * we need to walk up the tree and change all shares until we hit the root.
+ */
+static void
+move_group_shares(struct task_group *tg, struct sched_domain *sd,
+		  int scpu, int dcpu)
+{
+	while (tg) {
+		__move_group_shares(tg, sd, scpu, dcpu);
+		tg = tg->parent;
+	}
+}
+
+static
+void aggregate_group_set_shares(struct task_group *tg, struct sched_domain *sd)
+{
+	unsigned long shares = aggregate(tg, sd)->shares;
+	int i;
+
+	for_each_cpu_mask(i, sd->span) {
+		struct rq *rq = cpu_rq(i);
+		unsigned long flags;
+
+		spin_lock_irqsave(&rq->lock, flags);
+		__update_group_shares_cpu(tg, sd, i);
+		spin_unlock_irqrestore(&rq->lock, flags);
+	}
+
+	aggregate_group_shares(tg, sd);
+
+	/*
+	 * ensure we never lose shares due to rounding errors in the
+	 * above redistribution.
+	 */
+	shares -= aggregate(tg, sd)->shares;
+	if (shares) {
+		tg->cfs_rq[sd->first_cpu]->shares += shares;
+		aggregate(tg, sd)->shares += shares;
+	}
+}
+
+/*
+ * Calculate the accumulative weight and recursive load of each task group
+ * while walking down the tree.
+ */
+static
+void aggregate_get_down(struct task_group *tg, struct sched_domain *sd)
+{
+	aggregate_group_weight(tg, sd);
+	aggregate_group_shares(tg, sd);
+	aggregate_group_load(tg, sd);
+}
+
+/*
+ * Rebalance the cpu shares while walking back up the tree.
+ */
+static
+void aggregate_get_up(struct task_group *tg, struct sched_domain *sd)
+{
+	aggregate_group_set_shares(tg, sd);
+}
+
+static DEFINE_PER_CPU(spinlock_t, aggregate_lock);
+
+static void __init init_aggregate(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		spin_lock_init(&per_cpu(aggregate_lock, i));
+}
+
+static int get_aggregate(struct sched_domain *sd)
+{
+	if (!spin_trylock(&per_cpu(aggregate_lock, sd->first_cpu)))
+		return 0;
+
+	aggregate_walk_tree(aggregate_get_down, aggregate_get_up, sd);
+	return 1;
+}
+
+static void put_aggregate(struct sched_domain *sd)
+{
+	spin_unlock(&per_cpu(aggregate_lock, sd->first_cpu));
+}
+
+static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
+{
+	cfs_rq->shares = shares;
+}
+
+#else
+
+static inline void init_aggregate(void)
+{
+}
+
+static inline int get_aggregate(struct sched_domain *sd)
+{
+	return 0;
+}
+
+static inline void put_aggregate(struct sched_domain *sd)
+{
+}
+#endif
+
 #endif
 
 #include "sched_stats.h"
@@ -1498,26 +1855,14 @@ static int task_hot(struct task_struct *
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
 
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
-	update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
-	update_load_sub(&rq->load, p->se.load.weight);
-}
-
-static void inc_nr_running(struct task_struct *p, struct rq *rq)
+static void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
-	inc_load(rq, p);
 }
 
-static void dec_nr_running(struct task_struct *p, struct rq *rq)
+static void dec_nr_running(struct rq *rq)
 {
 	rq->nr_running--;
-	dec_load(rq, p);
 }
 
 static void set_load_weight(struct task_struct *p)
@@ -1609,7 +1954,7 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, wakeup);
-	inc_nr_running(p, rq);
+	inc_nr_running(rq);
 }
 
 /*
@@ -1621,7 +1966,7 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, sleep);
-	dec_nr_running(p, rq);
+	dec_nr_running(rq);
 }
 
 /**
@@ -2274,7 +2619,7 @@ void wake_up_new_task(struct task_struct
 		 * management (if any):
 		 */
 		p->sched_class->task_new(rq, p);
-		inc_nr_running(p, rq);
+		inc_nr_running(rq);
 	}
 	check_preempt_curr(rq, p);
 #ifdef CONFIG_SMP
@@ -3265,9 +3610,12 @@ static int load_balance(int this_cpu, st
 	unsigned long imbalance;
 	struct rq *busiest;
 	unsigned long flags;
+	int unlock_aggregate;
 
 	cpus_setall(*cpus);
 
+	unlock_aggregate = get_aggregate(sd);
+
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In this case,
@@ -3383,8 +3731,9 @@ redo:
 
 	if (!ld_moved && !sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
-		return -1;
-	return ld_moved;
+		ld_moved = -1;
+
+	goto out;
 
 out_balanced:
 	schedstat_inc(sd, lb_balanced[idle]);
@@ -3399,8 +3748,13 @@ out_one_pinned:
 
 	if (!sd_idle && sd->flags & SD_SHARE_CPUPOWER &&
 	    !test_sd_parent(sd, SD_POWERSAVINGS_BALANCE))
-		return -1;
-	return 0;
+		ld_moved = -1;
+	else
+		ld_moved = 0;
+out:
+	if (unlock_aggregate)
+		put_aggregate(sd);
+	return ld_moved;
 }
 
 /*
@@ -4588,10 +4942,8 @@ void set_user_nice(struct task_struct *p
 		goto out_unlock;
 	}
 	on_rq = p->se.on_rq;
-	if (on_rq) {
+	if (on_rq)
 		dequeue_task(rq, p, 0);
-		dec_load(rq, p);
-	}
 
 	p->static_prio = NICE_TO_PRIO(nice);
 	set_load_weight(p);
@@ -4601,7 +4953,6 @@ void set_user_nice(struct task_struct *p
 
 	if (on_rq) {
 		enqueue_task(rq, p, 0);
-		inc_load(rq, p);
 		/*
 		 * If the task increased its priority or is running and
 		 * lowered its priority, then reschedule its CPU:
@@ -7040,6 +7391,7 @@ static int __build_sched_domains(const c
 			SD_INIT(sd, ALLNODES);
 			set_domain_attribute(sd, attr);
 			sd->span = *cpu_map;
+			sd->first_cpu = first_cpu(sd->span);
 			cpu_to_allnodes_group(i, cpu_map, &sd->groups, tmpmask);
 			p = sd;
 			sd_allnodes = 1;
@@ -7050,6 +7402,7 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, NODE);
 		set_domain_attribute(sd, attr);
 		sched_domain_node_span(cpu_to_node(i), &sd->span);
+		sd->first_cpu = first_cpu(sd->span);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -7061,6 +7414,7 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, CPU);
 		set_domain_attribute(sd, attr);
 		sd->span = *nodemask;
+		sd->first_cpu = first_cpu(sd->span);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -7072,6 +7426,7 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, MC);
 		set_domain_attribute(sd, attr);
 		sd->span = cpu_coregroup_map(i);
+		sd->first_cpu = first_cpu(sd->span);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
@@ -7084,6 +7439,7 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, SIBLING);
 		set_domain_attribute(sd, attr);
 		sd->span = per_cpu(cpu_sibling_map, i);
+		sd->first_cpu = first_cpu(sd->span);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
@@ -7781,6 +8137,7 @@ void __init sched_init(void)
 	}
 
 #ifdef CONFIG_SMP
+	init_aggregate();
 	init_defrootdomain();
 #endif
 
@@ -8346,14 +8703,11 @@ void sched_move_task(struct task_struct 
 #endif /* CONFIG_GROUP_SCHED */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void set_se_shares(struct sched_entity *se, unsigned long shares)
+static void __set_se_shares(struct sched_entity *se, unsigned long shares)
 {
 	struct cfs_rq *cfs_rq = se->cfs_rq;
-	struct rq *rq = cfs_rq->rq;
 	int on_rq;
 
-	spin_lock_irq(&rq->lock);
-
 	on_rq = se->on_rq;
 	if (on_rq)
 		dequeue_entity(cfs_rq, se, 0);
@@ -8363,8 +8717,17 @@ static void set_se_shares(struct sched_e
 
 	if (on_rq)
 		enqueue_entity(cfs_rq, se, 0);
+}
 
-	spin_unlock_irq(&rq->lock);
+static void set_se_shares(struct sched_entity *se, unsigned long shares)
+{
+	struct cfs_rq *cfs_rq = se->cfs_rq;
+	struct rq *rq = cfs_rq->rq;
+	unsigned long flags;
+
+	spin_lock_irqsave(&rq->lock, flags);
+	__set_se_shares(se, shares);
+	spin_unlock_irqrestore(&rq->lock, flags);
 }
 
 static DEFINE_MUTEX(shares_mutex);
@@ -8403,8 +8766,13 @@ int sched_group_set_shares(struct task_g
 	 * w/o tripping rebalance_share or load_balance_fair.
 	 */
 	tg->shares = shares;
-	for_each_possible_cpu(i)
+	for_each_possible_cpu(i) {
+		/*
+		 * force a rebalance
+		 */
+		cfs_rq_set_shares(tg->cfs_rq[i], 0);
 		set_se_shares(tg->se[i], shares);
+	}
 
 	/*
 	 * Enable load balance activity on this group, by inserting it back on
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -167,6 +167,11 @@ void print_cfs_rq(struct seq_file *m, in
 #endif
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#ifdef CONFIG_SMP
+	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+#endif
+#endif
 }
 
 void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -567,10 +567,27 @@ update_stats_curr_start(struct cfs_rq *c
  * Scheduling class queueing methods:
  */
 
+#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
+static void
+add_cfs_task_weight(struct cfs_rq *cfs_rq, unsigned long weight)
+{
+	cfs_rq->task_weight += weight;
+}
+#else
+static inline void
+add_cfs_task_weight(struct cfs_rq *cfs_rq, unsigned long weight)
+{
+}
+#endif
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_add(&cfs_rq->load, se->load.weight);
+	if (!parent_entity(se))
+		inc_cpu_load(rq_of(cfs_rq), se->load.weight);
+	if (entity_is_task(se))
+		add_cfs_task_weight(cfs_rq, se->load.weight);
 	cfs_rq->nr_running++;
 	se->on_rq = 1;
 	list_add(&se->group_node, &cfs_rq->tasks);
@@ -580,6 +597,10 @@ static void
 account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	update_load_sub(&cfs_rq->load, se->load.weight);
+	if (!parent_entity(se))
+		dec_cpu_load(rq_of(cfs_rq), se->load.weight);
+	if (entity_is_task(se))
+		add_cfs_task_weight(cfs_rq, -se->load.weight);
 	cfs_rq->nr_running--;
 	se->on_rq = 0;
 	list_del_init(&se->group_node);
@@ -1372,75 +1393,90 @@ static struct task_struct *load_balance_
 	return __load_balance_iterator(cfs_rq, cfs_rq->balance_iterator);
 }
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static int cfs_rq_best_prio(struct cfs_rq *cfs_rq)
+static unsigned long
+__load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		unsigned long max_load_move, struct sched_domain *sd,
+		enum cpu_idle_type idle, int *all_pinned, int *this_best_prio,
+		struct cfs_rq *cfs_rq)
 {
-	struct sched_entity *curr;
-	struct task_struct *p;
-
-	if (!cfs_rq->nr_running || !first_fair(cfs_rq))
-		return MAX_PRIO;
-
-	curr = cfs_rq->curr;
-	if (!curr)
-		curr = __pick_next_entity(cfs_rq);
+	struct rq_iterator cfs_rq_iterator;
 
-	p = task_of(curr);
+	cfs_rq_iterator.start = load_balance_start_fair;
+	cfs_rq_iterator.next = load_balance_next_fair;
+	cfs_rq_iterator.arg = cfs_rq;
 
-	return p->prio;
+	return balance_tasks(this_rq, this_cpu, busiest,
+			max_load_move, sd, idle, all_pinned,
+			this_best_prio, &cfs_rq_iterator);
 }
-#endif
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
 		  struct sched_domain *sd, enum cpu_idle_type idle,
 		  int *all_pinned, int *this_best_prio)
 {
-	struct cfs_rq *busy_cfs_rq;
 	long rem_load_move = max_load_move;
-	struct rq_iterator cfs_rq_iterator;
-
-	cfs_rq_iterator.start = load_balance_start_fair;
-	cfs_rq_iterator.next = load_balance_next_fair;
+	int busiest_cpu = cpu_of(busiest);
+	struct task_group *tg;
 
-	for_each_leaf_cfs_rq(busiest, busy_cfs_rq) {
-#ifdef CONFIG_FAIR_GROUP_SCHED
-		struct cfs_rq *this_cfs_rq;
+	rcu_read_lock();
+	list_for_each_entry(tg, &task_groups, list) {
 		long imbalance;
-		unsigned long maxload;
+		unsigned long this_weight, busiest_weight;
+		long rem_load, max_load, moved_load;
+
+		/*
+		 * empty group
+		 */
+		if (!aggregate(tg, sd)->task_weight)
+			continue;
+
+		rem_load = rem_load_move * aggregate(tg, sd)->rq_weight;
+		rem_load /= aggregate(tg, sd)->load + 1;
+
+		this_weight = tg->cfs_rq[this_cpu]->task_weight;
+		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
 
-		this_cfs_rq = cpu_cfs_rq(busy_cfs_rq, this_cpu);
+		imbalance = (busiest_weight - this_weight) / 2;
 
-		imbalance = busy_cfs_rq->load.weight - this_cfs_rq->load.weight;
-		/* Don't pull if this_cfs_rq has more load than busy_cfs_rq */
-		if (imbalance <= 0)
+		if (imbalance < 0)
+			imbalance = busiest_weight;
+
+		max_load = max(rem_load, imbalance);
+		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
+				max_load, sd, idle, all_pinned, this_best_prio,
+				tg->cfs_rq[busiest_cpu]);
+
+		if (!moved_load)
 			continue;
 
-		/* Don't pull more than imbalance/2 */
-		imbalance /= 2;
-		maxload = min(rem_load_move, imbalance);
+		move_group_shares(tg, sd, busiest_cpu, this_cpu);
 
-		*this_best_prio = cfs_rq_best_prio(this_cfs_rq);
-#else
-# define maxload rem_load_move
-#endif
-		/*
-		 * pass busy_cfs_rq argument into
-		 * load_balance_[start|next]_fair iterators
-		 */
-		cfs_rq_iterator.arg = busy_cfs_rq;
-		rem_load_move -= balance_tasks(this_rq, this_cpu, busiest,
-					       maxload, sd, idle, all_pinned,
-					       this_best_prio,
-					       &cfs_rq_iterator);
+		moved_load *= aggregate(tg, sd)->load;
+		moved_load /= aggregate(tg, sd)->rq_weight + 1;
 
-		if (rem_load_move <= 0)
+		rem_load_move -= moved_load;
+		if (rem_load_move < 0)
 			break;
 	}
+	rcu_read_unlock();
 
 	return max_load_move - rem_load_move;
 }
+#else
+static unsigned long
+load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
+		  unsigned long max_load_move,
+		  struct sched_domain *sd, enum cpu_idle_type idle,
+		  int *all_pinned, int *this_best_prio)
+{
+	return __load_balance_fair(this_rq, this_cpu, busiest,
+			max_load_move, sd, idle, all_pinned,
+			this_best_prio, &busiest->cfs);
+}
+#endif
 
 static int
 move_one_task_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -671,6 +671,8 @@ static void enqueue_task_rt(struct rq *r
 		rt_se->timeout = 0;
 
 	enqueue_rt_entity(rt_se);
+
+	inc_cpu_load(rq, p->se.load.weight);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -679,6 +681,8 @@ static void dequeue_task_rt(struct rq *r
 
 	update_curr_rt(rq);
 	dequeue_rt_entity(rt_se);
+
+	dec_cpu_load(rq, p->se.load.weight);
 }
 
 /*

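As a hedged aside, the per-group imbalance heuristic in the load_balance_fair() hunk above can be sketched as a standalone helper. The function name and plain-C setting are illustrative, not kernel code:

```c
#include <assert.h>

/*
 * Standalone sketch of the imbalance computation from the
 * load_balance_fair() hunk above.  Pull half the task-weight
 * difference when busiest is heavier; when this cpu already holds
 * more of this group's weight, still allow moving up to the whole
 * busiest weight so the cross-group load target can be reached.
 */
static long group_imbalance(unsigned long busiest_weight,
			    unsigned long this_weight)
{
	long imbalance = ((long)busiest_weight - (long)this_weight) / 2;

	if (imbalance < 0)
		imbalance = busiest_weight;

	return imbalance;
}
```

Note the asymmetry: a negative half-difference is not clamped to zero but replaced by the full busiest weight, which is what lets a group still contribute toward the domain-wide balance target.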
-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 06/30] sched: sched_clock_cpu() based cpu_clock()
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (4 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 05/30] sched: revert revert of: fair-group: SMP-nice for group scheduling Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 07/30] sched: fix wakeup granularity and buddy granularity Peter Zijlstra
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-cpu_clock.patch --]
[-- Type: text/plain, Size: 3257 bytes --]

With sched_clock_cpu() being reasonably in sync between CPUs (at most 1 jiffy
difference), use it to provide cpu_clock().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c       |   80 ---------------------------------------------------
 kernel/sched_clock.c |   12 +++++++
 2 files changed, 12 insertions(+), 80 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -809,82 +807,6 @@ static inline u64 global_rt_runtime(void
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
 
-unsigned long long time_sync_thresh = 100000;
-
-static DEFINE_PER_CPU(unsigned long long, time_offset);
-static DEFINE_PER_CPU(unsigned long long, prev_cpu_time);
-
-/*
- * Global lock which we take every now and then to synchronize
- * the CPUs time. This method is not warp-safe, but it's good
- * enough to synchronize slowly diverging time sources and thus
- * it's good enough for tracing:
- */
-static DEFINE_SPINLOCK(time_sync_lock);
-static unsigned long long prev_global_time;
-
-static unsigned long long __sync_cpu_clock(unsigned long long time, int cpu)
-{
-	/*
-	 * We want this inlined, to not get tracer function calls
-	 * in this critical section:
-	 */
-	spin_acquire(&time_sync_lock.dep_map, 0, 0, _THIS_IP_);
-	__raw_spin_lock(&time_sync_lock.raw_lock);
-
-	if (time < prev_global_time) {
-		per_cpu(time_offset, cpu) += prev_global_time - time;
-		time = prev_global_time;
-	} else {
-		prev_global_time = time;
-	}
-
-	__raw_spin_unlock(&time_sync_lock.raw_lock);
-	spin_release(&time_sync_lock.dep_map, 1, _THIS_IP_);
-
-	return time;
-}
-
-static unsigned long long __cpu_clock(int cpu)
-{
-	unsigned long long now;
-
-	/*
-	 * Only call sched_clock() if the scheduler has already been
-	 * initialized (some code might call cpu_clock() very early):
-	 */
-	if (unlikely(!scheduler_running))
-		return 0;
-
-	now = sched_clock_cpu(cpu);
-
-	return now;
-}
-
-/*
- * For kernel-internal use: high-speed (but slightly incorrect) per-cpu
- * clock constructed from sched_clock():
- */
-unsigned long long cpu_clock(int cpu)
-{
-	unsigned long long prev_cpu_time, time, delta_time;
-	unsigned long flags;
-
-	local_irq_save(flags);
-	prev_cpu_time = per_cpu(prev_cpu_time, cpu);
-	time = __cpu_clock(cpu) + per_cpu(time_offset, cpu);
-	delta_time = time-prev_cpu_time;
-
-	if (unlikely(delta_time > time_sync_thresh)) {
-		time = __sync_cpu_clock(time, cpu);
-		per_cpu(prev_cpu_time, cpu) = time;
-	}
-	local_irq_restore(flags);
-
-	return time;
-}
-EXPORT_SYMBOL_GPL(cpu_clock);
-
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next)	do { } while (0)
 #endif
Index: linux-2.6/kernel/sched_clock.c
===================================================================
--- linux-2.6.orig/kernel/sched_clock.c
+++ linux-2.6/kernel/sched_clock.c
@@ -244,3 +244,15 @@ unsigned long long __attribute__((weak))
 {
 	return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
 }
+
+unsigned long long cpu_clock(int cpu)
+{
+	unsigned long long clock;
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	clock = sched_clock_cpu(cpu);
+	raw_local_irq_restore(flags);
+
+	return clock;
+}

-- 



* [PATCH 07/30] sched: fix wakeup granularity and buddy granularity
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (5 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 06/30] sched: sched_clock_cpu() based cpu_clock() Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 08/30] sched: add full schedstats to /proc/sched_debug Peter Zijlstra
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-buddy.patch --]
[-- Type: text/plain, Size: 2275 bytes --]

Uncouple buddy selection from wakeup granularity.

The idea is that buddies may run ahead as far as a normal task can; do this
by measuring a pair 'slice' just as we do for a normal task.

This means we can drop the wakeup_granularity back to 5ms.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    1 +
 kernel/sched_fair.c |   11 +++++------
 2 files changed, 6 insertions(+), 6 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2008-05-12 00:42:40.000000000 +0200
+++ linux-2.6/kernel/sched.c	2008-05-12 00:43:54.000000000 +0200
@@ -370,6 +370,7 @@
 
 	u64 exec_clock;
 	u64 min_vruntime;
+	u64 pair_start;
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c	2008-05-12 00:42:41.000000000 +0200
+++ linux-2.6/kernel/sched_fair.c	2008-05-12 00:47:05.000000000 +0200
@@ -63,13 +63,13 @@ unsigned int __read_mostly sysctl_sched_
 
 /*
  * SCHED_OTHER wake-up granularity.
- * (default: 10 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 5 msec * (1 + ilog(ncpus)), units: nanoseconds)
  *
  * This option delays the preemption effects of decoupled workloads
  * and reduces their over-scheduling. Synchronous workloads will still
  * have immediate wakeup/sleep latencies.
  */
-unsigned int sysctl_sched_wakeup_granularity = 10000000UL;
+unsigned int sysctl_sched_wakeup_granularity = 5000000UL;
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
@@ -787,17 +787,16 @@
 	se->prev_sum_exec_runtime = se->sum_exec_runtime;
 }
 
-static int
-wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
-
 static struct sched_entity *
 pick_next(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (!cfs_rq->next)
-		return se;
+	struct rq *rq = rq_of(cfs_rq);
+	u64 pair_slice = rq->clock - cfs_rq->pair_start;
 
-	if (wakeup_preempt_entity(cfs_rq->next, se) != 0)
+	if (!cfs_rq->next || pair_slice > sched_slice(cfs_rq, cfs_rq->next)) {
+		cfs_rq->pair_start = rq->clock;
 		return se;
+	}
 
 	return cfs_rq->next;
 }
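A hedged userspace sketch of the new pick_next() rule above; the struct, the ids and the fixed slice value are illustrative only (the kernel computes the slice with sched_slice() and reads rq->clock):

```c
#include <assert.h>

/*
 * Userspace sketch of the buddy pair-slice rule: keep picking the
 * buddy (cfs_rq->next) only while the pair is within one slice;
 * otherwise fall back to the leftmost entity and restart the pair
 * clock.  struct pair_state and the ids are made up for the example.
 */
struct pair_state {
	unsigned long long clock;	/* current time, ns */
	unsigned long long pair_start;	/* when this pair started */
	int next;			/* buddy id, 0 == no buddy */
};

static int pick_next_sketch(struct pair_state *s, int leftmost,
			    unsigned long long slice)
{
	unsigned long long pair_slice = s->clock - s->pair_start;

	if (!s->next || pair_slice > slice) {
		s->pair_start = s->clock;	/* restart the pair clock */
		return leftmost;
	}

	return s->next;				/* buddy still owns the slice */
}
```

The buddy is thus bounded by its own slice rather than by the wakeup granularity, which is what makes it safe to lower the granularity again.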

-- 



* [PATCH 08/30] sched: add full schedstats to /proc/sched_debug
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (6 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 07/30] sched: fix wakeup granularity and buddy granularity Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 09/30] sched: fix sched_domain aggregation Peter Zijlstra
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-debug-schedstat.patch --]
[-- Type: text/plain, Size: 1043 bytes --]

Show all the schedstats in /debug/sched_debug as well.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_debug.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -162,8 +162,23 @@ void print_cfs_rq(struct seq_file *m, in
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
 	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SCHEDSTATS
-	SEQ_printf(m, "  .%-30s: %d\n", "bkl_count",
-			rq->bkl_count);
+#define P(n) SEQ_printf(m, "  .%-30s: %d\n", #n, rq->n);
+
+	P(yld_exp_empty);
+	P(yld_act_empty);
+	P(yld_both_empty);
+	P(yld_count);
+
+	P(sched_switch);
+	P(sched_count);
+	P(sched_goidle);
+
+	P(ttwu_count);
+	P(ttwu_local);
+
+	P(bkl_count);
+
+#undef P
 #endif
 	SEQ_printf(m, "  .%-30s: %ld\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
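The one-line-per-stat trick above relies on the preprocessor's stringizing operator; a minimal standalone illustration, with a made-up struct and field names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Minimal illustration of the "#define P(n)" pattern in the hunk
 * above: #n stringizes the field name while st->n reads that same
 * field, so the printed label and the value can never get out of
 * sync.  struct stats and its fields are made up for the example.
 */
struct stats {
	int yld_count;
	int sched_count;
	int ttwu_count;
};

static int print_stats(char *buf, size_t len, const struct stats *st)
{
	int off = 0;
#define P(n) off += snprintf(buf + off, len - off, "  .%-30s: %d\n", #n, st->n)
	P(yld_count);
	P(sched_count);
	P(ttwu_count);
#undef P
	return off;
}
```

Each stat costs one `P(name);` line, and adding a new counter cannot introduce a label/field mismatch.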

-- 



* [PATCH 09/30] sched: fix sched_domain aggregation
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (7 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 08/30] sched: add full schedstats to /proc/sched_debug Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 10/30] sched: update aggregate when holding the RQs Peter Zijlstra
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fair-smp-lb.patch --]
[-- Type: text/plain, Size: 12770 bytes --]

Keeping the aggregate on the first CPU of the sched domain has two problems:
 - it can collide between different sched domains on different CPUs
 - it can slow things down because of the remote accesses
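A minimal sketch of the indexing change this patch makes (NCPUS, the struct names and the single "load" field are illustrative): before, every balancer in a domain wrote to the slot of sd->first_cpu; after, each balancing CPU indexes with its own id, so overlapping domains no longer clobber each other's intermediate results:

```c
#include <assert.h>

/* Sketch only; NCPUS and the structs are made up for the example. */
#define NCPUS 4

struct agg { unsigned long load; };
struct tg_sketch { struct agg agg[NCPUS]; };

/* old scheme: all balancers of a domain share first_cpu's slot */
static struct agg *aggregate_old(struct tg_sketch *tg, int first_cpu)
{
	return &tg->agg[first_cpu];
}

/* new scheme: the cpu doing the balance owns its own slot */
static struct agg *aggregate_new(struct tg_sketch *tg, int cpu)
{
	return &tg->agg[cpu];
}
```

The per-CPU slot also turns the remote write into a local one, addressing the second problem above.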

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    1 
 kernel/sched.c        |  113 +++++++++++++++++++++++---------------------------
 kernel/sched_fair.c   |   12 ++---
 3 files changed, 60 insertions(+), 66 deletions(-)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -766,7 +766,6 @@ struct sched_domain {
 	struct sched_domain *child;	/* bottom domain must be null terminated */
 	struct sched_group *groups;	/* the balancing groups of the domain */
 	cpumask_t span;			/* span of all CPUs in this domain */
-	int first_cpu;			/* cache of the first cpu in this domain */
 	unsigned long min_interval;	/* Minimum balance interval ms */
 	unsigned long max_interval;	/* Maximum balance interval ms */
 	unsigned int busy_factor;	/* less balancing by factor if busy */
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -1539,12 +1539,12 @@ static int task_hot(struct task_struct *
  */
 
 static inline struct aggregate_struct *
-aggregate(struct task_group *tg, struct sched_domain *sd)
+aggregate(struct task_group *tg, int cpu)
 {
-	return &tg->cfs_rq[sd->first_cpu]->aggregate;
+	return &tg->cfs_rq[cpu]->aggregate;
 }
 
-typedef void (*aggregate_func)(struct task_group *, struct sched_domain *);
+typedef void (*aggregate_func)(struct task_group *, int, struct sched_domain *);
 
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
@@ -1552,14 +1552,14 @@ typedef void (*aggregate_func)(struct ta
  */
 static
 void aggregate_walk_tree(aggregate_func down, aggregate_func up,
-			 struct sched_domain *sd)
+			 int cpu, struct sched_domain *sd)
 {
 	struct task_group *parent, *child;
 
 	rcu_read_lock();
 	parent = &root_task_group;
 down:
-	(*down)(parent, sd);
+	(*down)(parent, cpu, sd);
 	list_for_each_entry_rcu(child, &parent->children, siblings) {
 		parent = child;
 		goto down;
@@ -1567,7 +1567,7 @@ down:
 up:
 		continue;
 	}
-	(*up)(parent, sd);
+	(*up)(parent, cpu, sd);
 
 	child = parent;
 	parent = parent->parent;
@@ -1579,8 +1579,8 @@ up:
 /*
  * Calculate the aggregate runqueue weight.
  */
-static
-void aggregate_group_weight(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_group_weight(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
 	unsigned long rq_weight = 0;
 	unsigned long task_weight = 0;
@@ -1591,15 +1591,15 @@ void aggregate_group_weight(struct task_
 		task_weight += tg->cfs_rq[i]->task_weight;
 	}
 
-	aggregate(tg, sd)->rq_weight = rq_weight;
-	aggregate(tg, sd)->task_weight = task_weight;
+	aggregate(tg, cpu)->rq_weight = rq_weight;
+	aggregate(tg, cpu)->task_weight = task_weight;
 }
 
 /*
  * Compute the weight of this group on the given cpus.
  */
-static
-void aggregate_group_shares(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
 	unsigned long shares = 0;
 	int i;
@@ -1607,18 +1607,18 @@ void aggregate_group_shares(struct task_
 	for_each_cpu_mask(i, sd->span)
 		shares += tg->cfs_rq[i]->shares;
 
-	if ((!shares && aggregate(tg, sd)->rq_weight) || shares > tg->shares)
+	if ((!shares && aggregate(tg, cpu)->rq_weight) || shares > tg->shares)
 		shares = tg->shares;
 
-	aggregate(tg, sd)->shares = shares;
+	aggregate(tg, cpu)->shares = shares;
 }
 
 /*
  * Compute the load fraction assigned to this group, relies on the aggregate
  * weight and this group's parent's load, i.e. top-down.
  */
-static
-void aggregate_group_load(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_group_load(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
 	unsigned long load;
 
@@ -1630,17 +1630,17 @@ void aggregate_group_load(struct task_gr
 			load += cpu_rq(i)->load.weight;
 
 	} else {
-		load = aggregate(tg->parent, sd)->load;
+		load = aggregate(tg->parent, cpu)->load;
 
 		/*
 		 * shares is our weight in the parent's rq so
 		 * shares/parent->rq_weight gives our fraction of the load
 		 */
-		load *= aggregate(tg, sd)->shares;
-		load /= aggregate(tg->parent, sd)->rq_weight + 1;
+		load *= aggregate(tg, cpu)->shares;
+		load /= aggregate(tg->parent, cpu)->rq_weight + 1;
 	}
 
-	aggregate(tg, sd)->load = load;
+	aggregate(tg, cpu)->load = load;
 }
 
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
@@ -1649,8 +1649,8 @@ static void __set_se_shares(struct sched
  * Calculate and set the cpu's group shares.
  */
 static void
-__update_group_shares_cpu(struct task_group *tg, struct sched_domain *sd,
-			  int tcpu)
+__update_group_shares_cpu(struct task_group *tg, int cpu,
+			  struct sched_domain *sd, int tcpu)
 {
 	int boost = 0;
 	unsigned long shares;
@@ -1677,8 +1677,8 @@ __update_group_shares_cpu(struct task_gr
 	 *               \Sum rq_weight
 	 *
 	 */
-	shares = aggregate(tg, sd)->shares * rq_weight;
-	shares /= aggregate(tg, sd)->rq_weight + 1;
+	shares = aggregate(tg, cpu)->shares * rq_weight;
+	shares /= aggregate(tg, cpu)->rq_weight + 1;
 
 	/*
 	 * record the actual number of shares, not the boosted amount.
@@ -1698,15 +1698,15 @@ __update_group_shares_cpu(struct task_gr
  * task went to.
  */
 static void
-__move_group_shares(struct task_group *tg, struct sched_domain *sd,
+__move_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd,
 		    int scpu, int dcpu)
 {
 	unsigned long shares;
 
 	shares = tg->cfs_rq[scpu]->shares + tg->cfs_rq[dcpu]->shares;
 
-	__update_group_shares_cpu(tg, sd, scpu);
-	__update_group_shares_cpu(tg, sd, dcpu);
+	__update_group_shares_cpu(tg, cpu, sd, scpu);
+	__update_group_shares_cpu(tg, cpu, sd, dcpu);
 
 	/*
 	 * ensure we never loose shares due to rounding errors in the
@@ -1722,19 +1722,19 @@ __move_group_shares(struct task_group *t
  * we need to walk up the tree and change all shares until we hit the root.
  */
 static void
-move_group_shares(struct task_group *tg, struct sched_domain *sd,
+move_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd,
 		  int scpu, int dcpu)
 {
 	while (tg) {
-		__move_group_shares(tg, sd, scpu, dcpu);
+		__move_group_shares(tg, cpu, sd, scpu, dcpu);
 		tg = tg->parent;
 	}
 }
 
-static
-void aggregate_group_set_shares(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_group_set_shares(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	unsigned long shares = aggregate(tg, sd)->shares;
+	unsigned long shares = aggregate(tg, cpu)->shares;
 	int i;
 
 	for_each_cpu_mask(i, sd->span) {
@@ -1742,20 +1742,20 @@ void aggregate_group_set_shares(struct t
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		__update_group_shares_cpu(tg, sd, i);
+		__update_group_shares_cpu(tg, cpu, sd, i);
 		spin_unlock_irqrestore(&rq->lock, flags);
 	}
 
-	aggregate_group_shares(tg, sd);
+	aggregate_group_shares(tg, cpu, sd);
 
 	/*
 	 * ensure we never loose shares due to rounding errors in the
 	 * above redistribution.
 	 */
-	shares -= aggregate(tg, sd)->shares;
+	shares -= aggregate(tg, cpu)->shares;
 	if (shares) {
-		tg->cfs_rq[sd->first_cpu]->shares += shares;
-		aggregate(tg, sd)->shares += shares;
+		tg->cfs_rq[cpu]->shares += shares;
+		aggregate(tg, cpu)->shares += shares;
 	}
 }
 
@@ -1763,21 +1763,21 @@ void aggregate_group_set_shares(struct t
  * Calculate the accumulative weight and recursive load of each task group
  * while walking down the tree.
  */
-static
-void aggregate_get_down(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_get_down(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	aggregate_group_weight(tg, sd);
-	aggregate_group_shares(tg, sd);
-	aggregate_group_load(tg, sd);
+	aggregate_group_weight(tg, cpu, sd);
+	aggregate_group_shares(tg, cpu, sd);
+	aggregate_group_load(tg, cpu, sd);
 }
 
 /*
  * Rebalance the cpu shares while walking back up the tree.
  */
-static
-void aggregate_get_up(struct task_group *tg, struct sched_domain *sd)
+static void
+aggregate_get_up(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	aggregate_group_set_shares(tg, sd);
+	aggregate_group_set_shares(tg, cpu, sd);
 }
 
 static DEFINE_PER_CPU(spinlock_t, aggregate_lock);
@@ -1790,18 +1790,18 @@ static void __init init_aggregate(void)
 		spin_lock_init(&per_cpu(aggregate_lock, i));
 }
 
-static int get_aggregate(struct sched_domain *sd)
+static int get_aggregate(int cpu, struct sched_domain *sd)
 {
-	if (!spin_trylock(&per_cpu(aggregate_lock, sd->first_cpu)))
+	if (!spin_trylock(&per_cpu(aggregate_lock, cpu)))
 		return 0;
 
-	aggregate_walk_tree(aggregate_get_down, aggregate_get_up, sd);
+	aggregate_walk_tree(aggregate_get_down, aggregate_get_up, cpu, sd);
 	return 1;
 }
 
-static void put_aggregate(struct sched_domain *sd)
+static void put_aggregate(int cpu, struct sched_domain *sd)
 {
-	spin_unlock(&per_cpu(aggregate_lock, sd->first_cpu));
+	spin_unlock(&per_cpu(aggregate_lock, cpu));
 }
 
 static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
@@ -1815,12 +1815,12 @@ static inline void init_aggregate(void)
 {
 }
 
-static inline int get_aggregate(struct sched_domain *sd)
+static inline int get_aggregate(int cpu, struct sched_domain *sd)
 {
 	return 0;
 }
 
-static inline void put_aggregate(struct sched_domain *sd)
+static inline void put_aggregate(int cpu, struct sched_domain *sd)
 {
 }
 #endif
@@ -3604,7 +3604,7 @@ static int load_balance(int this_cpu, st
 
 	cpus_setall(*cpus);
 
-	unlock_aggregate = get_aggregate(sd);
+	unlock_aggregate = get_aggregate(this_cpu, sd);
 
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
@@ -3743,7 +3743,7 @@ out_one_pinned:
 		ld_moved = 0;
 out:
 	if (unlock_aggregate)
-		put_aggregate(sd);
+		put_aggregate(this_cpu, sd);
 	return ld_moved;
 }
 
@@ -7337,7 +7337,6 @@ static int __build_sched_domains(const c
 			SD_INIT(sd, ALLNODES);
 			set_domain_attribute(sd, attr);
 			sd->span = *cpu_map;
-			sd->first_cpu = first_cpu(sd->span);
 			cpu_to_allnodes_group(i, cpu_map, &sd->groups, tmpmask);
 			p = sd;
 			sd_allnodes = 1;
@@ -7348,7 +7347,6 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, NODE);
 		set_domain_attribute(sd, attr);
 		sched_domain_node_span(cpu_to_node(i), &sd->span);
-		sd->first_cpu = first_cpu(sd->span);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -7360,7 +7358,6 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, CPU);
 		set_domain_attribute(sd, attr);
 		sd->span = *nodemask;
-		sd->first_cpu = first_cpu(sd->span);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -7372,7 +7369,6 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, MC);
 		set_domain_attribute(sd, attr);
 		sd->span = cpu_coregroup_map(i);
-		sd->first_cpu = first_cpu(sd->span);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
@@ -7385,7 +7381,6 @@ static int __build_sched_domains(const c
 		SD_INIT(sd, SIBLING);
 		set_domain_attribute(sd, attr);
 		sd->span = per_cpu(cpu_sibling_map, i);
-		sd->first_cpu = first_cpu(sd->span);
 		cpus_and(sd->span, sd->span, *cpu_map);
 		sd->parent = p;
 		p->child = sd;
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -1403,11 +1403,11 @@ load_balance_fair(struct rq *this_rq, in
 		/*
 		 * empty group
 		 */
-		if (!aggregate(tg, sd)->task_weight)
+		if (!aggregate(tg, this_cpu)->task_weight)
 			continue;
 
-		rem_load = rem_load_move * aggregate(tg, sd)->rq_weight;
-		rem_load /= aggregate(tg, sd)->load + 1;
+		rem_load = rem_load_move * aggregate(tg, this_cpu)->rq_weight;
+		rem_load /= aggregate(tg, this_cpu)->load + 1;
 
 		this_weight = tg->cfs_rq[this_cpu]->task_weight;
 		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
@@ -1425,10 +1425,10 @@ load_balance_fair(struct rq *this_rq, in
 		if (!moved_load)
 			continue;
 
-		move_group_shares(tg, sd, busiest_cpu, this_cpu);
+		move_group_shares(tg, this_cpu, sd, busiest_cpu, this_cpu);
 
-		moved_load *= aggregate(tg, sd)->load;
-		moved_load /= aggregate(tg, sd)->rq_weight + 1;
+		moved_load *= aggregate(tg, this_cpu)->load;
+		moved_load /= aggregate(tg, this_cpu)->rq_weight + 1;
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 10/30] sched: update aggregate when holding the RQs
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (8 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 09/30] sched: fix sched_domain aggregation Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 11/30] sched: kill task_group balancing Peter Zijlstra
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-agg-update-move_tasks.patch --]
[-- Type: text/plain, Size: 1868 bytes --]

It was observed that in __update_group_shares_cpu() 

  rq_weight > aggregate()->rq_weight

This is caused by forks/wakeups occurring between the initial aggregate pass
and the locking of the RQs for load balancing. To avoid this situation,
partially re-do the aggregation once we have the RQs locked (which prevents
new tasks from appearing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1703,6 +1703,11 @@ aggregate_get_up(struct task_group *tg, 
 	aggregate_group_set_shares(tg, cpu, sd);
 }
 
+static void
+aggregate_get_nop(struct task_group *tg, int cpu, struct sched_domain *sd)
+{
+}
+
 static DEFINE_PER_CPU(spinlock_t, aggregate_lock);
 
 static void __init init_aggregate(void)
@@ -1722,6 +1727,11 @@ static int get_aggregate(int cpu, struct
 	return 1;
 }
 
+static void update_aggregate(int cpu, struct sched_domain *sd)
+{
+	aggregate_walk_tree(aggregate_get_down, aggregate_get_nop, cpu, sd);
+}
+
 static void put_aggregate(int cpu, struct sched_domain *sd)
 {
 	spin_unlock(&per_cpu(aggregate_lock, cpu));
@@ -1743,6 +1753,10 @@ static inline int get_aggregate(int cpu,
 	return 0;
 }
 
+static inline void update_aggregate(int cpu, struct sched_domain *sd)
+{
+}
+
 static inline void put_aggregate(int cpu, struct sched_domain *sd)
 {
 }
@@ -2180,6 +2194,12 @@ find_idlest_group(struct sched_domain *s
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
+	/*
+	 * now that we have both rqs locked the rq weight won't change
+	 * anymore - so update the stats.
+	 */
+	update_aggregate(this_cpu, sd);
+
 	do {
 		unsigned long load, avg_load;
 		int local_group;

-- 



* [PATCH 11/30] sched: kill task_group balancing
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (9 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 10/30] sched: update aggregate when holding the RQs Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 12/30] sched: dont micro manage share losses Peter Zijlstra
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-group-balance-fix.patch --]
[-- Type: text/plain, Size: 1718 bytes --]

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

The idea was to balance groups until we reached the global goal; however,
Vatsa rightly pointed out that we might never reach that goal this way -
hence take out this logic.

[ the initial rationale for this 'feature' was to promote maximal concurrency
  within a group - it does not, however, affect fairness ]

Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1427,9 +1427,7 @@ load_balance_fair(struct rq *this_rq, in
 
 	rcu_read_lock();
 	list_for_each_entry(tg, &task_groups, list) {
-		long imbalance;
-		unsigned long this_weight, busiest_weight;
-		long rem_load, max_load, moved_load;
+		long rem_load, moved_load;
 
 		/*
 		 * empty group
@@ -1440,17 +1438,8 @@ load_balance_fair(struct rq *this_rq, in
 		rem_load = rem_load_move * aggregate(tg, this_cpu)->rq_weight;
 		rem_load /= aggregate(tg, this_cpu)->load + 1;
 
-		this_weight = tg->cfs_rq[this_cpu]->task_weight;
-		busiest_weight = tg->cfs_rq[busiest_cpu]->task_weight;
-
-		imbalance = (busiest_weight - this_weight) / 2;
-
-		if (imbalance < 0)
-			imbalance = busiest_weight;
-
-		max_load = max(rem_load, imbalance);
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
-				max_load, sd, idle, all_pinned, this_best_prio,
+				rem_load, sd, idle, all_pinned, this_best_prio,
 				tg->cfs_rq[busiest_cpu]);
 
 		if (!moved_load)

-- 



* [PATCH 12/30] sched: dont micro manage share losses
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 11/30] sched: kill task_group balancing Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 13/30] sched: no need to aggregate task_weight Peter Zijlstra
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-aggregate-no-fixups.patch --]
[-- Type: text/plain, Size: 1999 bytes --]

We used to try to contain the loss of 'shares' by playing arithmetic
games. Replace that by noticing that at the top sched_domain we'll
always have the full weight in shares to distribute.
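The rule above can be written as a standalone arithmetic sketch (illustrative
only - the helper function, its name, and the `parent_balances` flag are
hypothetical stand-ins, not kernel code): when the parent domain does not
load-balance, the full group weight is handed out.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the aggregate-shares clamp after this patch.
 * summed_shares stands for the sum of per-cpu shares over the domain
 * span; parent_balances stands in for sd->parent existing with
 * SD_LOAD_BALANCE set.
 */
static unsigned long clamp_aggregate_shares(unsigned long summed_shares,
					    unsigned long rq_weight,
					    unsigned long tg_shares,
					    int parent_balances)
{
	unsigned long shares = summed_shares;

	/* no shares yet despite runnable weight, or overshoot: clamp */
	if ((!shares && rq_weight) || shares > tg_shares)
		shares = tg_shares;

	/* at the top sched_domain the full weight is always distributable */
	if (!parent_balances)
		shares = tg_shares;

	return shares;
}
```

With a load-balancing parent the summed value is kept; at the top of the
hierarchy the group's full `tg_shares` is used instead.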

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   26 +++-----------------------
 1 file changed, 3 insertions(+), 23 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1561,6 +1561,9 @@ aggregate_group_shares(struct task_group
 	if ((!shares && aggregate(tg, cpu)->rq_weight) || shares > tg->shares)
 		shares = tg->shares;
 
+	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
+		shares = tg->shares;
+
 	aggregate(tg, cpu)->shares = shares;
 }
 
@@ -1652,20 +1655,8 @@ static void
 __move_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd,
 		    int scpu, int dcpu)
 {
-	unsigned long shares;
-
-	shares = tg->cfs_rq[scpu]->shares + tg->cfs_rq[dcpu]->shares;
-
 	__update_group_shares_cpu(tg, cpu, sd, scpu);
 	__update_group_shares_cpu(tg, cpu, sd, dcpu);
-
-	/*
-	 * ensure we never loose shares due to rounding errors in the
-	 * above redistribution.
-	 */
-	shares -= tg->cfs_rq[scpu]->shares + tg->cfs_rq[dcpu]->shares;
-	if (shares)
-		tg->cfs_rq[dcpu]->shares += shares;
 }
 
 /*
@@ -1685,7 +1676,6 @@ move_group_shares(struct task_group *tg,
 static void
 aggregate_group_set_shares(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	unsigned long shares = aggregate(tg, cpu)->shares;
 	int i;
 
 	for_each_cpu_mask(i, sd->span) {
@@ -1698,16 +1688,6 @@ aggregate_group_set_shares(struct task_g
 	}
 
 	aggregate_group_shares(tg, cpu, sd);
-
-	/*
-	 * ensure we never loose shares due to rounding errors in the
-	 * above redistribution.
-	 */
-	shares -= aggregate(tg, cpu)->shares;
-	if (shares) {
-		tg->cfs_rq[cpu]->shares += shares;
-		aggregate(tg, cpu)->shares += shares;
-	}
 }
 
 /*

-- 



* [PATCH 13/30] sched: no need to aggregate task_weight
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 12/30] sched: dont micro manage share losses Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 14/30] sched: simplify the group load balancer Peter Zijlstra
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-agg-no-task_weight.patch --]
[-- Type: text/plain, Size: 2016 bytes --]

We only need to know the task_weight of the busiest rq; there is nothing to
do if there are no tasks there.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   16 +---------------
 kernel/sched_fair.c |    2 +-
 2 files changed, 2 insertions(+), 16 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -432,12 +432,6 @@ struct cfs_rq {
 		 * The sum of all runqueue weights within this span.
 		 */
 		unsigned long rq_weight;
-
-		/*
-		 * Weight contributed by tasks; this is the part we can
-		 * influence by moving tasks around.
-		 */
-		unsigned long task_weight;
 	} aggregate;
 #endif
 #endif
@@ -1483,10 +1477,6 @@ static int task_hot(struct task_struct *
  * rq_weight:
  *    Direct sum of all the cpu's their rq weight, e.g. A would get 3 while
  *    B would get 2.
- *
- * task_weight:
- *    Part of the rq_weight contributed by tasks; all groups except B would
- *    get 1, B gets 2.
  */
 
 static inline struct aggregate_struct *
@@ -1534,16 +1524,12 @@ static void
 aggregate_group_weight(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
 	unsigned long rq_weight = 0;
-	unsigned long task_weight = 0;
 	int i;
 
-	for_each_cpu_mask(i, sd->span) {
+	for_each_cpu_mask(i, sd->span)
 		rq_weight += tg->cfs_rq[i]->load.weight;
-		task_weight += tg->cfs_rq[i]->task_weight;
-	}
 
 	aggregate(tg, cpu)->rq_weight = rq_weight;
-	aggregate(tg, cpu)->task_weight = task_weight;
 }
 
 /*
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1418,7 +1418,7 @@ load_balance_fair(struct rq *this_rq, in
 		/*
 		 * empty group
 		 */
-		if (!aggregate(tg, this_cpu)->task_weight)
+		if (!tg->cfs_rq[busiest_cpu]->task_weight)
 			continue;
 
 		rem_load = rem_load_move * aggregate(tg, this_cpu)->rq_weight;

-- 



* [PATCH 14/30] sched: simplify the group load balancer
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 13/30] sched: no need to aggregate task_weight Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 15/30] sched: fix newidle smp group balancing Peter Zijlstra
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-per-cpu-load.patch --]
[-- Type: text/plain, Size: 13674 bytes --]

While thinking about the previous patch I realized that using per-domain
aggregate load values in load_balance_fair() is wrong; we should use the
load value for that CPU.

Since we no longer need per-domain hierarchical load values, we don't need
to store per-domain aggregate shares either, which greatly simplifies all
the math.

It basically falls apart into two separate computations:
 - per domain update of the shares
 - per CPU update of the hierarchical load

Also get rid of the move_group_shares() stuff - just re-compute the shares
again after a successful load balance.
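The per-CPU hierarchical load computed top-down in tg_load_down() reduces to
simple arithmetic; here is a hypothetical standalone sketch (not kernel code):
a child's h_load is its parent's h_load scaled by the child's share of the
parent runqueue weight.

```c
#include <assert.h>

/*
 * Hypothetical sketch of the top-down h_load step in tg_load_down():
 *
 *   load = parent->h_load * cfs_rq->shares
 *        / (parent->cfs_rq[cpu]->load.weight + 1);
 *
 * The root group's h_load is simply the cpu's runqueue weight.
 */
static unsigned long child_h_load(unsigned long parent_h_load,
				  unsigned long child_shares,
				  unsigned long parent_rq_weight)
{
	/* the +1 guards against division by zero, as in the patch */
	return parent_h_load * child_shares / (parent_rq_weight + 1);
}
```

So a group holding half its parent's runqueue weight inherits half the
parent's h_load, recursively down the hierarchy.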

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |  286 +++++++++++-----------------------------------------
 kernel/sched_fair.c |   15 +-
 2 files changed, 72 insertions(+), 229 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -406,34 +406,23 @@ struct cfs_rq {
 	struct task_group *tg;	/* group that "owns" this runqueue */
 
 #ifdef CONFIG_SMP
-	unsigned long task_weight;
-	unsigned long shares;
 	/*
-	 * We need space to build a sched_domain wide view of the full task
-	 * group tree, in order to avoid depending on dynamic memory allocation
-	 * during the load balancing we place this in the per cpu task group
-	 * hierarchy. This limits the load balancing to one instance per cpu,
-	 * but more should not be needed anyway.
+	 * the part of load.weight contributed by tasks
 	 */
-	struct aggregate_struct {
-		/*
-		 *   load = weight(cpus) * f(tg)
-		 *
-		 * Where f(tg) is the recursive weight fraction assigned to
-		 * this group.
-		 */
-		unsigned long load;
+	unsigned long task_weight;
 
-		/*
-		 * part of the group weight distributed to this span.
-		 */
-		unsigned long shares;
+	/*
+	 *   h_load = weight * f(tg)
+	 *
+	 * Where f(tg) is the recursive weight fraction assigned to
+	 * this group.
+	 */
+	unsigned long h_load;
 
-		/*
-		 * The sum of all runqueue weights within this span.
-		 */
-		unsigned long rq_weight;
-	} aggregate;
+	/*
+	 * this cpu's part of tg->shares
+	 */
+	unsigned long shares;
 #endif
 #endif
 };
@@ -1443,47 +1432,14 @@ static int task_hot(struct task_struct *
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-/*
- * Group load balancing.
- *
- * We calculate a few balance domain wide aggregate numbers; load and weight.
- * Given the pictures below, and assuming each item has equal weight:
- *
- *         root          1 - thread
- *         / | \         A - group
- *        A  1  B
- *       /|\   / \
- *      C 2 D 3   4
- *      |   |
- *      5   6
- *
- * load:
- *    A and B get 1/3-rd of the total load. C and D get 1/3-rd of A's 1/3-rd,
- *    which equals 1/9-th of the total load.
- *
- * shares:
- *    The weight of this group on the selected cpus.
- *
- * rq_weight:
- *    Direct sum of all the cpu's their rq weight, e.g. A would get 3 while
- *    B would get 2.
- */
-
-static inline struct aggregate_struct *
-aggregate(struct task_group *tg, int cpu)
-{
-	return &tg->cfs_rq[cpu]->aggregate;
-}
-
-typedef void (*aggregate_func)(struct task_group *, int, struct sched_domain *);
+typedef void (*tg_visitor)(struct task_group *, int, struct sched_domain *);
 
 /*
  * Iterate the full tree, calling @down when first entering a node and @up when
  * leaving it for the final time.
  */
-static
-void aggregate_walk_tree(aggregate_func down, aggregate_func up,
-			 int cpu, struct sched_domain *sd)
+static void
+walk_tg_tree(tg_visitor down, tg_visitor up, int cpu, struct sched_domain *sd)
 {
 	struct task_group *parent, *child;
 
@@ -1507,72 +1463,6 @@ up:
 	rcu_read_unlock();
 }
 
-/*
- * Calculate the aggregate runqueue weight.
- */
-static void
-aggregate_group_weight(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-	unsigned long rq_weight = 0;
-	int i;
-
-	for_each_cpu_mask(i, sd->span)
-		rq_weight += tg->cfs_rq[i]->load.weight;
-
-	aggregate(tg, cpu)->rq_weight = rq_weight;
-}
-
-/*
- * Compute the weight of this group on the given cpus.
- */
-static void
-aggregate_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-	unsigned long shares = 0;
-	int i;
-
-	for_each_cpu_mask(i, sd->span)
-		shares += tg->cfs_rq[i]->shares;
-
-	if ((!shares && aggregate(tg, cpu)->rq_weight) || shares > tg->shares)
-		shares = tg->shares;
-
-	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
-		shares = tg->shares;
-
-	aggregate(tg, cpu)->shares = shares;
-}
-
-/*
- * Compute the load fraction assigned to this group, relies on the aggregate
- * weight and this group's parent's load, i.e. top-down.
- */
-static void
-aggregate_group_load(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-	unsigned long load;
-
-	if (!tg->parent) {
-		int i;
-
-		load = 0;
-		for_each_cpu_mask(i, sd->span)
-			load += cpu_rq(i)->load.weight;
-
-	} else {
-		load = aggregate(tg->parent, cpu)->load;
-
-		/*
-		 * shares is our weight in the parent's rq so
-		 * shares/parent->rq_weight gives our fraction of the load
-		 */
-		load *= aggregate(tg, cpu)->shares;
-		load /= aggregate(tg->parent, cpu)->rq_weight + 1;
-	}
-
-	aggregate(tg, cpu)->load = load;
-}
-
 static void __set_se_shares(struct sched_entity *se, unsigned long shares);
 
 /*
@@ -1580,16 +1470,16 @@ static void __set_se_shares(struct sched
  */
 static void
 __update_group_shares_cpu(struct task_group *tg, int cpu,
-			  struct sched_domain *sd, int tcpu)
+			  unsigned long sd_shares, unsigned long sd_rq_weight)
 {
 	int boost = 0;
 	unsigned long shares;
 	unsigned long rq_weight;
 
-	if (!tg->se[tcpu])
+	if (!tg->se[cpu])
 		return;
 
-	rq_weight = tg->cfs_rq[tcpu]->load.weight;
+	rq_weight = tg->cfs_rq[cpu]->load.weight;
 
 	/*
 	 * If there are currently no tasks on the cpu pretend there is one of
@@ -1601,124 +1491,97 @@ __update_group_shares_cpu(struct task_gr
 		rq_weight = NICE_0_LOAD;
 	}
 
+	if (unlikely(rq_weight > sd_rq_weight))
+		rq_weight = sd_rq_weight;
+
 	/*
 	 *           \Sum shares * rq_weight
 	 * shares =  -----------------------
 	 *               \Sum rq_weight
 	 *
 	 */
-	shares = aggregate(tg, cpu)->shares * rq_weight;
-	shares /= aggregate(tg, cpu)->rq_weight + 1;
+	shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
 
 	/*
 	 * record the actual number of shares, not the boosted amount.
 	 */
-	tg->cfs_rq[tcpu]->shares = boost ? 0 : shares;
+	tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
 
 	if (shares < MIN_SHARES)
 		shares = MIN_SHARES;
 	else if (shares > MAX_SHARES)
 		shares = MAX_SHARES;
 
-	__set_se_shares(tg->se[tcpu], shares);
+	__set_se_shares(tg->se[cpu], shares);
 }
 
 /*
- * Re-adjust the weights on the cpu the task came from and on the cpu the
- * task went to.
+ * Re-compute the task group their per cpu shares over the given domain.
+ * This needs to be done in a bottom-up fashion because the rq weight of a
+ * parent group depends on the shares of its child groups.
  */
 static void
-__move_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd,
-		    int scpu, int dcpu)
+tg_shares_up(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	__update_group_shares_cpu(tg, cpu, sd, scpu);
-	__update_group_shares_cpu(tg, cpu, sd, dcpu);
-}
+	unsigned long rq_weight = 0;
+	unsigned long shares = 0;
+	int i;
 
-/*
- * Because changing a group's shares changes the weight of the super-group
- * we need to walk up the tree and change all shares until we hit the root.
- */
-static void
-move_group_shares(struct task_group *tg, int cpu, struct sched_domain *sd,
-		  int scpu, int dcpu)
-{
-	while (tg) {
-		__move_group_shares(tg, cpu, sd, scpu, dcpu);
-		tg = tg->parent;
+	for_each_cpu_mask(i, sd->span) {
+		rq_weight += tg->cfs_rq[i]->load.weight;
+		shares += tg->cfs_rq[i]->shares;
 	}
-}
 
-static void
-aggregate_group_set_shares(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-	int i;
+	if ((!shares && rq_weight) || shares > tg->shares)
+		shares = tg->shares;
+
+	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
+		shares = tg->shares;
 
 	for_each_cpu_mask(i, sd->span) {
 		struct rq *rq = cpu_rq(i);
 		unsigned long flags;
 
 		spin_lock_irqsave(&rq->lock, flags);
-		__update_group_shares_cpu(tg, cpu, sd, i);
+		__update_group_shares_cpu(tg, i, shares, rq_weight);
 		spin_unlock_irqrestore(&rq->lock, flags);
 	}
-
-	aggregate_group_shares(tg, cpu, sd);
 }
 
 /*
- * Calculate the accumulative weight and recursive load of each task group
- * while walking down the tree.
+ * Compute the cpu's hierarchical load factor for each task group.
+ * This needs to be done in a top-down fashion because the load of a child
+ * group is a fraction of its parents load.
  */
 static void
-aggregate_get_down(struct task_group *tg, int cpu, struct sched_domain *sd)
+tg_load_down(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	aggregate_group_weight(tg, cpu, sd);
-	aggregate_group_shares(tg, cpu, sd);
-	aggregate_group_load(tg, cpu, sd);
-}
-
-/*
- * Rebalance the cpu shares while walking back up the tree.
- */
-static void
-aggregate_get_up(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-	aggregate_group_set_shares(tg, cpu, sd);
-}
-
-static void
-aggregate_get_nop(struct task_group *tg, int cpu, struct sched_domain *sd)
-{
-}
-
-static DEFINE_PER_CPU(spinlock_t, aggregate_lock);
+	unsigned long load;
 
-static void __init init_aggregate(void)
-{
-	int i;
+	if (!tg->parent) {
+		load = cpu_rq(cpu)->load.weight;
+	} else {
+		load = tg->parent->cfs_rq[cpu]->h_load;
+		load *= tg->cfs_rq[cpu]->shares;
+		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
+	}
 
-	for_each_possible_cpu(i)
-		spin_lock_init(&per_cpu(aggregate_lock, i));
+	tg->cfs_rq[cpu]->h_load = load;
 }
 
-static int get_aggregate(int cpu, struct sched_domain *sd)
+static void
+tg_nop(struct task_group *tg, int cpu, struct sched_domain *sd)
 {
-	if (!spin_trylock(&per_cpu(aggregate_lock, cpu)))
-		return 0;
-
-	aggregate_walk_tree(aggregate_get_down, aggregate_get_up, cpu, sd);
-	return 1;
 }
 
-static void update_aggregate(int cpu, struct sched_domain *sd)
+static void update_shares(struct sched_domain *sd)
 {
-	aggregate_walk_tree(aggregate_get_down, aggregate_get_nop, cpu, sd);
+	walk_tg_tree(tg_nop, tg_shares_up, 0, sd);
 }
 
-static void put_aggregate(int cpu, struct sched_domain *sd)
+static void update_h_load(int cpu)
 {
-	spin_unlock(&per_cpu(aggregate_lock, cpu));
+	walk_tg_tree(tg_load_down, tg_nop, cpu, NULL);
 }
 
 static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
@@ -1728,22 +1591,10 @@ static void cfs_rq_set_shares(struct cfs
 
 #else
 
-static inline void init_aggregate(void)
-{
-}
-
-static inline int get_aggregate(int cpu, struct sched_domain *sd)
-{
-	return 0;
-}
-
-static inline void update_aggregate(int cpu, struct sched_domain *sd)
+static inline void update_shares(struct sched_domain *sd)
 {
 }
 
-static inline void put_aggregate(int cpu, struct sched_domain *sd)
-{
-}
 #endif
 
 #endif
@@ -2172,12 +2023,6 @@ find_idlest_group(struct sched_domain *s
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
-	/*
-	 * now that we have both rqs locked the rq weight won't change
-	 * anymore - so update the stats.
-	 */
-	update_aggregate(this_cpu, sd);
-
 	do {
 		unsigned long load, avg_load;
 		int local_group;
@@ -3521,12 +3366,9 @@ static int load_balance(int this_cpu, st
 	unsigned long imbalance;
 	struct rq *busiest;
 	unsigned long flags;
-	int unlock_aggregate;
 
 	cpus_setall(*cpus);
 
-	unlock_aggregate = get_aggregate(this_cpu, sd);
-
 	/*
 	 * When power savings policy is enabled for the parent domain, idle
 	 * sibling can pick up load irrespective of busy siblings. In this case,
@@ -3540,6 +3382,7 @@ static int load_balance(int this_cpu, st
 	schedstat_inc(sd, lb_count[idle]);
 
 redo:
+	update_shares(sd);
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   cpus, balance);
 
@@ -3663,8 +3506,8 @@ out_one_pinned:
 	else
 		ld_moved = 0;
 out:
-	if (unlock_aggregate)
-		put_aggregate(this_cpu, sd);
+	if (ld_moved)
+		update_shares(sd);
 	return ld_moved;
 }
 
@@ -8043,7 +7886,6 @@ void __init sched_init(void)
 	}
 
 #ifdef CONFIG_SMP
-	init_aggregate();
 	init_defrootdomain();
 #endif
 
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1421,17 +1421,20 @@ load_balance_fair(struct rq *this_rq, in
 	struct task_group *tg;
 
 	rcu_read_lock();
+	update_h_load(busiest_cpu);
+
 	list_for_each_entry(tg, &task_groups, list) {
+		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
 		long rem_load, moved_load;
 
 		/*
 		 * empty group
 		 */
-		if (!tg->cfs_rq[busiest_cpu]->task_weight)
+		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = rem_load_move * aggregate(tg, this_cpu)->rq_weight;
-		rem_load /= aggregate(tg, this_cpu)->load + 1;
+		rem_load = rem_load_move * busiest_cfs_rq->load.weight;
+		rem_load /= busiest_cfs_rq->h_load + 1;
 
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
@@ -1440,10 +1443,8 @@ load_balance_fair(struct rq *this_rq, in
 		if (!moved_load)
 			continue;
 
-		move_group_shares(tg, this_cpu, sd, busiest_cpu, this_cpu);
-
-		moved_load *= aggregate(tg, this_cpu)->load;
-		moved_load /= aggregate(tg, this_cpu)->rq_weight + 1;
+		moved_load *= busiest_cfs_rq->h_load;
+		moved_load /= busiest_cfs_rq->load.weight + 1;
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)

-- 



* [PATCH 15/30] sched: fix newidle smp group balancing
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 14/30] sched: simplify the group load balancer Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 16/30] sched: fix sched_balance_self() " Peter Zijlstra
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-newidle.patch --]
[-- Type: text/plain, Size: 1334 bytes --]

Re-compute the shares on newidle balancing so that we can make decisions
based on recent data.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1579,6 +1579,13 @@ static void update_shares(struct sched_d
 	walk_tg_tree(tg_nop, tg_shares_up, 0, sd);
 }
 
+static void update_shares_locked(struct rq *rq, struct sched_domain *sd)
+{
+	spin_unlock(&rq->lock);
+	update_shares(sd);
+	spin_lock(&rq->lock);
+}
+
 static void update_h_load(int cpu)
 {
 	walk_tg_tree(tg_load_down, tg_nop, cpu, NULL);
@@ -1595,6 +1602,10 @@ static inline void update_shares(struct 
 {
 }
 
+static inline void update_shares_locked(struct rq *rq, struct sched_domain *sd)
+{
+}
+
 #endif
 
 #endif
@@ -3543,6 +3554,7 @@ load_balance_newidle(int this_cpu, struc
 
 	schedstat_inc(sd, lb_count[CPU_NEWLY_IDLE]);
 redo:
+	update_shares_locked(this_rq, sd);
 	group = find_busiest_group(sd, this_cpu, &imbalance, CPU_NEWLY_IDLE,
 				   &sd_idle, cpus, NULL);
 	if (!group) {
@@ -3586,6 +3598,7 @@ redo:
 	} else
 		sd->nr_balance_failed = 0;
 
+	update_shares_locked(this_rq, sd);
 	return ld_moved;
 
 out_balanced:

-- 



* [PATCH 16/30] sched: fix sched_balance_self() smp group balancing
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 15/30] sched: fix newidle smp group balancing Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 17/30] sched: persistent average load per task Peter Zijlstra
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-sched_balance_self.patch --]
[-- Type: text/plain, Size: 555 bytes --]

Finding the idlest cpu is more accurate when done with updated shares.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2171,6 +2171,9 @@ static int sched_balance_self(int cpu, i
 			sd = tmp;
 	}
 
+	if (sd)
+		update_shares(sd);
+
 	while (sd) {
 		cpumask_t span, tmpmask;
 		struct sched_group *group;

-- 



* [PATCH 17/30] sched: persistent average load per task
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 16/30] sched: fix sched_balance_self() " Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 18/30] sched: hierarchical load vs affine wakeups Peter Zijlstra
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-avg_load_per_task.patch --]
[-- Type: text/plain, Size: 1815 bytes --]

Remove the fall-back to SCHED_LOAD_SCALE by remembering the previous value of
cpu_avg_load_per_task() - this is useful because, under the hierarchical
group model, task weights can be much smaller than SCHED_LOAD_SCALE.
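The remembered average can be sketched as follows (a hypothetical standalone
version using a static variable in place of rq->avg_load_per_task; not the
kernel code):

```c
#include <assert.h>

/*
 * Sketch of the persistent average: the cached value is refreshed
 * only while tasks are runnable, so an idle cpu keeps reporting the
 * last observed average instead of falling back to a fixed
 * SCHED_LOAD_SCALE. avg_cached stands in for rq->avg_load_per_task.
 */
static unsigned long avg_cached;

static unsigned long cpu_avg_load_per_task_sketch(unsigned long load_weight,
						  unsigned long nr_running)
{
	if (nr_running)
		avg_cached = load_weight / nr_running;

	return avg_cached;
}
```

Once the rq goes idle, callers keep seeing the last computed average rather
than an arbitrary constant that may dwarf the real per-task group weights.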

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -554,6 +554,8 @@ struct rq {
 	int cpu;
 	int online;
 
+	unsigned long avg_load_per_task;
+
 	struct task_struct *migration_thread;
 	struct list_head migration_queue;
 #endif
@@ -1427,9 +1429,18 @@ static inline void dec_cpu_load(struct r
 #ifdef CONFIG_SMP
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long cpu_avg_load_per_task(int cpu);
 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
 
+static unsigned long cpu_avg_load_per_task(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->nr_running)
+		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
+
+	return rq->avg_load_per_task;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 typedef void (*tg_visitor)(struct task_group *, int, struct sched_domain *);
@@ -2011,18 +2022,6 @@ static unsigned long target_load(int cpu
 }
 
 /*
- * Return the average load per task on the cpu's run queue
- */
-static unsigned long cpu_avg_load_per_task(int cpu)
-{
-	struct rq *rq = cpu_rq(cpu);
-	unsigned long total = weighted_cpuload(cpu);
-	unsigned long n = rq->nr_running;
-
-	return n ? total / n : SCHED_LOAD_SCALE;
-}
-
-/*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
  */

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 18/30] sched: hierarchical load vs affine wakeups
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 17/30] sched: persistent average load per task Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 19/30] sched: hierarchical load vs find_busiest_group Peter Zijlstra
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-wake_affine.patch --]
[-- Type: text/plain, Size: 1506 bytes --]

With hierarchical grouping we can't just compare task weight to rq weight - we
need to scale the weight appropriately.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1071,6 +1071,25 @@ static inline int wake_idle(int cpu, str
 
 static const struct sched_class fair_sched_class;
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static unsigned long task_h_load(struct task_struct *p)
+{
+	unsigned long h_load = p->se.load.weight;
+	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+	update_h_load(task_cpu(p));
+
+	h_load = calc_delta_mine(h_load, cfs_rq->h_load, &cfs_rq->load);
+
+	return h_load;
+}
+#else
+static unsigned long task_h_load(struct task_struct *p)
+{
+	return p->se.load.weight;
+}
+#endif
+
 static int
 wake_affine(struct rq *rq, struct sched_domain *this_sd, struct rq *this_rq,
 	    struct task_struct *p, int prev_cpu, int this_cpu, int sync,
@@ -1091,9 +1110,9 @@ wake_affine(struct rq *rq, struct sched_
 	 * of the current CPU:
 	 */
 	if (sync)
-		tl -= current->se.load.weight;
+		tl -= task_h_load(current);
 
-	balanced = 100*(tl + p->se.load.weight) <= imbalance*load;
+	balanced = 100*(tl + task_h_load(p)) <= imbalance*load;
 
 	/*
 	 * If the currently running task will sleep within

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 19/30] sched: hierarchical load vs find_busiest_group
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 18/30] sched: hierarchical load vs affine wakeups Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 20/30] sched: fix load scaling in group balancing Peter Zijlstra
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-find_busiest_group.patch --]
[-- Type: text/plain, Size: 2904 bytes --]

find_busiest_group() assumes task weights are in the NICE_0_LOAD range.
Hierarchical task groups break this assumption - fix it by using the average
task weight instead, which adapts to the actual situation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c |   26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3111,6 +3111,7 @@ find_busiest_group(struct sched_domain *
 	max_load = this_load = total_load = total_pwr = 0;
 	busiest_load_per_task = busiest_nr_running = 0;
 	this_load_per_task = this_nr_running = 0;
+
 	if (idle == CPU_NOT_IDLE)
 		load_idx = sd->busy_idx;
 	else if (idle == CPU_NEWLY_IDLE)
@@ -3125,6 +3126,8 @@ find_busiest_group(struct sched_domain *
 		int __group_imb = 0;
 		unsigned int balance_cpu = -1, first_idle_cpu = 0;
 		unsigned long sum_nr_running, sum_weighted_load;
+		unsigned long sum_avg_load_per_task;
+		unsigned long avg_load_per_task;
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
@@ -3133,6 +3136,8 @@ find_busiest_group(struct sched_domain *
 
 		/* Tally up the load of all CPUs in the group */
 		sum_weighted_load = sum_nr_running = avg_load = 0;
+		sum_avg_load_per_task = avg_load_per_task = 0;
+
 		max_cpu_load = 0;
 		min_cpu_load = ~0UL;
 
@@ -3166,6 +3171,8 @@ find_busiest_group(struct sched_domain *
 			avg_load += load;
 			sum_nr_running += rq->nr_running;
 			sum_weighted_load += weighted_cpuload(i);
+
+			sum_avg_load_per_task += cpu_avg_load_per_task(i);
 		}
 
 		/*
@@ -3187,7 +3194,20 @@ find_busiest_group(struct sched_domain *
 		avg_load = sg_div_cpu_power(group,
 				avg_load * SCHED_LOAD_SCALE);
 
-		if ((max_cpu_load - min_cpu_load) > SCHED_LOAD_SCALE)
+
+		/*
+		 * Consider the group unbalanced when the imbalance is larger
+		 * than the average weight of two tasks.
+		 *
+		 * APZ: with cgroup the avg task weight can vary wildly and
+		 *      might not be a suitable number - should we keep a
+		 *      normalized nr_running number somewhere that negates
+		 *      the hierarchy?
+		 */
+		avg_load_per_task = sg_div_cpu_power(group,
+				sum_avg_load_per_task * SCHED_LOAD_SCALE);
+
+		if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
 			__group_imb = 1;
 
 		group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
@@ -3328,9 +3348,9 @@ small_imbalance:
 			if (busiest_load_per_task > this_load_per_task)
 				imbn = 1;
 		} else
-			this_load_per_task = SCHED_LOAD_SCALE;
+			this_load_per_task = cpu_avg_load_per_task(this_cpu);
 
-		if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >=
+		if (max_load - this_load + 2*busiest_load_per_task >=
 					busiest_load_per_task * imbn) {
 			*imbalance = busiest_load_per_task;
 			return busiest;

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 20/30] sched: fix load scaling in group balancing
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 19/30] sched: hierarchical load vs find_busiest_group Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 21/30] sched: fix task_h_load() Peter Zijlstra
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-load_balance-h_load.patch --]
[-- Type: text/plain, Size: 1673 bytes --]

Doing the load balance will change cfs_rq->load.weight (that's the whole point),
but since that weight is part of the scale factor, we would scale back by a
different amount than we scaled up by. Fix this by snapshotting the weight and
h_load before balancing.

A weight that shrinks during the balance inflates moved_load, which causes
balancing to stop too soon.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched_fair.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1449,6 +1449,8 @@ load_balance_fair(struct rq *this_rq, in
 
 	list_for_each_entry(tg, &task_groups, list) {
 		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
+		unsigned long busiest_h_load = busiest_cfs_rq->h_load;
+		unsigned long busiest_weight = busiest_cfs_rq->load.weight;
 		long rem_load, moved_load;
 
 		/*
@@ -1457,8 +1459,8 @@ load_balance_fair(struct rq *this_rq, in
 		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = rem_load_move * busiest_cfs_rq->load.weight;
-		rem_load /= busiest_cfs_rq->h_load + 1;
+		rem_load = rem_load_move * busiest_weight;
+		rem_load /= busiest_h_load + 1;
 
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
@@ -1467,8 +1469,8 @@ load_balance_fair(struct rq *this_rq, in
 		if (!moved_load)
 			continue;
 
-		moved_load *= busiest_cfs_rq->h_load;
-		moved_load /= busiest_cfs_rq->load.weight + 1;
+		moved_load *= busiest_h_load;
+		moved_load /= busiest_weight + 1;
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 21/30] sched: fix task_h_load()
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 20/30] sched: fix load scaling in group balancing Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 22/30] sched: remove prio preference from balance decisions Peter Zijlstra
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-task_h_load.patch --]
[-- Type: text/plain, Size: 2816 bytes --]

Currently task_h_load() computes the load of a task and uses that to either
subtract it from the total, or add to it.

However, removing or adding a task need not have any effect on the total load
at all. Imagine adding a task to a group that is local to one cpu - in that
case the total load of that cpu is unaffected.

So properly compute addition/removal:

 s_i = S * rw_i / \Sum_j rw_j
 s'_i = S * (rw_i + wl) / (\Sum_j rw_j + wg)

then s'_i - s_i gives the change in load.

Where s_i is the shares for cpu i, S the group weight, rw_i the runqueue weight
for that cpu, wl the weight we add (subtract) and wg the weight contribution to
the runqueue.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched_fair.c |   49 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 40 insertions(+), 9 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1071,22 +1071,53 @@ static inline int wake_idle(int cpu, str
 static const struct sched_class fair_sched_class;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long effective_load(struct task_group *tg, long wl, int cpu)
 {
-	unsigned long h_load = p->se.load.weight;
-	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+	struct sched_entity *se = tg->se[cpu];
+	long wg = wl;
 
-	update_h_load(task_cpu(p));
+	for_each_sched_entity(se) {
+#define D(n) (likely(n) ? (n) : 1)
+
+		long S, Srw, rw, s, sn;
+
+		S = se->my_q->tg->shares;
+		s = se->my_q->shares;
+		rw = se->my_q->load.weight;
 
-	h_load = calc_delta_mine(h_load, cfs_rq->h_load, &cfs_rq->load);
+		Srw = S * rw / D(s);
+		sn = S * (rw + wl) / D(Srw + wg);
+
+		wl = sn - s;
+		wg = 0;
+#undef D
+	}
 
-	return h_load;
+	return wl;
 }
+
+static unsigned long task_load_sub(struct task_struct *p)
+{
+	return effective_load(task_group(p), -(long)p->se.load.weight, task_cpu(p));
+}
+
+static unsigned long task_load_add(struct task_struct *p, int cpu)
+{
+	return effective_load(task_group(p), p->se.load.weight, cpu);
+}
+
 #else
-static unsigned long task_h_load(struct task_struct *p)
+
+static unsigned long task_load_sub(struct task_struct *p)
+{
+	return -p->se.load.weight;
+}
+
+static unsigned long task_load_add(struct task_struct *p, int cpu)
 {
 	return p->se.load.weight;
 }
+
 #endif
 
 static int
@@ -1109,9 +1140,9 @@ wake_affine(struct rq *rq, struct sched_
 	 * of the current CPU:
 	 */
 	if (sync)
-		tl -= task_h_load(current);
+		tl += task_load_sub(current);
 
-	balanced = 100*(tl + task_h_load(p)) <= imbalance*load;
+	balanced = 100*(tl + task_load_add(p, this_cpu)) <= imbalance*load;
 
 	/*
 	 * If the currently running task will sleep within

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 22/30] sched: remove prio preference from balance decisions
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 21/30] sched: fix task_h_load() Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 23/30] sched: optimize effective_load() Peter Zijlstra
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-skip_for_load2.patch --]
[-- Type: text/plain, Size: 1414 bytes --]

Priority loses much of its meaning in a hierarchical context, so don't
use it in balance decisions.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched.c |   12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2884,7 +2884,7 @@ balance_tasks(struct rq *this_rq, int th
 	      enum cpu_idle_type idle, int *all_pinned,
 	      int *this_best_prio, struct rq_iterator *iterator)
 {
-	int loops = 0, pulled = 0, pinned = 0, skip_for_load;
+	int loops = 0, pulled = 0, pinned = 0;
 	struct task_struct *p;
 	long rem_load_move = max_load_move;
 
@@ -2900,14 +2900,8 @@ balance_tasks(struct rq *this_rq, int th
 next:
 	if (!p || loops++ > sysctl_sched_nr_migrate)
 		goto out;
-	/*
-	 * To help distribute high priority tasks across CPUs we don't
-	 * skip a task if it will be the highest priority task (i.e. smallest
-	 * prio value) on its new queue regardless of its load weight
-	 */
-	skip_for_load = (p->se.load.weight >> 1) > rem_load_move +
-							 SCHED_LOAD_SCALE_FUZZ;
-	if ((skip_for_load && p->prio >= *this_best_prio) ||
+
+	if ((p->se.load.weight >> 1) > rem_load_move ||
 	    !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) {
 		p = iterator->next(iterator->arg);
 		goto next;

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 23/30] sched: optimize effective_load()
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 22/30] sched: remove prio preference from balance decisions Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 24/30] sched: disable source/target_load bias Peter Zijlstra
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-simplyfy-effective-load.patch --]
[-- Type: text/plain, Size: 1084 bytes --]

s_i = S * rw_i / \Sum_j rw_j

 -> \Sum_j rw_j = S * rw_i / s_i

 -> s'_i = S * (rw_i + w) / (\Sum_j rw_j + w)

delta s = s' - s = S * (rw + w) / ((S * rw / s) + w) - s
        = s * (S * (rw + w) / (S * rw + s * w) - 1)

 a = S*(rw+w), b = S*rw + s*w

delta s = s * (a-b) / b

IOW, trade one divide for two multiplies

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1079,16 +1079,16 @@ static unsigned long effective_load(stru
 	for_each_sched_entity(se) {
 #define D(n) (likely(n) ? (n) : 1)
 
-		long S, Srw, rw, s, sn;
+		long S, rw, s, a, b;
 
 		S = se->my_q->tg->shares;
 		s = se->my_q->shares;
 		rw = se->my_q->load.weight;
 
-		Srw = S * rw / D(s);
-		sn = S * (rw + wl) / D(Srw + wg);
+		a = S*(rw + wl);
+		b = S*rw + s*wg;
 
-		wl = sn - s;
+		wl = s*(a-b)/D(b);
 		wg = 0;
 #undef D
 	}

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 24/30] sched: disable source/target_load bias
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 23/30] sched: optimize effective_load() Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 25/30] sched: fix shares boost logic Peter Zijlstra
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-kill-source-target-load.patch --]
[-- Type: text/plain, Size: 1353 bytes --]

The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched.c          |    4 ++--
 kernel/sched_features.h |    1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2000,7 +2000,7 @@ static unsigned long source_load(int cpu
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0)
+	if (type == 0 || !sched_feat(LB_BIAS))
 		return total;
 
 	return min(rq->cpu_load[type-1], total);
@@ -2015,7 +2015,7 @@ static unsigned long target_load(int cpu
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
 
-	if (type == 0)
+	if (type == 0 || !sched_feat(LB_BIAS))
 		return total;
 
 	return max(rq->cpu_load[type-1], total);
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -7,3 +7,4 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
 SCHED_FEAT(HRTICK, 1)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(ASYM_GRAN, 1)
+SCHED_FEAT(LB_BIAS, 0)
\ No newline at end of file

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 25/30] sched: fix shares boost logic
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (23 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 24/30] sched: disable source/target_load bias Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 26/30] sched: update shares on wakeup Peter Zijlstra
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fixup-tg_shares_up.patch --]
[-- Type: text/plain, Size: 751 bytes --]

In case the domain is empty, pretend there is a single task on each cpu, so
that together with the boost logic we end up giving 1/n shares to each
cpu.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1543,6 +1543,9 @@ tg_shares_up(struct task_group *tg, int 
 	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
 		shares = tg->shares;
 
+	if (!rq_weight)
+		rq_weight = cpus_weight(sd->span) * NICE_0_LOAD;
+
 	for_each_cpu_mask(i, sd->span) {
 		struct rq *rq = cpu_rq(i);
 		unsigned long flags;

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 26/30] sched: update shares on wakeup
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (24 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 25/30] sched: fix shares boost logic Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 27/30] sched: fix mult overflow Peter Zijlstra
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-throttle-update_shares.patch --]
[-- Type: text/plain, Size: 3814 bytes --]

We found that the affine wakeup code needs rather accurate load figures
to be effective. The trouble is that updating the load figures is fairly
expensive with group scheduling. Therefore ratelimit the updating.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/sched.h   |    3 +++
 kernel/sched.c          |   30 +++++++++++++++++++++++++++++-
 kernel/sched_features.h |    3 ++-
 kernel/sysctl.c         |    8 ++++++++
 4 files changed, 42 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -778,6 +778,12 @@ late_initcall(sched_init_debug);
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
+ * ratelimit for updating the group shares.
+ * default: 0.5ms
+ */
+const_debug unsigned int sysctl_sched_shares_ratelimit = 500000;
+
+/*
  * period over which we measure -rt task cpu usage in us.
  * default: 1s
  */
@@ -1590,7 +1596,13 @@ tg_nop(struct task_group *tg, int cpu, s
 
 static void update_shares(struct sched_domain *sd)
 {
-	walk_tg_tree(tg_nop, tg_shares_up, 0, sd);
+	u64 now = cpu_clock(raw_smp_processor_id());
+	s64 elapsed = now - sd->last_update;
+
+	if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
+		sd->last_update = now;
+		walk_tg_tree(tg_nop, tg_shares_up, 0, sd);
+	}
 }
 
 static void update_shares_locked(struct rq *rq, struct sched_domain *sd)
@@ -2199,6 +2211,22 @@ static int try_to_wake_up(struct task_st
 	if (!sched_feat(SYNC_WAKEUPS))
 		sync = 0;
 
+#ifdef CONFIG_SMP
+	if (sched_feat(LB_WAKEUP_UPDATE)) {
+		struct sched_domain *sd;
+
+		this_cpu = raw_smp_processor_id();
+		cpu = task_cpu(p);
+
+		for_each_domain(this_cpu, sd) {
+			if (cpu_isset(cpu, sd->span)) {
+				update_shares(sd);
+				break;
+			}
+		}
+	}
+#endif
+
 	smp_wmb();
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -8,4 +8,5 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
 SCHED_FEAT(HRTICK, 1)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(ASYM_GRAN, 1)
-SCHED_FEAT(LB_BIAS, 0)
\ No newline at end of file
+SCHED_FEAT(LB_BIAS, 0)
+SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -783,6 +783,8 @@ struct sched_domain {
 	unsigned int balance_interval;	/* initialise to 1. units in ms. */
 	unsigned int nr_balance_failed; /* initialise to 0 */
 
+	u64 last_update;
+
 #ifdef CONFIG_SCHEDSTATS
 	/* load_balance() stats */
 	unsigned int lb_count[CPU_MAX_IDLE_TYPES];
@@ -1605,6 +1607,7 @@ extern unsigned int sysctl_sched_child_r
 extern unsigned int sysctl_sched_features;
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_sched_shares_ratelimit;
 
 int sched_nr_latency_handler(struct ctl_table *table, int write,
 		struct file *file, void __user *buffer, size_t *length,
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -269,6 +269,14 @@ static struct ctl_table kern_table[] = {
 	},
 	{
 		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "sched_shares_ratelimit",
+		.data		= &sysctl_sched_shares_ratelimit,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "sched_child_runs_first",
 		.data		= &sysctl_sched_child_runs_first,
 		.maxlen		= sizeof(unsigned int),

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 27/30] sched: fix mult overflow
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (25 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 26/30] sched: update shares on wakeup Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 28/30] sched: correct wakeup weight calculations Peter Zijlstra
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-fix-mult-overflow.patch --]
[-- Type: text/plain, Size: 1502 bytes --]

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

It was observed that these multiplications can overflow.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1518,7 +1518,7 @@ load_balance_fair(struct rq *this_rq, in
 		struct cfs_rq *busiest_cfs_rq = tg->cfs_rq[busiest_cpu];
 		unsigned long busiest_h_load = busiest_cfs_rq->h_load;
 		unsigned long busiest_weight = busiest_cfs_rq->load.weight;
-		long rem_load, moved_load;
+		u64 rem_load, moved_load;
 
 		/*
 		 * empty group
@@ -1526,8 +1526,8 @@ load_balance_fair(struct rq *this_rq, in
 		if (!busiest_cfs_rq->task_weight)
 			continue;
 
-		rem_load = rem_load_move * busiest_weight;
-		rem_load /= busiest_h_load + 1;
+		rem_load = (u64)rem_load_move * busiest_weight;
+		rem_load = div_u64(rem_load, busiest_h_load + 1);
 
 		moved_load = __load_balance_fair(this_rq, this_cpu, busiest,
 				rem_load, sd, idle, all_pinned, this_best_prio,
@@ -1537,7 +1537,7 @@ load_balance_fair(struct rq *this_rq, in
 			continue;
 
 		moved_load *= busiest_h_load;
-		moved_load /= busiest_weight + 1;
+		moved_load = div_u64(moved_load, busiest_weight + 1);
 
 		rem_load_move -= moved_load;
 		if (rem_load_move < 0)

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 28/30] sched: correct wakeup weight calculations
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (26 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 27/30] sched: fix mult overflow Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 29/30] sched: incremental effective_load() Peter Zijlstra
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-effective_load.patch --]
[-- Type: text/plain, Size: 3792 bytes --]

rw_i = {2, 4, 1, 0}
s_i = {2/7, 4/7, 1/7, 0}

wakeup on cpu0, weight=1

rw'_i = {3, 4, 1, 0}
s'_i = {3/8, 4/8, 1/8, 0}

s_0 = S * rw_0 / \Sum rw_j ->
  \Sum rw_j = S*rw_0/s_0 = 1*2*7/2 = 7 (correct)

s'_0 = S * (rw_0 + 1) / (\Sum rw_j + 1) =
       1 * (2+1) / (7+1) = 3/8 (correct)

So we find that adding weight 1 to cpu0 gains it 5/56 in shares.
If, say, the other CPU were cpu1, we would also have to account for its 4/56 loss.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched_fair.c |   48 ++++++++++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 22 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1074,10 +1074,10 @@ static inline int wake_idle(int cpu, str
 static const struct sched_class fair_sched_class;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static unsigned long effective_load(struct task_group *tg, long wl, int cpu)
+static unsigned long effective_load(struct task_group *tg, int cpu,
+		unsigned long wl, unsigned long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
-	long wg = wl;
 
 	for_each_sched_entity(se) {
 #define D(n) (likely(n) ? (n) : 1)
@@ -1092,6 +1092,13 @@ static unsigned long effective_load(stru
 		b = S*rw + s*wg;
 
 		wl = s*(a-b)/D(b);
+		/*
+		 * Assume the group is already running and will
+		 * thus already be accounted for in the weight.
+		 *
+		 * That is, moving shares between CPUs, does not
+		 * alter the group weight.
+		 */
 		wg = 0;
 #undef D
 	}
@@ -1099,26 +1106,12 @@ static unsigned long effective_load(stru
 	return wl;
 }
 
-static unsigned long task_load_sub(struct task_struct *p)
-{
-	return effective_load(task_group(p), -(long)p->se.load.weight, task_cpu(p));
-}
-
-static unsigned long task_load_add(struct task_struct *p, int cpu)
-{
-	return effective_load(task_group(p), p->se.load.weight, cpu);
-}
-
 #else
 
-static unsigned long task_load_sub(struct task_struct *p)
+static inline unsigned long effective_load(struct task_group *tg, int cpu,
+		unsigned long wl, unsigned long wg)
 {
-	return -p->se.load.weight;
-}
-
-static unsigned long task_load_add(struct task_struct *p, int cpu)
-{
-	return p->se.load.weight;
+	return wl;
 }
 
 #endif
@@ -1130,8 +1123,10 @@ wake_affine(struct rq *rq, struct sched_
 	    unsigned int imbalance)
 {
 	struct task_struct *curr = this_rq->curr;
+	struct task_group *tg;
 	unsigned long tl = this_load;
 	unsigned long tl_per_task;
+	unsigned long weight;
 	int balanced;
 
 	if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
@@ -1142,10 +1137,19 @@ wake_affine(struct rq *rq, struct sched_
 	 * effect of the currently running task from the load
 	 * of the current CPU:
 	 */
-	if (sync)
-		tl += task_load_sub(current);
+	if (sync) {
+		tg = task_group(current);
+		weight = current->se.load.weight;
+
+		tl += effective_load(tg, this_cpu, -weight, -weight);
+		load += effective_load(tg, prev_cpu, 0, -weight);
+	}
+
+	tg = task_group(p);
+	weight = p->se.load.weight;
 
-	balanced = 100*(tl + task_load_add(p, this_cpu)) <= imbalance*load;
+	balanced = 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
+		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
 
 	/*
 	 * If the currently running task will sleep within
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -363,6 +363,10 @@ static inline void set_task_rq(struct ta
 #else
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
+static inline struct task_group *task_group(struct task_struct *p)
+{
+	return NULL;
+}
 
 #endif	/* CONFIG_GROUP_SCHED */
 

-- 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 29/30] sched: incremental effective_load()
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (27 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 28/30] sched: correct wakeup weight calculations Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 11:41 ` [PATCH 30/30] sched: bias effective_load() error towards failing wake_affine() Peter Zijlstra
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-incremental-wake_affine.patch --]
[-- Type: text/plain, Size: 2182 bytes --]

Increase the accuracy of the effective_load() values.

Consider not only the current increment (as per the attempted wakeup), but
also the delta between when we last adjusted the shares and the current
situation.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched.c      |    6 ++++++
 kernel/sched_fair.c |   18 +++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1074,10 +1074,22 @@ static inline int wake_idle(int cpu, str
 static const struct sched_class fair_sched_class;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static unsigned long effective_load(struct task_group *tg, int cpu,
-		unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu,
+		long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
+	long more_w;
+
+	if (!tg->parent)
+		return wl;
+
+	/*
+	 * In addition to the current increment, also add the weight delta
+	 * accumulated since the shares were last updated.
+	 */
+	more_w = se->my_q->load.weight - se->my_q->rq_weight;
+	wl += more_w;
+	wg += more_w;
 
 	for_each_sched_entity(se) {
 #define D(n) (likely(n) ? (n) : 1)
@@ -1086,7 +1098,7 @@ static unsigned long effective_load(stru
 
 		S = se->my_q->tg->shares;
 		s = se->my_q->shares;
-		rw = se->my_q->load.weight;
+		rw = se->my_q->rq_weight;
 
 		a = S*(rw + wl);
 		b = S*rw + s*wg;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -427,6 +427,11 @@ struct cfs_rq {
 	 * this cpu's part of tg->shares
 	 */
 	unsigned long shares;
+
+	/*
+	 * load.weight at the time we set shares
+	 */
+	unsigned long rq_weight;
 #endif
 #endif
 };
@@ -1527,6 +1532,7 @@ __update_group_shares_cpu(struct task_gr
 	 * record the actual number of shares, not the boosted amount.
 	 */
 	tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
+	tg->cfs_rq[cpu]->rq_weight = rq_weight;
 
 	if (shares < MIN_SHARES)
 		shares = MIN_SHARES;

-- 



* [PATCH 30/30] sched: bias effective_load() error towards failing wake_affine().
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (28 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 29/30] sched: incremental effective_load() Peter Zijlstra
@ 2008-06-27 11:41 ` Peter Zijlstra
  2008-06-27 12:46 ` [PATCH 00/30] SMP-group balancer - take 3 Ingo Molnar
  2008-06-27 17:33 ` Dhaval Giani
  31 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-06-27 11:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-asymetric-effective-load.patch --]
[-- Type: text/plain, Size: 2463 bytes --]

Measurements show that the difference between the cgroup:/ and cgroup:/foo
wake_affine() results is that the latter succeeds significantly more often.

Therefore bias the calculations towards failing the test.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched_fair.c     |   28 ++++++++++++++++++++++++++++
 kernel/sched_features.h |    1 +
 2 files changed, 29 insertions(+)

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1074,6 +1074,27 @@ static inline int wake_idle(int cpu, str
 static const struct sched_class fair_sched_class;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * effective_load() calculates the load change as seen from the root_task_group
+ *
+ * Adding load to a group doesn't make a group heavier, but can cause movement
+ * of group shares between cpus. Assuming the shares were perfectly aligned one
+ * can calculate the shift in shares.
+ *
+ * The problem is that perfectly aligning the shares is rather expensive, hence
+ * we try to avoid doing that too often - see update_shares(), which ratelimits
+ * this change.
+ *
+ * We compensate for this by not only taking the current delta into account, but
+ * also considering the delta between when the shares were last adjusted and
+ * now.
+ *
+ * We still saw a performance dip; some tracing showed us that when
+ * balancing between cgroup:/ and cgroup:/foo the number of affine wakeups
+ * increased significantly. Therefore try to bias the error in the
+ * direction of failing the affine wakeup.
+ *
+ */
 static long effective_load(struct task_group *tg, int cpu,
 		long wl, long wg)
 {
@@ -1084,6 +1105,13 @@ static long effective_load(struct task_g
 		return wl;
 
 	/*
+	 * By not taking the decrease of shares on the other cpu into
+	 * account our error leans towards reducing the affine wakeups.
+	 */
+	if (!wl && sched_feat(ASYM_EFF_LOAD))
+		return wl;
+
+	/*
 	 * In addition to the current increment, also add the weight delta
 	 * accumulated since the shares were last updated.
 	 */
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -10,3 +10,4 @@ SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(ASYM_GRAN, 1)
 SCHED_FEAT(LB_BIAS, 0)
 SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
+SCHED_FEAT(ASYM_EFF_LOAD, 1)

-- 



* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (29 preceding siblings ...)
  2008-06-27 11:41 ` [PATCH 30/30] sched: bias effective_load() error towards failing wake_affine() Peter Zijlstra
@ 2008-06-27 12:46 ` Ingo Molnar
  2008-06-27 17:33 ` Dhaval Giani
  31 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2008-06-27 12:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Srivatsa Vaddagiri, Mike Galbraith


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Hi,
> 
> Another go at SMP fairness for group scheduling.
> 
> This code needs some serious testing,..
> 
> However on my system performance doesn't tank as much as it used to. 
> I've ran sysbench and volanomark benchmarks.
> 
> The machine is a Quad core (Intel Q9450) with 4GB of RAM.
> Fedora9 - x86_64
> 
> sysbench-0.4.8 + postgresql-8.3.3
> volanomark-2.5.0.9 + openjdk-1.6.0
> 
> I've used cgroup group scheduling.

cool. I have applied your patches to a new temporary topic, 
tip/sched/devel.smp-group-balance. If that works out fine in testing 
then we can merge it back into sched/devel.

Thanks Peter,

	Ingo


* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
                   ` (30 preceding siblings ...)
  2008-06-27 12:46 ` [PATCH 00/30] SMP-group balancer - take 3 Ingo Molnar
@ 2008-06-27 17:33 ` Dhaval Giani
  2008-06-28 17:08   ` Dhaval Giani
  31 siblings, 1 reply; 39+ messages in thread
From: Dhaval Giani @ 2008-06-27 17:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith

On Fri, Jun 27, 2008 at 01:41:09PM +0200, Peter Zijlstra wrote:
> Hi,
> 
> Another go at SMP fairness for group scheduling.
> 
> This code needs some serious testing,..
> 
> However on my system performance doesn't tank as much as it used to.
> I've ran sysbench and volanomark benchmarks.
> 
> The machine is a Quad core (Intel Q9450) with 4GB of RAM.
> Fedora9 - x86_64
> 
> sysbench-0.4.8 + postgresql-8.3.3
> volanomark-2.5.0.9 + openjdk-1.6.0
> 
> I've used cgroup group scheduling.
> 
> cgroup:/ - means all tasks are in the root group
> cgroup:/foo - means all tasks are in a subgroup
> 
> mkdir /cgroup/foo
> for i in `cat /cgroup/tasks`; do
>   echo $i > /cgroup/foo/tasks
> done
> 
> The patches are against: tip/auto-sched-next of a few days ago.
> 
> ---
> 
> .25
> 
> [root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
>   1:     transactions:                        50514  (841.90 per sec.)
>   2:     transactions:                        98745  (1645.73 per sec.)
>   4:     transactions:                        192682 (3211.31 per sec.)
>   8:     transactions:                        192082 (3201.26 per sec.)
>  16:     transactions:                        188891 (3147.95 per sec.)
>  32:     transactions:                        182364 (3039.12 per sec.)
>  64:     transactions:                        169412 (2822.94 per sec.)
> 128:     transactions:                        139505 (2323.95 per sec.)
> 256:     transactions:                        131516 (2188.98 per sec.)
> 
> [root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
> Average throughput = 113350 messages per second
> Average throughput = 112230 messages per second
> Average throughput = 113125 messages per second
> 
> 
> .26-rc
> 
> cgroup:/
> 
> [root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
>   1:     transactions:                        50553  (842.54 per sec.)
>   2:     transactions:                        98625  (1643.74 per sec.)
>   4:     transactions:                        191351 (3189.12 per sec.)
>   8:     transactions:                        193525 (3225.32 per sec.)
>  16:     transactions:                        190516 (3175.10 per sec.)
>  32:     transactions:                        186914 (3114.96 per sec.)
>  64:     transactions:                        178940 (2981.78 per sec.)
> 128:     transactions:                        156430 (2606.00 per sec.)
> 256:     transactions:                        134929 (2246.63 per sec.)
> 
> [root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
> Average throughput = 124089 messages per second
> Average throughput = 121962 messages per second
> Average throughput = 121223 messages per second
> 
> 
> cgroup:/foo
> 
> [root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
>   1:     transactions:                        50246  (837.43 per sec.)
>   2:     transactions:                        97466  (1624.41 per sec.)
>   4:     transactions:                        179609 (2993.43 per sec.)
>   8:     transactions:                        190931 (3182.07 per sec.)
>  16:     transactions:                        189882 (3164.50 per sec.)
>  32:     transactions:                        184649 (3077.14 per sec.)
>  64:     transactions:                        178200 (2969.46 per sec.)
> 128:     transactions:                        158835 (2646.14 per sec.)
> 256:     transactions:                        142100 (2366.51 per sec.)
> 
> [root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
> Average throughput = 117789 messages per second
> Average throughput = 118154 messages per second
> Average throughput = 118945 messages per second
> 
> 
> .26-rc-smp-group
> 
> cgroup:/
> 
> [root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
>   1:     transactions:                        50137  (835.61 per sec.)
>   2:     transactions:                        97406  (1623.41 per sec.)
>   4:     transactions:                        170755 (2845.88 per sec.)
>   8:     transactions:                        187406 (3123.35 per sec.)
>  16:     transactions:                        186865 (3114.18 per sec.)
>  32:     transactions:                        183559 (3059.03 per sec.)
>  64:     transactions:                        176834 (2946.70 per sec.)
> 128:     transactions:                        158882 (2647.04 per sec.)
> 256:     transactions:                        145081 (2415.81 per sec.)
> 
> [root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
> Average throughput = 121499 messages per second
> Average throughput = 120181 messages per second
> Average throughput = 119775 messages per second
> 
> 
> cgroup:/foo
> 
> [root@twins sysbench-0.4.8]# ./doit-psql-256-60sec 
>   1:     transactions:                        49564  (826.06 per sec.)
>   2:     transactions:                        96642  (1610.67 per sec.)
>   4:     transactions:                        183081 (3051.29 per sec.)
>   8:     transactions:                        187553 (3125.79 per sec.)
>  16:     transactions:                        185435 (3090.45 per sec.)
>  32:     transactions:                        182314 (3038.25 per sec.)
>  64:     transactions:                        174527 (2908.22 per sec.)
> 128:     transactions:                        159321 (2654.24 per sec.)
> 256:     transactions:                        140167 (2333.82 per sec.)
> 
> [root@twins vmark]# LOOP_CLIENT_COUNT=1000 ./loopclient.sh 2>&1 | grep Average
> Average throughput = 130208 messages per second
> Average throughput = 129086 messages per second
> Average throughput = 129362 messages per second

Some fairness numbers from tip/master

kernel compiles with an even number of threads
/cgroup/a
[dhaval@mordor a]$ time make -j8
real    1m53.033s
user    1m28.785s
sys     0m22.224s

/cgroup/b
[dhaval@mordor b]$ time make -j16
real    1m51.826s
user    1m29.022s
sys     0m21.911s

kernel compile with an odd number of threads
/cgroup/a
[dhaval@mordor a]$ time make -j7
real    1m49.441s
user    1m26.962s
sys     0m21.698s

/cgroup/b
[dhaval@mordor b]$ time make -j13
real    1m50.418s
user    1m26.888s
sys     0m21.508s

Running infinite loops in parallel (5 in one group, 2 in another)

8789 - 8793 belong to /cgroup/a
8794, 8795 belong to /cgroup/b

When we start.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 8795 dhaval    20   0  1720  264  212 R 54.6  0.0   0:06.31 test              
 8794 dhaval    20   0  1720  264  212 R 45.6  0.0   0:06.91 test              
 8790 dhaval    20   0  1720  264  212 R 23.0  0.0   0:07.29 test              
 8789 dhaval    20   0  1720  260  212 R 22.6  0.0   0:07.80 test              
 8791 dhaval    20   0  1720  264  212 R 18.3  0.0   0:07.28 test              
 8792 dhaval    20   0  1720  260  212 R 18.3  0.0   0:07.01 test              
 8793 dhaval    20   0  1720  260  212 R 18.0  0.0   0:06.93 test              

After some time

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 8794 dhaval    20   0  1720  264  212 R 49.9  0.0   0:46.98 test              
 8795 dhaval    20   0  1720  264  212 R 49.9  0.0   0:52.61 test              
 8793 dhaval    20   0  1720  260  212 R 20.3  0.0   0:24.96 test              
 8789 dhaval    20   0  1720  260  212 R 20.0  0.0   0:24.83 test              
 8790 dhaval    20   0  1720  264  212 R 20.0  0.0   0:24.32 test              
 8791 dhaval    20   0  1720  264  212 R 20.0  0.0   0:23.29 test              
 8792 dhaval    20   0  1720  260  212 R 20.0  0.0   0:25.04 test              

But these numbers are not very stable, and it takes a long time (~1 min)
to converge here.

The results look really good though.

-- 
regards,
Dhaval


* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-27 17:33 ` Dhaval Giani
@ 2008-06-28 17:08   ` Dhaval Giani
  2008-06-30 12:59     ` Ingo Molnar
  0 siblings, 1 reply; 39+ messages in thread
From: Dhaval Giani @ 2008-06-28 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith

Hi,

I get this at bootup

------------[ cut here ]------------
WARNING: at kernel/lockdep.c:2738 check_flags+0x8a/0x12d()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip #5
 [<c0226971>] warn_on_slowpath+0x41/0x7b
 [<c024207e>] ? trace_hardirqs_off+0xb/0xd
 [<c0207fef>] ? native_sched_clock+0x8b/0x9d
 [<c022d490>] ? __sysctl_head_next+0x98/0x9f
 [<c057b286>] ? _spin_unlock+0x1d/0x20
 [<c022d490>] ? __sysctl_head_next+0x98/0x9f
 [<c0244a54>] ? __lock_acquire+0xd96/0xda5
 [<c024189b>] check_flags+0x8a/0x12d
 [<c0244a9e>] lock_acquire+0x3b/0x89
 [<c021cb50>] ? tg_shares_up+0x0/0x170
 [<c021b074>] walk_tg_tree+0x2c/0x9f
 [<c021b048>] ? walk_tg_tree+0x0/0x9f
 [<c02190f7>] ? tg_nop+0x0/0x5
 [<c0220d24>] update_shares+0x54/0x5d
 [<c0220d86>] try_to_wake_up+0x59/0x22b
 [<c0220f80>] wake_up_process+0xf/0x11
 [<c0237c94>] kthread_create+0x68/0x98
 [<c0234d75>] ? worker_thread+0x0/0xc2
 [<c0235207>] __create_workqueue_key+0x19e/0x1ee
 [<c0234d75>] ? worker_thread+0x0/0xc2
 [<c076dd9b>] init_workqueues+0x4c/0x5d
 [<c075b36e>] kernel_init+0xcf/0x255
 [<c0329858>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c024379e>] ? trace_hardirqs_on_caller+0x10b/0x136
 [<c0329858>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c0203a92>] ? restore_nocheck_notrace+0x0/0xe
 [<c075b29f>] ? kernel_init+0x0/0x255
 [<c075b29f>] ? kernel_init+0x0/0x255
 [<c0204623>] kernel_thread_helper+0x7/0x10
 =======================
---[ end trace 4eaa2a86a8e2da22 ]---
possible reason: unannotated irqs-on.
irq event stamp: 1892
hardirqs last  enabled at (1891): [<c02437d4>] trace_hardirqs_on+0xb/0xd
hardirqs last disabled at (1892): [<c024207e>] trace_hardirqs_off+0xb/0xd
softirqs last  enabled at (1548): [<c022b6eb>] __do_softirq+0x13e/0x146
softirqs last disabled at (1541): [<c022b72d>] do_softirq+0x3a/0x52

-- 
regards,
Dhaval


* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-28 17:08   ` Dhaval Giani
@ 2008-06-30 12:59     ` Ingo Molnar
  2008-06-30 14:53       ` Dhaval Giani
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2008-06-30 12:59 UTC (permalink / raw)
  To: Dhaval Giani
  Cc: Peter Zijlstra, linux-kernel, Srivatsa Vaddagiri, Mike Galbraith


* Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:

> Hi,
> 
> I get this at bootup
> 
> ------------[ cut here ]------------
> WARNING: at kernel/lockdep.c:2738 check_flags+0x8a/0x12d()
> Modules linked in:
> Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip #5

please check latest tip/master. This is the commit that should fix it:

----------------
| commit 2d452c9b10caeec455eb5e56a0ef4ed485178213
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Sun Jun 29 15:01:59 2008 +0200
|
|     sched: sched_clock_cpu() based cpu_clock(), lockdep fix
|
|     Vegard Nossum reported:
|
|     > WARNING: at kernel/lockdep.c:2738 check_flags+0x142/0x160()
----------------

	Ingo


* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-30 12:59     ` Ingo Molnar
@ 2008-06-30 14:53       ` Dhaval Giani
  2008-07-01 10:57         ` Dhaval Giani
  0 siblings, 1 reply; 39+ messages in thread
From: Dhaval Giani @ 2008-06-30 14:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Srivatsa Vaddagiri, Mike Galbraith

On Mon, Jun 30, 2008 at 02:59:56PM +0200, Ingo Molnar wrote:
> 
> * Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
> 
> > Hi,
> > 
> > I get this at bootup
> > 
> > ------------[ cut here ]------------
> > WARNING: at kernel/lockdep.c:2738 check_flags+0x8a/0x12d()
> > Modules linked in:
> > Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip #5
> 
> please check latest tip/master. This is the commit that should fix it:
> 

Nope, does not :(. Still get,

------------[ cut here ]------------
WARNING: at kernel/lockdep.c:2662 check_flags+0x7c/0x10b()
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.26-rc8 #2
 [<c0122a6d>] warn_on_slowpath+0x41/0x5d
 [<c013ba0e>] ? find_usage_backwards+0xb4/0xd5
 [<c013ba0e>] ? find_usage_backwards+0xb4/0xd5
 [<c013ba0e>] ? find_usage_backwards+0xb4/0xd5
 [<c013bc2b>] ? check_usage+0x23/0x58
 [<c013bcd1>] ? check_prev_add_irq+0x71/0x85
 [<c013be48>] ? check_prev_add+0x3b/0x17f
 [<c013bfe6>] ? check_prevs_add+0x5a/0xb2
 [<c013c0e8>] ? validate_chain+0xaa/0x29c
 [<c013def5>] check_flags+0x7c/0x10b
 [<c013dfb4>] lock_acquire+0x30/0x7e
 [<c01187b2>] ? tg_shares_up+0x0/0x100
 [<c01186b6>] walk_tg_tree+0x2c/0x96
 [<c011868a>] ? walk_tg_tree+0x0/0x96
 [<c0118907>] ? tg_nop+0x0/0x5
 [<c011894e>] update_shares+0x42/0x4a
 [<c011b87a>] try_to_wake_up+0x4c/0x11f
 [<c011b95c>] wake_up_process+0xf/0x11
 [<c01331b5>] kthread_create+0x6c/0x9c
 [<c0130739>] ? worker_thread+0x0/0xd2
 [<c024218c>] ? __spin_lock_init+0x24/0x47
 [<c0130ceb>] create_workqueue_thread+0x2b/0x45
 [<c0130739>] ? worker_thread+0x0/0xd2
 [<c0130e3a>] __create_workqueue_key+0x115/0x14d
 [<c05c8854>] ? kernel_init+0x0/0x93
 [<c05d7594>] init_workqueues+0x4c/0x5d
 [<c05c880d>] do_basic_setup+0x8/0x1e
 [<c05c88ac>] kernel_init+0x58/0x93
 [<c0104557>] kernel_thread_helper+0x7/0x10
 =======================
---[ end trace 4eaa2a86a8e2da22 ]---
possible reason: unannotated irqs-on.
irq event stamp: 10216
hardirqs last  enabled at (10215): [<c013e52b>] debug_check_no_locks_freed+0x9d/0xa7
hardirqs last disabled at (10216): [<c0107f91>] native_sched_clock+0x50/0xb8
softirqs last  enabled at (9922): [<c0127171>] __do_softirq+0xdf/0xe6
softirqs last disabled at (9915): [<c01271b1>] do_softirq+0x39/0x51

-- 
regards,
Dhaval


* Re: [PATCH 02/30] sched: revert the revert of: weight calculations
  2008-06-27 11:41 ` [PATCH 02/30] sched: revert the revert of: weight calculations Peter Zijlstra
@ 2008-06-30 18:07   ` Balbir Singh
  2008-07-15 20:16     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Balbir Singh @ 2008-06-30 18:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-06-27 13:41:11]:

> Try again..
> 
> initial commit: 8f1bc385cfbab474db6c27b5af1e439614f3025c
> revert: f9305d4a0968201b2818dbed0dc8cb0d4ee7aeb3
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> 
> ---
>  kernel/sched.c          |    9 +---
>  kernel/sched_fair.c     |  105 ++++++++++++++++++++++++++++++++----------------
>  kernel/sched_features.h |    1 
>  3 files changed, 76 insertions(+), 39 deletions(-)
> 
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -1342,6 +1342,9 @@ static void __resched_task(struct task_s
>   */
>  #define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
> 
> +/*
> + * delta *= weight / lw
> + */
>  static unsigned long
>  calc_delta_mine(unsigned long delta_exec, unsigned long weight,
>  		struct load_weight *lw)
> @@ -1369,12 +1372,6 @@ calc_delta_mine(unsigned long delta_exec
>  	return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
>  }
> 
> -static inline unsigned long
> -calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
> -{
> -	return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
> -}
> -
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>  {
>  	lw->weight += inc;
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -334,6 +334,34 @@ int sched_nr_latency_handler(struct ctl_
>  #endif
> 
>  /*
> + * delta *= w / rw
> + */
> +static inline unsigned long
> +calc_delta_weight(unsigned long delta, struct sched_entity *se)
> +{
> +	for_each_sched_entity(se) {
> +		delta = calc_delta_mine(delta,
> +				se->load.weight, &cfs_rq_of(se)->load);
> +	}
> +
> +	return delta;
> +}
> +
> +/*
> + * delta *= rw / w
> + */
> +static inline unsigned long
> +calc_delta_fair(unsigned long delta, struct sched_entity *se)
> +{
> +	for_each_sched_entity(se) {
> +		delta = calc_delta_mine(delta,
> +				cfs_rq_of(se)->load.weight, &se->load);
> +	}
> +
> +	return delta;
> +}
> +

These functions can do with better comments

delta is scaled up as we move up the hierarchy

Why is calc_delta_weight() different from calc_delta_fair()?

> +/*
>   * The idea is to set a period in which each task runs once.
>   *
>   * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
> @@ -362,47 +390,54 @@ static u64 __sched_period(unsigned long 
>   */
>  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	u64 slice = __sched_period(cfs_rq->nr_running);
> -
> -	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> -
> -		slice *= se->load.weight;
> -		do_div(slice, cfs_rq->load.weight);
> -	}
> -
> -
> -	return slice;
> +	return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
>  }
> 
>  /*
>   * We calculate the vruntime slice of a to be inserted task
>   *
> - * vs = s/w = p/rw
> + * vs = s*rw/w = p
>   */
>  static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	unsigned long nr_running = cfs_rq->nr_running;
> -	unsigned long weight;
> -	u64 vslice;
> 
>  	if (!se->on_rq)
>  		nr_running++;
> 
> -	vslice = __sched_period(nr_running);
> +	return __sched_period(nr_running);

Do we always return a constant value based on nr_running? Am I
misreading the diff by any chance?

> +}
> +
> +/*
> + * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
> + * that it favours >=0 over <0.
> + *
> + *   -20         |
> + *               |
> + *     0 --------+-------
> + *             .'
> + *    19     .'
> + *
> + */
> +static unsigned long
> +calc_delta_asym(unsigned long delta, struct sched_entity *se)
> +{
> +	struct load_weight lw = {
> +		.weight = NICE_0_LOAD,
> +		.inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)
> +	};

Could you please explain this

weight is 1 << 10
and inv_weight is 1 << 22



> 
>  	for_each_sched_entity(se) {
> -		cfs_rq = cfs_rq_of(se);
> +		struct load_weight *se_lw = &se->load;
> 
> -		weight = cfs_rq->load.weight;
> -		if (!se->on_rq)
> -			weight += se->load.weight;
> +		if (se->load.weight < NICE_0_LOAD)
> +			se_lw = &lw;

Why do we do this?

> 
> -		vslice *= NICE_0_LOAD;
> -		do_div(vslice, weight);
> +		delta = calc_delta_mine(delta,
> +				cfs_rq_of(se)->load.weight, se_lw);
>  	}
> 
> -	return vslice;
> +	return delta;
>  }
> 
>  /*
> @@ -419,11 +454,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
> 
>  	curr->sum_exec_runtime += delta_exec;
>  	schedstat_add(cfs_rq, exec_clock, delta_exec);
> -	delta_exec_weighted = delta_exec;
> -	if (unlikely(curr->load.weight != NICE_0_LOAD)) {
> -		delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
> -							&curr->load);
> -	}
> +	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
>  	curr->vruntime += delta_exec_weighted;
>  }
> 
> @@ -609,8 +640,17 @@ place_entity(struct cfs_rq *cfs_rq, stru
> 
>  	if (!initial) {
>  		/* sleeps upto a single latency don't count. */
> -		if (sched_feat(NEW_FAIR_SLEEPERS))
> -			vruntime -= sysctl_sched_latency;
> +		if (sched_feat(NEW_FAIR_SLEEPERS)) {
> +			unsigned long thresh = sysctl_sched_latency;
> +
> +			/*
> +			 * convert the sleeper threshold into virtual time
> +			 */
> +			if (sched_feat(NORMALIZED_SLEEPER))
> +				thresh = calc_delta_fair(thresh, se);
> +
> +			vruntime -= thresh;
> +		}
> 
>  		/* ensure we never gain time by being placed backwards. */
>  		vruntime = max_vruntime(se->vruntime, vruntime);
> @@ -1111,11 +1151,10 @@ static unsigned long wakeup_gran(struct 
>  	unsigned long gran = sysctl_sched_wakeup_granularity;
> 
>  	/*
> -	 * More easily preempt - nice tasks, while not making
> -	 * it harder for + nice tasks.
> +	 * More easily preempt - nice tasks, while not making it harder for
> +	 * + nice tasks.
>  	 */
> -	if (unlikely(se->load.weight > NICE_0_LOAD))
> -		gran = calc_delta_fair(gran, &se->load);
> +	gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
> 
>  	return gran;
>  }
> Index: linux-2.6/kernel/sched_features.h
> ===================================================================
> --- linux-2.6.orig/kernel/sched_features.h
> +++ linux-2.6/kernel/sched_features.h
> @@ -1,4 +1,5 @@
>  SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
> +SCHED_FEAT(NORMALIZED_SLEEPER, 1)
>  SCHED_FEAT(WAKEUP_PREEMPT, 1)
>  SCHED_FEAT(START_DEBIT, 1)
>  SCHED_FEAT(AFFINE_WAKEUPS, 1)

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL


* Re: [PATCH 00/30] SMP-group balancer - take 3
  2008-06-30 14:53       ` Dhaval Giani
@ 2008-07-01 10:57         ` Dhaval Giani
  0 siblings, 0 replies; 39+ messages in thread
From: Dhaval Giani @ 2008-07-01 10:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Srivatsa Vaddagiri, Mike Galbraith

On Mon, Jun 30, 2008 at 08:23:57PM +0530, Dhaval Giani wrote:
> On Mon, Jun 30, 2008 at 02:59:56PM +0200, Ingo Molnar wrote:
> > 
> > * Dhaval Giani <dhaval@linux.vnet.ibm.com> wrote:
> > 
> > > Hi,
> > > 
> > > I get this at bootup
> > > 
> > > ------------[ cut here ]------------
> > > WARNING: at kernel/lockdep.c:2738 check_flags+0x8a/0x12d()
> > > Modules linked in:
> > > Pid: 1, comm: swapper Not tainted 2.6.26-rc8-tip #5
> > 
> > please check latest tip/master. This is the commit that should fix it:
> > 
> 
> Nope, does not :(. Still get,
> 

Ah, turns out my git-fetch did not work so well. I just pulled the
latest tip, and it seems to have been fixed. Sorry for the noise.

Thanks,
-- 
regards,
Dhaval


* Re: [PATCH 02/30] sched: revert the revert of: weight calculations
  2008-06-30 18:07   ` Balbir Singh
@ 2008-07-15 20:16     ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2008-07-15 20:16 UTC (permalink / raw)
  To: balbir; +Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Mike Galbraith

On Mon, 2008-06-30 at 23:37 +0530, Balbir Singh wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2008-06-27 13:41:11]:

> >  /*
> > + * delta *= w / rw
> > + */
> > +static inline unsigned long
> > +calc_delta_weight(unsigned long delta, struct sched_entity *se)
> > +{
> > +	for_each_sched_entity(se) {
> > +		delta = calc_delta_mine(delta,
> > +				se->load.weight, &cfs_rq_of(se)->load);
> > +	}
> > +
> > +	return delta;
> > +}
> > +
> > +/*
> > + * delta *= rw / w
> > + */
> > +static inline unsigned long
> > +calc_delta_fair(unsigned long delta, struct sched_entity *se)
> > +{
> > +	for_each_sched_entity(se) {
> > +		delta = calc_delta_mine(delta,
> > +				cfs_rq_of(se)->load.weight, &se->load);
> > +	}
> > +
> > +	return delta;
> > +}
> > +
> 
> These functions can do with better comments

you mean like: 

/*
 * delta *= \Prod_{i} rw_{i} / w_{i} ?
 */

?

> delta is scaled up as we move up the hierarchy
> 
> Why is calc_delta_weight() different from calc_delta_fair()?

Because they do the opposite operation.

I agree though that perhaps the names could have been chosen better.
I've wondered about that on several occasions but so far failed to come
up with anything sane.

> > +/*
> >   * The idea is to set a period in which each task runs once.
> >   *
> >   * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
> > @@ -362,47 +390,54 @@ static u64 __sched_period(unsigned long 
> >   */
> >  static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -	u64 slice = __sched_period(cfs_rq->nr_running);
> > -
> > -	for_each_sched_entity(se) {
> > -		cfs_rq = cfs_rq_of(se);
> > -
> > -		slice *= se->load.weight;
> > -		do_div(slice, cfs_rq->load.weight);
> > -	}
> > -
> > -
> > -	return slice;
> > +	return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
> >  }
> > 
> >  /*
> >   * We calculate the vruntime slice of a to be inserted task
> >   *
> > - * vs = s/w = p/rw
> > + * vs = s*rw/w = p
> >   */
> >  static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> >  	unsigned long nr_running = cfs_rq->nr_running;
> > -	unsigned long weight;
> > -	u64 vslice;
> > 
> >  	if (!se->on_rq)
> >  		nr_running++;
> > 
> > -	vslice = __sched_period(nr_running);
> > +	return __sched_period(nr_running);
> 
> Do we always return a constant value based on nr_running? Am I
> misreading the diff by any chance?

static u64 __sched_period(unsigned long nr_running)
{
        u64 period = sysctl_sched_latency;
        unsigned long nr_latency = sched_nr_latency;

        if (unlikely(nr_running > nr_latency)) {
                period = sysctl_sched_min_granularity;
                period *= nr_running;
        }

        return period;
}

it's not exactly constant..

> > +}
> > +
> > +/*
> > + * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
> > + * that it favours >=0 over <0.
> > + *
> > + *   -20         |
> > + *               |
> > + *     0 --------+-------
> > + *             .'
> > + *    19     .'
> > + *
> > + */
> > +static unsigned long
> > +calc_delta_asym(unsigned long delta, struct sched_entity *se)
> > +{
> > +	struct load_weight lw = {
> > +		.weight = NICE_0_LOAD,
> > +		.inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)
> > +	};
> 
> Could you please explain this
> 
> weight is 1 << 10
> and inv_weight is 1 << 22

we have the relation that:

 x/weight ~= (x*inv_weight) >> 32

or

 inv_weight = (1<<32) / weight

See kernel/sched.c:calc_delta_mine()

when weight is 1<<10, that reduces to 1<<(32-10) = 1<<22

> > 
> >  	for_each_sched_entity(se) {
> > -		cfs_rq = cfs_rq_of(se);
> > +		struct load_weight *se_lw = &se->load;
> > 
> > -		weight = cfs_rq->load.weight;
> > -		if (!se->on_rq)
> > -			weight += se->load.weight;
> > +		if (se->load.weight < NICE_0_LOAD)
> > +			se_lw = &lw;
> 
> Why do we do this?

You're basically asking what the _asym part is about, right?

So, what this patch does is change the virtual time calculation from:

 1 / w, to rw / w

[ actually to: \Prod_{i} rw_{i}/w_{i} ]

Now wakeup_gran() has this asymmetry:

> > 	/*
> > -	 * More easily preempt - nice tasks, while not making
> > -	 * it harder for + nice tasks.
> >  	 */
> > -	if (unlikely(se->load.weight > NICE_0_LOAD))
> > -		gran = calc_delta_fair(gran, &se->load);

calc_delta_asym() tries to generalize that to the new scheme. As you can
see from the next two patches, the code in this patch isn't perfect.
This patch just restores the status quo from before the revert; the next
patches continue from there.





Thread overview: 39+ messages
-- links below jump to the message on this page --
2008-06-27 11:41 [PATCH 00/30] SMP-group balancer - take 3 Peter Zijlstra
2008-06-27 11:41 ` [PATCH 01/30] sched: clean up some unused variables Peter Zijlstra
2008-06-27 11:41 ` [PATCH 02/30] sched: revert the revert of: weight calculations Peter Zijlstra
2008-06-30 18:07   ` Balbir Singh
2008-07-15 20:16     ` Peter Zijlstra
2008-06-27 11:41 ` [PATCH 03/30] sched: fix calc_delta_asym() Peter Zijlstra
2008-06-27 11:41 ` [PATCH 04/30] sched: fix calc_delta_asym Peter Zijlstra
2008-06-27 11:41 ` [PATCH 05/30] sched: revert revert of: fair-group: SMP-nice for group scheduling Peter Zijlstra
2008-06-27 11:41 ` [PATCH 06/30] sched: sched_clock_cpu() based cpu_clock() Peter Zijlstra
2008-06-27 11:41 ` [PATCH 07/30] sched: fix wakeup granularity and buddy granularity Peter Zijlstra
2008-06-27 11:41 ` [PATCH 08/30] sched: add full schedstats to /proc/sched_debug Peter Zijlstra
2008-06-27 11:41 ` [PATCH 09/30] sched: fix sched_domain aggregation Peter Zijlstra
2008-06-27 11:41 ` [PATCH 10/30] sched: update aggregate when holding the RQs Peter Zijlstra
2008-06-27 11:41 ` [PATCH 11/30] sched: kill task_group balancing Peter Zijlstra
2008-06-27 11:41 ` [PATCH 12/30] sched: dont micro manage share losses Peter Zijlstra
2008-06-27 11:41 ` [PATCH 13/30] sched: no need to aggregate task_weight Peter Zijlstra
2008-06-27 11:41 ` [PATCH 14/30] sched: simplify the group load balancer Peter Zijlstra
2008-06-27 11:41 ` [PATCH 15/30] sched: fix newidle smp group balancing Peter Zijlstra
2008-06-27 11:41 ` [PATCH 16/30] sched: fix sched_balance_self() " Peter Zijlstra
2008-06-27 11:41 ` [PATCH 17/30] sched: persistent average load per task Peter Zijlstra
2008-06-27 11:41 ` [PATCH 18/30] sched: hierarchical load vs affine wakeups Peter Zijlstra
2008-06-27 11:41 ` [PATCH 19/30] sched: hierarchical load vs find_busiest_group Peter Zijlstra
2008-06-27 11:41 ` [PATCH 20/30] sched: fix load scaling in group balancing Peter Zijlstra
2008-06-27 11:41 ` [PATCH 21/30] sched: fix task_h_load() Peter Zijlstra
2008-06-27 11:41 ` [PATCH 22/30] sched: remove prio preference from balance decisions Peter Zijlstra
2008-06-27 11:41 ` [PATCH 23/30] sched: optimize effective_load() Peter Zijlstra
2008-06-27 11:41 ` [PATCH 24/30] sched: disable source/target_load bias Peter Zijlstra
2008-06-27 11:41 ` [PATCH 25/30] sched: fix shares boost logic Peter Zijlstra
2008-06-27 11:41 ` [PATCH 26/30] sched: update shares on wakeup Peter Zijlstra
2008-06-27 11:41 ` [PATCH 27/30] sched: fix mult overflow Peter Zijlstra
2008-06-27 11:41 ` [PATCH 28/30] sched: correct wakeup weight calculations Peter Zijlstra
2008-06-27 11:41 ` [PATCH 29/30] sched: incremental effective_load() Peter Zijlstra
2008-06-27 11:41 ` [PATCH 30/30] sched: bias effective_load() error towards failing wake_affine() Peter Zijlstra
2008-06-27 12:46 ` [PATCH 00/30] SMP-group balancer - take 3 Ingo Molnar
2008-06-27 17:33 ` Dhaval Giani
2008-06-28 17:08   ` Dhaval Giani
2008-06-30 12:59     ` Ingo Molnar
2008-06-30 14:53       ` Dhaval Giani
2008-07-01 10:57         ` Dhaval Giani
