* [RFC][PATCH 0/3] Try and make cpu-cgroup suck less
@ 2010-08-28 22:30 Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-08-28 22:30 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Paul Turner, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
These patches try to make cpu-cgroup suck less.
I didn't test the RT changes much at all, but did poke a little at the FAIR bits.
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC][PATCH 1/3] sched: Rewrite tg_shares_up
2010-08-28 22:30 [RFC][PATCH 0/3] Try and make cpu-cgroup suck less Peter Zijlstra
@ 2010-08-28 22:30 ` Peter Zijlstra
2010-08-30 17:20 ` Srivatsa Vaddagiri
2010-09-03 3:09 ` Paul Turner
2010-08-28 22:30 ` [RFC][PATCH 2/3] sched: On-demand cfs_rq list Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 3/3] sched: On-demand tg_shares_up() Peter Zijlstra
2 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-08-28 22:30 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Paul Turner, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon, Peter Zijlstra
[-- Attachment #1: sched-rework-tg_shares_up.patch --]
[-- Type: text/plain, Size: 18700 bytes --]
By tracking a per-cpu load-avg for each cfs_rq and folding it into a
global task_group load on each tick, we can rework tg_shares_up to be
strictly per-cpu.
This should improve cpu-cgroup performance significantly on SMP
systems.
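The decaying per-cpu average works roughly like the following stand-alone
sketch (user-space C, not kernel code: the struct is a hypothetical model
of the new cfs_rq fields, and PERIOD stands in for sched_avg_period()):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of the decaying load average kept
 * per cfs_rq; field names mirror the patch but this is not kernel code. */
struct cfs_rq_model {
	uint64_t load_avg;	/* sum of weight*time, decayed */
	uint64_t load_period;	/* time accumulated, decayed alongside */
	uint64_t load_stamp;	/* timestamp of last update */
	unsigned long weight;	/* current load.weight of the queue */
};

static const uint64_t PERIOD = 1000000;	/* stand-in for sched_avg_period() */

static void update_cfs_load_model(struct cfs_rq_model *c, uint64_t now)
{
	uint64_t delta = now - c->load_stamp;

	c->load_stamp = now;
	c->load_period += delta;
	c->load_avg += delta * c->weight;

	/* halve numerator and denominator together so the average
	 * stays the same but old history decays geometrically */
	while (c->load_period > PERIOD) {
		c->load_period /= 2;
		c->load_avg /= 2;
	}
}
```

With a constant weight, load_avg / (load_period + 1) converges on that
weight; that quotient is what tg_shares_up() folds into tg->load_weight.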
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 2
kernel/sched.c | 173 ++++++++++++------------------------------------
kernel/sched_debug.c | 11 ++-
kernel/sched_fair.c | 153 ++++++++++++++++++++++++++----------------
kernel/sched_features.h | 2
kernel/sysctl.c | 17 ----
6 files changed, 147 insertions(+), 211 deletions(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1858,8 +1858,6 @@ static inline void wake_up_idle_cpu(int
extern unsigned int sysctl_sched_latency;
extern unsigned int sysctl_sched_min_granularity;
extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_shares_ratelimit;
-extern unsigned int sysctl_sched_shares_thresh;
extern unsigned int sysctl_sched_child_runs_first;
enum sched_tunable_scaling {
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -253,6 +253,8 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
+
+ atomic_t load_weight;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
@@ -359,15 +361,11 @@ struct cfs_rq {
*/
unsigned long h_load;
- /*
- * this cpu's part of tg->shares
- */
- unsigned long shares;
+ u64 load_avg;
+ u64 load_period;
+ u64 load_stamp;
- /*
- * load.weight at the time we set shares
- */
- unsigned long rq_weight;
+ unsigned long load_contribution;
#endif
#endif
};
@@ -793,20 +791,6 @@ late_initcall(sched_init_debug);
const_debug unsigned int sysctl_sched_nr_migrate = 32;
/*
- * ratelimit for updating the group shares.
- * default: 0.25ms
- */
-unsigned int sysctl_sched_shares_ratelimit = 250000;
-unsigned int normalized_sysctl_sched_shares_ratelimit = 250000;
-
-/*
- * Inject some fuzzyness into changing the per-cpu group shares
- * this avoids remote rq-locks at the expense of fairness.
- * default: 4
- */
-unsigned int sysctl_sched_shares_thresh = 4;
-
-/*
* period over which we average the RT time consumption, measured
* in ms.
*
@@ -1351,6 +1335,12 @@ static inline void update_load_sub(struc
lw->inv_weight = 0;
}
+static inline void update_load_set(struct load_weight *lw, unsigned long w)
+{
+ lw->weight = w;
+ lw->inv_weight = 0;
+}
+
/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution
* of tasks with abnormal "nice" values across CPUs the contribution that
@@ -1539,97 +1529,44 @@ static unsigned long cpu_avg_load_per_ta
#ifdef CONFIG_FAIR_GROUP_SCHED
-static __read_mostly unsigned long __percpu *update_shares_data;
-
-static void __set_se_shares(struct sched_entity *se, unsigned long shares);
-
-/*
- * Calculate and set the cpu's group shares.
- */
-static void update_group_shares_cpu(struct task_group *tg, int cpu,
- unsigned long sd_shares,
- unsigned long sd_rq_weight,
- unsigned long *usd_rq_weight)
-{
- unsigned long shares, rq_weight;
- int boost = 0;
-
- rq_weight = usd_rq_weight[cpu];
- if (!rq_weight) {
- boost = 1;
- rq_weight = NICE_0_LOAD;
- }
-
- /*
- * \Sum_j shares_j * rq_weight_i
- * shares_i = -----------------------------
- * \Sum_j rq_weight_j
- */
- shares = (sd_shares * rq_weight) / sd_rq_weight;
- shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
-
- if (abs(shares - tg->se[cpu]->load.weight) >
- sysctl_sched_shares_thresh) {
- struct rq *rq = cpu_rq(cpu);
- unsigned long flags;
-
- raw_spin_lock_irqsave(&rq->lock, flags);
- tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
- tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
- __set_se_shares(tg->se[cpu], shares);
- raw_spin_unlock_irqrestore(&rq->lock, flags);
- }
-}
+static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_shares(struct cfs_rq *cfs_rq);
/*
- * Re-compute the task group their per cpu shares over the given domain.
- * This needs to be done in a bottom-up fashion because the rq weight of a
- * parent group depends on the shares of its child groups.
+ * update tg->load_weight by folding this cpu's load_avg
*/
static int tg_shares_up(struct task_group *tg, void *data)
{
- unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
- unsigned long *usd_rq_weight;
- struct sched_domain *sd = data;
+ unsigned long load_avg;
+ struct cfs_rq *cfs_rq;
unsigned long flags;
- int i;
+ int cpu = (long)data;
+ struct rq *rq;
- if (!tg->se[0])
+ if (!tg->se[cpu])
return 0;
- local_irq_save(flags);
- usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
-
- for_each_cpu(i, sched_domain_span(sd)) {
- weight = tg->cfs_rq[i]->load.weight;
- usd_rq_weight[i] = weight;
-
- rq_weight += weight;
- /*
- * If there are currently no tasks on the cpu pretend there
- * is one of average load so that when a new task gets to
- * run here it will not get delayed by group starvation.
- */
- if (!weight)
- weight = NICE_0_LOAD;
+ rq = cpu_rq(cpu);
+ cfs_rq = tg->cfs_rq[cpu];
- sum_weight += weight;
- shares += tg->cfs_rq[i]->shares;
- }
+ raw_spin_lock_irqsave(&rq->lock, flags);
- if (!rq_weight)
- rq_weight = sum_weight;
+ update_rq_clock(rq);
+ update_cfs_load(cfs_rq);
- if ((!shares && rq_weight) || shares > tg->shares)
- shares = tg->shares;
+ load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+ load_avg -= cfs_rq->load_contribution;
- if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
- shares = tg->shares;
+ atomic_add(load_avg, &tg->load_weight);
+ cfs_rq->load_contribution += load_avg;
- for_each_cpu(i, sched_domain_span(sd))
- update_group_shares_cpu(tg, i, shares, rq_weight, usd_rq_weight);
+ /*
+ * We need to update shares after updating tg->load_weight in
+ * order to adjust the weight of groups with long running tasks.
+ */
+ update_cfs_shares(cfs_rq);
- local_irq_restore(flags);
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
return 0;
}
@@ -1648,7 +1585,7 @@ static int tg_load_down(struct task_grou
load = cpu_rq(cpu)->load.weight;
} else {
load = tg->parent->cfs_rq[cpu]->h_load;
- load *= tg->cfs_rq[cpu]->shares;
+ load *= tg->se[cpu]->load.weight;
load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
}
@@ -1657,21 +1594,16 @@ static int tg_load_down(struct task_grou
return 0;
}
-static void update_shares(struct sched_domain *sd)
+static void update_shares(long cpu)
{
- s64 elapsed;
- u64 now;
-
if (root_task_group_empty())
return;
- now = local_clock();
- elapsed = now - sd->last_update;
+ /*
+ * XXX: replace with an on-demand list
+ */
- if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
- sd->last_update = now;
- walk_tg_tree(tg_nop, tg_shares_up, sd);
- }
+ walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
}
static void update_h_load(long cpu)
@@ -1681,7 +1613,7 @@ static void update_h_load(long cpu)
#else
-static inline void update_shares(struct sched_domain *sd)
+static inline void update_shares(int cpu)
{
}
@@ -1806,15 +1738,6 @@ static void double_rq_unlock(struct rq *
#endif
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
-{
-#ifdef CONFIG_SMP
- cfs_rq->shares = shares;
-#endif
-}
-#endif
-
static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
@@ -5400,7 +5323,6 @@ static void update_sysctl(void)
SET_SYSCTL(sched_min_granularity);
SET_SYSCTL(sched_latency);
SET_SYSCTL(sched_wakeup_granularity);
- SET_SYSCTL(sched_shares_ratelimit);
#undef SET_SYSCTL
}
@@ -7652,8 +7574,7 @@ static void init_tg_cfs_entry(struct tas
se->cfs_rq = parent->my_q;
se->my_q = cfs_rq;
- se->load.weight = tg->shares;
- se->load.inv_weight = 0;
+ update_load_set(&se->load, tg->shares);
se->parent = parent;
}
#endif
@@ -7746,10 +7667,6 @@ void __init sched_init(void)
#endif /* CONFIG_CGROUP_SCHED */
-#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
- update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
- __alignof__(unsigned long));
-#endif
for_each_possible_cpu(i) {
struct rq *rq;
@@ -8317,8 +8234,7 @@ static void __set_se_shares(struct sched
if (on_rq)
dequeue_entity(cfs_rq, se, 0);
- se->load.weight = shares;
- se->load.inv_weight = 0;
+ update_load_set(&se->load, shares);
if (on_rq)
enqueue_entity(cfs_rq, se, 0);
@@ -8375,7 +8291,6 @@ int sched_group_set_shares(struct task_g
/*
* force a rebalance
*/
- cfs_rq_set_shares(tg->cfs_rq[i], 0);
set_se_shares(tg->se[i], shares);
}
Index: linux-2.6/kernel/sched_debug.c
===================================================================
--- linux-2.6.orig/kernel/sched_debug.c
+++ linux-2.6/kernel/sched_debug.c
@@ -204,13 +204,18 @@ void print_cfs_rq(struct seq_file *m, in
SPLIT_NS(spread0));
SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_avg",
+ SPLIT_NS(cfs_rq->load_avg));
+ SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_period",
+ SPLIT_NS(cfs_rq->load_period));
+ SEQ_printf(m, " .%-30s: %ld\n", "load_contrib",
+ cfs_rq->load_contribution);
+ SEQ_printf(m, " .%-30s: %d\n", "load_tg",
+ atomic_read(&tg->load_weight));
SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over",
cfs_rq->nr_spread_over);
#ifdef CONFIG_FAIR_GROUP_SCHED
-#ifdef CONFIG_SMP
- SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
-#endif
print_cfs_group_stats(m, cpu, cfs_rq->tg);
#endif
}
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
WRT_SYSCTL(sched_min_granularity);
WRT_SYSCTL(sched_latency);
WRT_SYSCTL(sched_wakeup_granularity);
- WRT_SYSCTL(sched_shares_ratelimit);
#undef WRT_SYSCTL
return 0;
@@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
list_add(&se->group_node, &cfs_rq->tasks);
}
cfs_rq->nr_running++;
- se->on_rq = 1;
}
static void
@@ -647,9 +645,84 @@ account_entity_dequeue(struct cfs_rq *cf
list_del_init(&se->group_node);
}
cfs_rq->nr_running--;
- se->on_rq = 0;
}
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+ u64 period = sched_avg_period();
+ u64 now = rq_of(cfs_rq)->clock;
+ u64 delta = now - cfs_rq->load_stamp;
+
+ cfs_rq->load_stamp = now;
+ cfs_rq->load_period += delta;
+ cfs_rq->load_avg += delta * cfs_rq->load.weight;
+
+ while (cfs_rq->load_period > period) {
+ /*
+ * Inline assembly required to prevent the compiler
+ * optimising this loop into a divmod call.
+ * See __iter_div_u64_rem() for another example of this.
+ */
+ asm("" : "+rm" (cfs_rq->load_period));
+ cfs_rq->load_period /= 2;
+ cfs_rq->load_avg /= 2;
+ }
+}
+
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ unsigned long weight)
+{
+ if (se->on_rq)
+ account_entity_dequeue(cfs_rq, se);
+
+ update_load_set(&se->load, weight);
+
+ if (se->on_rq)
+ account_entity_enqueue(cfs_rq, se);
+}
+
+static void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+ struct task_group *tg;
+ struct sched_entity *se;
+ unsigned long load_weight, load, shares;
+
+ if (!cfs_rq)
+ return;
+
+ tg = cfs_rq->tg;
+ se = tg->se[cpu_of(rq_of(cfs_rq))];
+ if (!se)
+ return;
+
+ load = cfs_rq->load.weight;
+
+ load_weight = atomic_read(&tg->load_weight);
+ load_weight -= cfs_rq->load_contribution;
+ load_weight += load;
+
+ shares = (tg->shares * load);
+ if (load_weight)
+ shares /= load_weight;
+
+ if (shares < MIN_SHARES)
+ shares = MIN_SHARES;
+ if (shares > tg->shares)
+ shares = tg->shares;
+
+ reweight_entity(cfs_rq_of(se), se, shares);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
#ifdef CONFIG_SCHEDSTATS
@@ -771,7 +844,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
+ update_cfs_load(cfs_rq);
account_entity_enqueue(cfs_rq, se);
+ update_cfs_shares(group_cfs_rq(se));
if (flags & ENQUEUE_WAKEUP) {
place_entity(cfs_rq, se, 0);
@@ -782,6 +857,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
+ se->on_rq = 1;
}
static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -825,8 +901,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
+ se->on_rq = 0;
+ update_cfs_load(cfs_rq);
account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);
+ update_cfs_shares(group_cfs_rq(se));
/*
* Normalize the entity after updating the min_vruntime because the
@@ -1055,6 +1134,9 @@ enqueue_task_fair(struct rq *rq, struct
flags = ENQUEUE_WAKEUP;
}
+ for_each_sched_entity(se)
+ update_cfs_shares(group_cfs_rq(se));
+
hrtick_update(rq);
}
@@ -1071,12 +1153,16 @@ static void dequeue_task_fair(struct rq
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
+
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight)
break;
flags |= DEQUEUE_SLEEP;
}
+ for_each_sched_entity(se)
+ update_cfs_shares(group_cfs_rq(se));
+
hrtick_update(rq);
}
@@ -1143,51 +1229,20 @@ static void task_waking_fair(struct rq *
* Adding load to a group doesn't make a group heavier, but can cause movement
* of group shares between cpus. Assuming the shares were perfectly aligned one
* can calculate the shift in shares.
- *
- * The problem is that perfectly aligning the shares is rather expensive, hence
- * we try to avoid doing that too often - see update_shares(), which ratelimits
- * this change.
- *
- * We compensate this by not only taking the current delta into account, but
- * also considering the delta between when the shares were last adjusted and
- * now.
- *
- * We still saw a performance dip, some tracing learned us that between
- * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
- * significantly. Therefore try to bias the error in direction of failing
- * the affine wakeup.
- *
*/
-static long effective_load(struct task_group *tg, int cpu,
- long wl, long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
struct sched_entity *se = tg->se[cpu];
if (!tg->parent)
return wl;
- /*
- * By not taking the decrease of shares on the other cpu into
- * account our error leans towards reducing the affine wakeups.
- */
- if (!wl && sched_feat(ASYM_EFF_LOAD))
- return wl;
-
for_each_sched_entity(se) {
long S, rw, s, a, b;
- long more_w;
- /*
- * Instead of using this increment, also add the difference
- * between when the shares were last updated and now.
- */
- more_w = se->my_q->load.weight - se->my_q->rq_weight;
- wl += more_w;
- wg += more_w;
-
- S = se->my_q->tg->shares;
- s = se->my_q->shares;
- rw = se->my_q->rq_weight;
+ S = tg->shares;
+ s = se->load.weight;
+ rw = se->my_q->load.weight;
a = S*(rw + wl);
b = S*rw + s*wg;
@@ -1509,23 +1564,6 @@ select_task_rq_fair(struct rq *rq, struc
sd = tmp;
}
-#ifdef CONFIG_FAIR_GROUP_SCHED
- if (sched_feat(LB_SHARES_UPDATE)) {
- /*
- * Pick the largest domain to update shares over
- */
- tmp = sd;
- if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
- tmp = affine_sd;
-
- if (tmp) {
- raw_spin_unlock(&rq->lock);
- update_shares(tmp);
- raw_spin_lock(&rq->lock);
- }
- }
-#endif
-
if (affine_sd) {
if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
return select_idle_sibling(p, cpu);
@@ -2980,7 +3018,6 @@ static int load_balance(int this_cpu, st
schedstat_inc(sd, lb_count[idle]);
redo:
- update_shares(sd);
group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
cpus, balance);
@@ -3115,8 +3152,6 @@ out_one_pinned:
else
ld_moved = 0;
out:
- if (ld_moved)
- update_shares(sd);
return ld_moved;
}
@@ -3510,6 +3545,8 @@ static void rebalance_domains(int cpu, e
int update_next_balance = 0;
int need_serialize;
+ update_shares(cpu);
+
for_each_domain(cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -52,8 +52,6 @@ SCHED_FEAT(ARCH_POWER, 0)
SCHED_FEAT(HRTICK, 0)
SCHED_FEAT(DOUBLE_TICK, 0)
SCHED_FEAT(LB_BIAS, 1)
-SCHED_FEAT(LB_SHARES_UPDATE, 1)
-SCHED_FEAT(ASYM_EFF_LOAD, 1)
/*
* Spin-wait on mutex acquisition when the mutex owner is running on
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -307,15 +307,6 @@ static struct ctl_table kern_table[] = {
.extra2 = &max_wakeup_granularity_ns,
},
{
- .procname = "sched_shares_ratelimit",
- .data = &sysctl_sched_shares_ratelimit,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = sched_proc_update_handler,
- .extra1 = &min_sched_shares_ratelimit,
- .extra2 = &max_sched_shares_ratelimit,
- },
- {
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
.maxlen = sizeof(enum sched_tunable_scaling),
@@ -325,14 +316,6 @@ static struct ctl_table kern_table[] = {
.extra2 = &max_sched_tunable_scaling,
},
{
- .procname = "sched_shares_thresh",
- .data = &sysctl_sched_shares_thresh,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &zero,
- },
- {
.procname = "sched_migration_cost",
.data = &sysctl_sched_migration_cost,
.maxlen = sizeof(unsigned int),
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC][PATCH 2/3] sched: On-demand cfs_rq list
2010-08-28 22:30 [RFC][PATCH 0/3] Try and make cpu-cgroup suck less Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
@ 2010-08-28 22:30 ` Peter Zijlstra
2010-09-03 3:33 ` Paul Turner
2010-08-28 22:30 ` [RFC][PATCH 3/3] sched: On-demand tg_shares_up() Peter Zijlstra
2 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2010-08-28 22:30 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Paul Turner, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon, Peter Zijlstra
[-- Attachment #1: sched-tg-ondemand.patch --]
[-- Type: text/plain, Size: 11706 bytes --]
Make certain load-balance actions scale with the number of active
cgroups instead of the total number of existing cgroups.
This makes the wakeup/sleep paths more expensive, but is a win for
systems where the vast majority of existing cgroups are idle.
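The activity tracking boils down to an on_list flag guarding idempotent
add/remove on the leaf list. A minimal user-space model of that guard
(the list primitives here are simplified stand-ins for the kernel's RCU
list helpers, not the real API):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified circular doubly-linked list, standing in for the
 * kernel's list_head / list_add_rcu / list_del_rcu machinery. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next; n->next->prev = n->prev;
}

/* Hypothetical model of a cfs_rq with the new on_list flag. */
struct cfs_rq_model {
	int on_list;
	struct list_head leaf;
};

/* add only on the 0 -> 1 transition; safe to call repeatedly */
static void leaf_add(struct cfs_rq_model *c, struct list_head *leaf_list)
{
	if (!c->on_list) {
		list_add(&c->leaf, leaf_list);
		c->on_list = 1;
	}
}

/* remove only when actually on the list; also safe to repeat */
static void leaf_del(struct cfs_rq_model *c)
{
	if (c->on_list) {
		list_del(&c->leaf);
		c->on_list = 0;
	}
}
```

Without the flag a second add or del would corrupt the list; with it,
enqueue and the decayed-idle path in update_cfs_load() can both call
these unconditionally.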
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 86 +++++++++-------------------------------------------
kernel/sched_fair.c | 41 ++++++++++++++++++++++--
kernel/sched_rt.c | 24 ++++++++++++++
3 files changed, 76 insertions(+), 75 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -344,6 +344,7 @@ struct cfs_rq {
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
* list is used during load balance.
*/
+ int on_list;
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
@@ -1529,7 +1530,7 @@ static unsigned long cpu_avg_load_per_ta
#ifdef CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
static void update_cfs_shares(struct cfs_rq *cfs_rq);
/*
@@ -1552,7 +1553,7 @@ static int tg_shares_up(struct task_grou
raw_spin_lock_irqsave(&rq->lock, flags);
update_rq_clock(rq);
- update_cfs_load(cfs_rq);
+ update_cfs_load(cfs_rq, 1);
load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
load_avg -= cfs_rq->load_contribution;
@@ -7553,15 +7554,13 @@ static void init_rt_rq(struct rt_rq *rt_
#ifdef CONFIG_FAIR_GROUP_SCHED
static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
- struct sched_entity *se, int cpu, int add,
+ struct sched_entity *se, int cpu,
struct sched_entity *parent)
{
struct rq *rq = cpu_rq(cpu);
tg->cfs_rq[cpu] = cfs_rq;
init_cfs_rq(cfs_rq, rq);
cfs_rq->tg = tg;
- if (add)
- list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
tg->se[cpu] = se;
/* se could be NULL for init_task_group */
@@ -7581,7 +7580,7 @@ static void init_tg_cfs_entry(struct tas
#ifdef CONFIG_RT_GROUP_SCHED
static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
- struct sched_rt_entity *rt_se, int cpu, int add,
+ struct sched_rt_entity *rt_se, int cpu,
struct sched_rt_entity *parent)
{
struct rq *rq = cpu_rq(cpu);
@@ -7590,8 +7589,6 @@ static void init_tg_rt_entry(struct task
init_rt_rq(rt_rq, rq);
rt_rq->tg = tg;
rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
- if (add)
- list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
tg->rt_se[cpu] = rt_se;
if (!rt_se)
@@ -7700,7 +7697,7 @@ void __init sched_init(void)
* We achieve this by letting init_task_group's tasks sit
* directly in rq->cfs (i.e init_task_group->se[] = NULL).
*/
- init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
+ init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, NULL);
#endif
#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -7708,7 +7705,7 @@ void __init sched_init(void)
#ifdef CONFIG_RT_GROUP_SCHED
INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
#ifdef CONFIG_CGROUP_SCHED
- init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
+ init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, NULL);
#endif
#endif
@@ -7984,7 +7981,7 @@ int alloc_fair_sched_group(struct task_g
if (!se)
goto err_free_rq;
- init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
+ init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
}
return 1;
@@ -8015,14 +8012,6 @@ int alloc_fair_sched_group(struct task_g
{
return 1;
}
-
-static inline void register_fair_sched_group(struct task_group *tg, int cpu)
-{
-}
-
-static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
-{
-}
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
@@ -8074,7 +8063,7 @@ int alloc_rt_sched_group(struct task_gro
if (!rt_se)
goto err_free_rq;
- init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent->rt_se[i]);
+ init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
}
return 1;
@@ -8084,17 +8073,6 @@ int alloc_rt_sched_group(struct task_gro
err:
return 0;
}
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
- list_add_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list,
- &cpu_rq(cpu)->leaf_rt_rq_list);
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
- list_del_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list);
-}
#else /* !CONFIG_RT_GROUP_SCHED */
static inline void free_rt_sched_group(struct task_group *tg)
{
@@ -8105,14 +8083,6 @@ int alloc_rt_sched_group(struct task_gro
{
return 1;
}
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
#endif /* CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_CGROUP_SCHED
@@ -8128,7 +8098,6 @@ struct task_group *sched_create_group(st
{
struct task_group *tg;
unsigned long flags;
- int i;
tg = kzalloc(sizeof(*tg), GFP_KERNEL);
if (!tg)
@@ -8141,10 +8110,6 @@ struct task_group *sched_create_group(st
goto err;
spin_lock_irqsave(&task_group_lock, flags);
- for_each_possible_cpu(i) {
- register_fair_sched_group(tg, i);
- register_rt_sched_group(tg, i);
- }
list_add_rcu(&tg->list, &task_groups);
WARN_ON(!parent); /* root should already exist */
@@ -8175,10 +8140,12 @@ void sched_destroy_group(struct task_gro
int i;
spin_lock_irqsave(&task_group_lock, flags);
- for_each_possible_cpu(i) {
- unregister_fair_sched_group(tg, i);
- unregister_rt_sched_group(tg, i);
- }
+ /*
+ * XXX should not be a race against enqueue, even without rq->lock
+ * because only empty groups can be destroyed.
+ */
+ for_each_possible_cpu(i)
+ list_del_leaf_cfs_rq(tg->cfs_rq[i]);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
spin_unlock_irqrestore(&task_group_lock, flags);
@@ -8256,7 +8223,6 @@ static DEFINE_MUTEX(shares_mutex);
int sched_group_set_shares(struct task_group *tg, unsigned long shares)
{
int i;
- unsigned long flags;
/*
* We can't change the weight of the root cgroup.
@@ -8273,19 +8239,6 @@ int sched_group_set_shares(struct task_g
if (tg->shares == shares)
goto done;
- spin_lock_irqsave(&task_group_lock, flags);
- for_each_possible_cpu(i)
- unregister_fair_sched_group(tg, i);
- list_del_rcu(&tg->siblings);
- spin_unlock_irqrestore(&task_group_lock, flags);
-
- /* wait for any ongoing reference to this group to finish */
- synchronize_sched();
-
- /*
- * Now we are free to modify the group's share on each cpu
- * w/o tripping rebalance_share or load_balance_fair.
- */
tg->shares = shares;
for_each_possible_cpu(i) {
/*
@@ -8294,15 +8247,6 @@ int sched_group_set_shares(struct task_g
set_se_shares(tg->se[i], shares);
}
- /*
- * Enable load balance activity on this group, by inserting it back on
- * each cpu's rq->leaf_cfs_rq_list.
- */
- spin_lock_irqsave(&task_group_lock, flags);
- for_each_possible_cpu(i)
- register_fair_sched_group(tg, i);
- list_add_rcu(&tg->siblings, &tg->parent->children);
- spin_unlock_irqrestore(&task_group_lock, flags);
done:
mutex_unlock(&shares_mutex);
return 0;
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -143,6 +143,23 @@ static inline struct cfs_rq *cpu_cfs_rq(
return cfs_rq->tg->cfs_rq[this_cpu];
}
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ if (!cfs_rq->on_list) {
+ list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+ &rq_of(cfs_rq)->leaf_cfs_rq_list);
+ cfs_rq->on_list = 1;
+ }
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+ if (cfs_rq->on_list) {
+ list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+ cfs_rq->on_list = 0;
+ }
+}
+
/* Iterate thr' all leaf cfs_rq's on a runqueue */
#define for_each_leaf_cfs_rq(rq, cfs_rq) \
list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
@@ -246,6 +263,14 @@ static inline struct cfs_rq *cpu_cfs_rq(
return &cpu_rq(this_cpu)->cfs;
}
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
#define for_each_leaf_cfs_rq(rq, cfs_rq) \
for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
@@ -648,7 +673,7 @@ account_entity_dequeue(struct cfs_rq *cf
}
#ifdef CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
{
u64 period = sched_avg_period();
u64 now = rq_of(cfs_rq)->clock;
@@ -668,6 +693,11 @@ static void update_cfs_load(struct cfs_r
cfs_rq->load_period /= 2;
cfs_rq->load_avg /= 2;
}
+
+ if (lb && !cfs_rq->nr_running) {
+ if (cfs_rq->load_period < (period / 8))
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
}
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
@@ -729,7 +759,7 @@ static void update_cfs_shares(struct cfs
reweight_entity(cfs_rq_of(se), se, shares);
}
#else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
{
}
@@ -859,7 +889,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
- update_cfs_load(cfs_rq);
+ update_cfs_load(cfs_rq, 0);
account_entity_enqueue(cfs_rq, se);
update_cfs_shares(group_cfs_rq(se));
@@ -873,6 +903,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;
+
+ if (cfs_rq->nr_running == 1)
+ list_add_leaf_cfs_rq(cfs_rq);
}
static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -917,7 +950,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;
- update_cfs_load(cfs_rq);
+ update_cfs_load(cfs_rq, 0);
account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);
update_cfs_shares(group_cfs_rq(se));
Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -183,6 +183,17 @@ static inline u64 sched_rt_period(struct
return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
}
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+ list_add_rcu(&rt_rq->leaf_rt_rq_list,
+ &rq_of_rt_rq(rt_rq)->leaf_rt_rq_list);
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+ list_del_rcu(&rt_rq->leaf_rt_rq_list);
+}
+
#define for_each_leaf_rt_rq(rt_rq, rq) \
list_for_each_entry_rcu(rt_rq, &rq->leaf_rt_rq_list, leaf_rt_rq_list)
@@ -276,6 +287,14 @@ static inline u64 sched_rt_period(struct
return ktime_to_ns(def_rt_bandwidth.rt_period);
}
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
#define for_each_leaf_rt_rq(rt_rq, rq) \
for (rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
@@ -825,6 +844,9 @@ static void __enqueue_rt_entity(struct s
if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
return;
+ if (!rt_rq->rt_nr_running)
+ list_add_leaf_rt_rq(rt_rq);
+
if (head)
list_add(&rt_se->run_list, queue);
else
@@ -844,6 +866,8 @@ static void __dequeue_rt_entity(struct s
__clear_bit(rt_se_prio(rt_se), array->bitmap);
dec_rt_tasks(rt_se, rt_rq);
+ if (!rt_rq->rt_nr_running)
+ list_del_leaf_rt_rq(rt_rq);
}
/*
^ permalink raw reply [flat|nested] 12+ messages in thread
* [RFC][PATCH 3/3] sched: On-demand tg_shares_up()
2010-08-28 22:30 [RFC][PATCH 0/3] Try and make cpu-cgroup suck less Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 2/3] sched: On-demand cfs_rq list Peter Zijlstra
@ 2010-08-28 22:30 ` Peter Zijlstra
2010-09-03 1:52 ` Paul Turner
2 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2010-08-28 22:30 UTC (permalink / raw)
To: linux-kernel
Cc: Ingo Molnar, Paul Turner, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon, Peter Zijlstra
[-- Attachment #1: sched-tg-more-ondemand.patch --]
[-- Type: text/plain, Size: 4405 bytes --]
Make tg_shares_up() use the active cgroup list; this means we cannot
do a strict bottom-up walk of the hierarchy, but assuming it's a very
wide tree with a small number of active groups it should be a win.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched.c | 67 ----------------------------------------------------
kernel/sched_fair.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 67 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -281,13 +281,6 @@ static DEFINE_SPINLOCK(task_group_lock);
#ifdef CONFIG_FAIR_GROUP_SCHED
-#ifdef CONFIG_SMP
-static int root_task_group_empty(void)
-{
- return list_empty(&root_task_group.children);
-}
-#endif
-
# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
/*
@@ -1530,48 +1523,6 @@ static unsigned long cpu_avg_load_per_ta
#ifdef CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static int tg_shares_up(struct task_group *tg, void *data)
-{
- unsigned long load_avg;
- struct cfs_rq *cfs_rq;
- unsigned long flags;
- int cpu = (long)data;
- struct rq *rq;
-
- if (!tg->se[cpu])
- return 0;
-
- rq = cpu_rq(cpu);
- cfs_rq = tg->cfs_rq[cpu];
-
- raw_spin_lock_irqsave(&rq->lock, flags);
-
- update_rq_clock(rq);
- update_cfs_load(cfs_rq, 1);
-
- load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
- load_avg -= cfs_rq->load_contribution;
-
- atomic_add(load_avg, &tg->load_weight);
- cfs_rq->load_contribution += load_avg;
-
- /*
- * We need to update shares after updating tg->load_weight in
- * order to adjust the weight of groups with long running tasks.
- */
- update_cfs_shares(cfs_rq);
-
- raw_spin_unlock_irqrestore(&rq->lock, flags);
-
- return 0;
-}
-
/*
* Compute the cpu's hierarchical load factor for each task group.
* This needs to be done in a top-down fashion because the load of a child
@@ -1595,29 +1546,11 @@ static int tg_load_down(struct task_grou
return 0;
}
-static void update_shares(long cpu)
-{
- if (root_task_group_empty())
- return;
-
- /*
- * XXX: replace with an on-demand list
- */
-
- walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
-}
-
static void update_h_load(long cpu)
{
walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
}
-#else
-
-static inline void update_shares(int cpu)
-{
-}
-
#endif
#ifdef CONFIG_PREEMPT
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -2002,6 +2002,61 @@ out:
}
#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * update tg->load_weight by folding this cpu's load_avg
+ */
+static int tg_shares_up(struct task_group *tg, int cpu)
+{
+ unsigned long load_avg;
+ struct cfs_rq *cfs_rq;
+ unsigned long flags;
+ struct rq *rq;
+
+ if (!tg->se[cpu])
+ return 0;
+
+ rq = cpu_rq(cpu);
+ cfs_rq = tg->cfs_rq[cpu];
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+
+ update_rq_clock(rq);
+ update_cfs_load(cfs_rq, 1);
+
+ load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+ load_avg -= cfs_rq->load_contribution;
+
+ atomic_add(load_avg, &tg->load_weight);
+ cfs_rq->load_contribution += load_avg;
+
+ /*
+ * We need to update shares after updating tg->load_weight in
+ * order to adjust the weight of groups with long running tasks.
+ */
+ update_cfs_shares(cfs_rq);
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+ return 0;
+}
+
+static void update_shares(int cpu)
+{
+ struct cfs_rq *cfs_rq;
+ struct rq *rq = cpu_rq(cpu);
+
+ rcu_read_lock();
+ for_each_leaf_cfs_rq(rq, cfs_rq) {
+ struct task_group *tg = cfs_rq->tg;
+
+ do {
+ tg_shares_up(tg, cpu);
+ tg = tg->parent;
+ } while (tg);
+ }
+ rcu_read_unlock();
+}
+
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
@@ -2049,6 +2104,10 @@ load_balance_fair(struct rq *this_rq, in
return max_load_move - rem_load_move;
}
#else
+static inline void update_shares(int cpu)
+{
+}
+
static unsigned long
load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
unsigned long max_load_move,
* Re: [RFC][PATCH 1/3] sched: Rewrite tg_shares_up
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
@ 2010-08-30 17:20 ` Srivatsa Vaddagiri
2010-08-30 17:53 ` Peter Zijlstra
2010-09-03 3:09 ` Paul Turner
1 sibling, 1 reply; 12+ messages in thread
From: Srivatsa Vaddagiri @ 2010-08-30 17:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Paul Turner, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Sun, Aug 29, 2010 at 12:30:26AM +0200, Peter Zijlstra wrote:
> By tracking a per-cpu load-avg for each cfs_rq and folding it into a
> global task_group load on each tick we can rework tg_shares_up to be
> strictly per-cpu.
So tg->load_weight is supposed to represent more or less the current task load
across all cpus? I see only atomic_add() to it - which means it can only keep
growing or remain constant - IOW capturing the historical load ever since
the task group was started? I was expecting it to reduce as a group
goes idle, otherwise
> +static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +{
> + struct task_group *tg;
> + struct sched_entity *se;
> + unsigned long load_weight, load, shares;
> +
> + if (!cfs_rq)
> + return;
> +
> + tg = cfs_rq->tg;
> + se = tg->se[cpu_of(rq_of(cfs_rq))];
> + if (!se)
> + return;
> +
> + load = cfs_rq->load.weight;
> +
> + load_weight = atomic_read(&tg->load_weight);
> + load_weight -= cfs_rq->load_contribution;
> + load_weight += load;
> +
> + shares = (tg->shares * load);
> + if (load_weight)
> + shares /= load_weight;
this seems incorrect? Even though we have corrected tg->load_weight to reflect
the current load on 'cfs_rq', it still captures historical load on
other cpus and hence could be a large number, making the division inaccurate?
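To make the concern concrete, here is a user-space toy of the patch's
update_cfs_shares() arithmetic (made-up numbers and a hypothetical helper,
not kernel code): with tg->shares = 1024, a stale 1024 contribution left in
tg->load_weight by a now-idle cpu halves the busy cpu's computed share to
512 instead of the full 1024.

```c
#define MIN_SHARES 2UL

/*
 * Toy model of the update_cfs_shares() arithmetic from the patch;
 * the inputs are made up, this is not kernel code.
 */
static unsigned long calc_shares(unsigned long tg_shares,
				 unsigned long tg_load_weight,	/* global sum */
				 unsigned long contribution,	/* this cpu's last fold */
				 unsigned long local_load)	/* cfs_rq->load.weight */
{
	unsigned long load_weight, shares;

	/* swap this cpu's stale contribution for its current load */
	load_weight = tg_load_weight - contribution + local_load;

	shares = tg_shares * local_load;
	if (load_weight)
		shares /= load_weight;

	if (shares < MIN_SHARES)
		shares = MIN_SHARES;
	if (shares > tg_shares)
		shares = tg_shares;
	return shares;
}
```

With a stale extra 1024 in the global sum the share comes out as
1024*1024/2048 = 512; with an up-to-date global sum it is the full 1024.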
Also I wonder how much of a hot spot tg->load_weight would become ..
- vatsa
* Re: [RFC][PATCH 1/3] sched: Rewrite tg_shares_up
2010-08-30 17:20 ` Srivatsa Vaddagiri
@ 2010-08-30 17:53 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-08-30 17:53 UTC (permalink / raw)
To: vatsa
Cc: linux-kernel, Ingo Molnar, Paul Turner, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Mon, 2010-08-30 at 22:50 +0530, Srivatsa Vaddagiri wrote:
> On Sun, Aug 29, 2010 at 12:30:26AM +0200, Peter Zijlstra wrote:
> > By tracking a per-cpu load-avg for each cfs_rq and folding it into a
> > global task_group load on each tick we can rework tg_shares_up to be
> > strictly per-cpu.
>
> So tg->load_weight is supposed to represent more or less the current task load
> across all cpus? I see only atomic_add() to it - which means it can only keep
> growing or remain constant - IOW capturing the historical load ever since
> the task group was started? I was expecting it to reduce as a group
> goes idle, otherwise
It can add a negative value; I should probably have made the code below use
long instead of unsigned long.
> > +static void update_cfs_shares(struct cfs_rq *cfs_rq)
> > +{
> > + struct task_group *tg;
> > + struct sched_entity *se;
> > + unsigned long load_weight, load, shares;
> > +
> > + if (!cfs_rq)
> > + return;
> > +
> > + tg = cfs_rq->tg;
> > + se = tg->se[cpu_of(rq_of(cfs_rq))];
> > + if (!se)
> > + return;
> > +
> > + load = cfs_rq->load.weight;
> > +
> > + load_weight = atomic_read(&tg->load_weight);
> > + load_weight -= cfs_rq->load_contribution;
> > + load_weight += load;
> > +
> > + shares = (tg->shares * load);
> > + if (load_weight)
> > + shares /= load_weight;
>
> this seem incorrect? Even though we have corrected tg->load_weight to reflect
> current load on 'cfs_rq', tg->load_weight still captures historical load on
> other cpus and hence could be a large #, making the division inaccurate?
Well, without history you can get terribly inaccurate too. Suppose the
workload sleeps for 50ms then runs for 50ms with n tasks, each on its
own cpu. Without history you'll end up being off by n*load; with history
you'll float around the avg, giving a much more stable result.
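The stability argument can be illustrated with a user-space toy of the
update_cfs_load() folding from patch 1/3 (arbitrary time units and
hypothetical helper names, not the kernel code itself): a 50%-duty-cycle
task settles near half its weight instead of bouncing between 0 and full
load.

```c
struct load_track {
	unsigned long long load_avg;	/* sum of delta * weight */
	unsigned long long load_period;	/* sum of delta */
};

/*
 * Toy model of the update_cfs_load() folding: accumulate delta*weight,
 * and halve both sums whenever the window grows past 'period' so old
 * history decays geometrically.
 */
static void fold_load(struct load_track *lt, unsigned long long delta,
		      unsigned long weight, unsigned long long period)
{
	lt->load_period += delta;
	lt->load_avg += delta * weight;

	while (lt->load_period > period) {
		lt->load_period /= 2;
		lt->load_avg /= 2;
	}
}

/* the load_avg / (load_period + 1) division from tg_shares_up() */
static unsigned long long track_avg(const struct load_track *lt)
{
	return lt->load_avg / (lt->load_period + 1);
}
```

After 50 units asleep (weight 0) and 50 units running at weight 1024,
the tracked average sits near 512; further idle time then decays it
toward 0 rather than leaving the old load captured forever.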
The (tg->shares * cfs_rq->load) / tg->load < MIN_SHARES issue is not
something we can easily fix; there's only so much you can do with
fixed-point math.
> Also I wonder how much of a hot spot tg->load_weight would become ..
We touch it once per tick, so it's pretty hot on large systems; then again,
the larger the system, the less likely it is that all cpus will run tasks
from the same cgroup.
* Re: [RFC][PATCH 3/3] sched: On-demand tg_shares_up()
2010-08-28 22:30 ` [RFC][PATCH 3/3] sched: On-demand tg_shares_up() Peter Zijlstra
@ 2010-09-03 1:52 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Paul Turner @ 2010-09-03 1:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Sat, Aug 28, 2010 at 11:30 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Make tg_shares_up() use the active cgroup list; this means we cannot
> do a strict bottom-up walk of the hierarchy, but assuming it's a very
> wide tree with a small number of active groups it should be a win.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> kernel/sched.c | 67 ----------------------------------------------------
> kernel/sched_fair.c | 59 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+), 67 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -281,13 +281,6 @@ static DEFINE_SPINLOCK(task_group_lock);
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -#ifdef CONFIG_SMP
> -static int root_task_group_empty(void)
> -{
> - return list_empty(&root_task_group.children);
> -}
> -#endif
> -
> # define INIT_TASK_GROUP_LOAD NICE_0_LOAD
>
> /*
> @@ -1530,48 +1523,6 @@ static unsigned long cpu_avg_load_per_ta
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
> -static void update_cfs_shares(struct cfs_rq *cfs_rq);
> -
> -/*
> - * update tg->load_weight by folding this cpu's load_avg
> - */
> -static int tg_shares_up(struct task_group *tg, void *data)
> -{
> - unsigned long load_avg;
> - struct cfs_rq *cfs_rq;
> - unsigned long flags;
> - int cpu = (long)data;
> - struct rq *rq;
> -
> - if (!tg->se[cpu])
> - return 0;
> -
> - rq = cpu_rq(cpu);
> - cfs_rq = tg->cfs_rq[cpu];
> -
> - raw_spin_lock_irqsave(&rq->lock, flags);
> -
> - update_rq_clock(rq);
> - update_cfs_load(cfs_rq, 1);
> -
> - load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
> - load_avg -= cfs_rq->load_contribution;
> -
> - atomic_add(load_avg, &tg->load_weight);
> - cfs_rq->load_contribution += load_avg;
> -
> - /*
> - * We need to update shares after updating tg->load_weight in
> - * order to adjust the weight of groups with long running tasks.
> - */
> - update_cfs_shares(cfs_rq);
> -
> - raw_spin_unlock_irqrestore(&rq->lock, flags);
> -
> - return 0;
> -}
> -
> /*
> * Compute the cpu's hierarchical load factor for each task group.
> * This needs to be done in a top-down fashion because the load of a child
> @@ -1595,29 +1546,11 @@ static int tg_load_down(struct task_grou
> return 0;
> }
>
> -static void update_shares(long cpu)
> -{
> - if (root_task_group_empty())
> - return;
> -
> - /*
> - * XXX: replace with an on-demand list
> - */
> -
> - walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
> -}
> -
> static void update_h_load(long cpu)
> {
> walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
> }
>
> -#else
> -
> -static inline void update_shares(int cpu)
> -{
> -}
> -
> #endif
>
> #ifdef CONFIG_PREEMPT
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -2002,6 +2002,61 @@ out:
> }
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> +/*
> + * update tg->load_weight by folding this cpu's load_avg
> + */
> +static int tg_shares_up(struct task_group *tg, int cpu)
> +{
> + unsigned long load_avg;
> + struct cfs_rq *cfs_rq;
> + unsigned long flags;
> + struct rq *rq;
> +
> + if (!tg->se[cpu])
> + return 0;
> +
> + rq = cpu_rq(cpu);
> + cfs_rq = tg->cfs_rq[cpu];
> +
> + raw_spin_lock_irqsave(&rq->lock, flags);
> +
> + update_rq_clock(rq);
> + update_cfs_load(cfs_rq, 1);
> +
> + load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
> + load_avg -= cfs_rq->load_contribution;
> +
> + atomic_add(load_avg, &tg->load_weight);
> + cfs_rq->load_contribution += load_avg;
> +
> + /*
> + * We need to update shares after updating tg->load_weight in
> + * order to adjust the weight of groups with long running tasks.
> + */
> + update_cfs_shares(cfs_rq);
> +
> + raw_spin_unlock_irqrestore(&rq->lock, flags);
> +
> + return 0;
> +}
> +
> +static void update_shares(int cpu)
> +{
> + struct cfs_rq *cfs_rq;
> + struct rq *rq = cpu_rq(cpu);
> +
> + rcu_read_lock();
> + for_each_leaf_cfs_rq(rq, cfs_rq) {
> + struct task_group *tg = cfs_rq->tg;
> +
> + do {
> + tg_shares_up(tg, cpu);
> + tg = tg->parent;
> + } while (tg);
> + }
This will visit task groups multiple times:
In the case of a-b-task, both {a} and {b} will be on the leaf cfs_rq
list, resulting in 'a' being visited both as b's parent and as a leaf
entity.
Rather than juggling to avoid this, it seems easier to maintain an
ordering on rq->leaf_cfs_rq_list.
That is:
When an entity is added:
a) If it has a parent, insert it immediately before the parent in the list.
b) Otherwise (it is attached to the root), attach it to the tail of
rq->leaf_cfs_rq_list.
This can actually be simplified to: if there is no parent, insert at the
tail; otherwise insert at the head, since we know the parent will always
have been processed prior to the child.
Traversing the list in order should then ensure that, for any given
entity, all of its child entities have already been processed, so its
update is coherent.
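A user-space toy of the simplified rule (hypothetical names, a plain
singly-linked list rather than the kernel's list_head API) shows why a
front-to-back walk then sees every child before its parent:

```c
#include <stddef.h>

struct group {
	struct group *next;
	struct group *parent;	/* NULL if attached to the root */
	int id;
};

struct leaf_list {
	struct group *head, *tail;
};

/* The proposed ordering: children at the head, root-attached at the tail. */
static void add_leaf(struct leaf_list *l, struct group *g)
{
	if (g->parent) {		/* child: insert at the head */
		g->next = l->head;
		l->head = g;
		if (!l->tail)
			l->tail = g;
	} else {			/* root-attached: append at the tail */
		g->next = NULL;
		if (l->tail)
			l->tail->next = g;
		else
			l->head = g;
		l->tail = g;
	}
}
```

Since a parent is always enqueued before its child, inserting the child
at the head necessarily places it ahead of the parent in the walk order.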
> + rcu_read_unlock();
> +}
> +
> static unsigned long
> load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
> unsigned long max_load_move,
> @@ -2049,6 +2104,10 @@ load_balance_fair(struct rq *this_rq, in
> return max_load_move - rem_load_move;
> }
> #else
> +static inline void update_shares(int cpu)
> +{
> +}
> +
> static unsigned long
> load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
> unsigned long max_load_move,
>
* Re: [RFC][PATCH 1/3] sched: Rewrite tg_shares_up
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
2010-08-30 17:20 ` Srivatsa Vaddagiri
@ 2010-09-03 3:09 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra
1 sibling, 1 reply; 12+ messages in thread
From: Paul Turner @ 2010-09-03 3:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Sat, Aug 28, 2010 at 11:30 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> By tracking a per-cpu load-avg for each cfs_rq and folding it into a
> global task_group load on each tick we can rework tg_shares_up to be
> strictly per-cpu.
>
> This should improve cpu-cgroup performance for smp systems
> significantly.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> include/linux/sched.h | 2
> kernel/sched.c | 173 ++++++++++++------------------------------------
> kernel/sched_debug.c | 11 ++-
> kernel/sched_fair.c | 153 ++++++++++++++++++++++++++----------------
> kernel/sched_features.h | 2
> kernel/sysctl.c | 17 ----
> 6 files changed, 147 insertions(+), 211 deletions(-)
>
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -1858,8 +1858,6 @@ static inline void wake_up_idle_cpu(int
> extern unsigned int sysctl_sched_latency;
> extern unsigned int sysctl_sched_min_granularity;
> extern unsigned int sysctl_sched_wakeup_granularity;
> -extern unsigned int sysctl_sched_shares_ratelimit;
> -extern unsigned int sysctl_sched_shares_thresh;
> extern unsigned int sysctl_sched_child_runs_first;
>
> enum sched_tunable_scaling {
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -253,6 +253,8 @@ struct task_group {
> /* runqueue "owned" by this group on each cpu */
> struct cfs_rq **cfs_rq;
> unsigned long shares;
> +
> + atomic_t load_weight;
> #endif
>
> #ifdef CONFIG_RT_GROUP_SCHED
> @@ -359,15 +361,11 @@ struct cfs_rq {
> */
> unsigned long h_load;
>
> - /*
> - * this cpu's part of tg->shares
> - */
> - unsigned long shares;
> + u64 load_avg;
> + u64 load_period;
> + u64 load_stamp;
>
> - /*
> - * load.weight at the time we set shares
> - */
> - unsigned long rq_weight;
> + unsigned long load_contribution;
> #endif
> #endif
> };
> @@ -793,20 +791,6 @@ late_initcall(sched_init_debug);
> const_debug unsigned int sysctl_sched_nr_migrate = 32;
>
> /*
> - * ratelimit for updating the group shares.
> - * default: 0.25ms
> - */
> -unsigned int sysctl_sched_shares_ratelimit = 250000;
> -unsigned int normalized_sysctl_sched_shares_ratelimit = 250000;
> -
> -/*
> - * Inject some fuzzyness into changing the per-cpu group shares
> - * this avoids remote rq-locks at the expense of fairness.
> - * default: 4
> - */
> -unsigned int sysctl_sched_shares_thresh = 4;
> -
> -/*
> * period over which we average the RT time consumption, measured
> * in ms.
> *
> @@ -1351,6 +1335,12 @@ static inline void update_load_sub(struc
> lw->inv_weight = 0;
> }
>
> +static inline void update_load_set(struct load_weight *lw, unsigned long w)
> +{
> + lw->weight = w;
> + lw->inv_weight = 0;
> +}
> +
> /*
> * To aid in avoiding the subversion of "niceness" due to uneven distribution
> * of tasks with abnormal "nice" values across CPUs the contribution that
> @@ -1539,97 +1529,44 @@ static unsigned long cpu_avg_load_per_ta
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static __read_mostly unsigned long __percpu *update_shares_data;
> -
> -static void __set_se_shares(struct sched_entity *se, unsigned long shares);
> -
> -/*
> - * Calculate and set the cpu's group shares.
> - */
> -static void update_group_shares_cpu(struct task_group *tg, int cpu,
> - unsigned long sd_shares,
> - unsigned long sd_rq_weight,
> - unsigned long *usd_rq_weight)
> -{
> - unsigned long shares, rq_weight;
> - int boost = 0;
> -
> - rq_weight = usd_rq_weight[cpu];
> - if (!rq_weight) {
> - boost = 1;
> - rq_weight = NICE_0_LOAD;
> - }
> -
> - /*
> - * \Sum_j shares_j * rq_weight_i
> - * shares_i = -----------------------------
> - * \Sum_j rq_weight_j
> - */
> - shares = (sd_shares * rq_weight) / sd_rq_weight;
> - shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
> -
> - if (abs(shares - tg->se[cpu]->load.weight) >
> - sysctl_sched_shares_thresh) {
> - struct rq *rq = cpu_rq(cpu);
> - unsigned long flags;
> -
> - raw_spin_lock_irqsave(&rq->lock, flags);
> - tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
> - tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
> - __set_se_shares(tg->se[cpu], shares);
> - raw_spin_unlock_irqrestore(&rq->lock, flags);
> - }
> -}
> +static void update_cfs_load(struct cfs_rq *cfs_rq);
> +static void update_cfs_shares(struct cfs_rq *cfs_rq);
>
> /*
> - * Re-compute the task group their per cpu shares over the given domain.
> - * This needs to be done in a bottom-up fashion because the rq weight of a
> - * parent group depends on the shares of its child groups.
> + * update tg->load_weight by folding this cpu's load_avg
> */
> static int tg_shares_up(struct task_group *tg, void *data)
> {
> - unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
> - unsigned long *usd_rq_weight;
> - struct sched_domain *sd = data;
> + unsigned long load_avg;
> + struct cfs_rq *cfs_rq;
> unsigned long flags;
> - int i;
> + int cpu = (long)data;
> + struct rq *rq;
>
> - if (!tg->se[0])
> + if (!tg->se[cpu])
> return 0;
>
> - local_irq_save(flags);
> - usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
> -
> - for_each_cpu(i, sched_domain_span(sd)) {
> - weight = tg->cfs_rq[i]->load.weight;
> - usd_rq_weight[i] = weight;
> -
> - rq_weight += weight;
> - /*
> - * If there are currently no tasks on the cpu pretend there
> - * is one of average load so that when a new task gets to
> - * run here it will not get delayed by group starvation.
> - */
> - if (!weight)
> - weight = NICE_0_LOAD;
> + rq = cpu_rq(cpu);
> + cfs_rq = tg->cfs_rq[cpu];
>
> - sum_weight += weight;
> - shares += tg->cfs_rq[i]->shares;
> - }
> + raw_spin_lock_irqsave(&rq->lock, flags);
>
> - if (!rq_weight)
> - rq_weight = sum_weight;
> + update_rq_clock(rq);
> + update_cfs_load(cfs_rq);
>
> - if ((!shares && rq_weight) || shares > tg->shares)
> - shares = tg->shares;
> + load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
> + load_avg -= cfs_rq->load_contribution;
>
> - if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
> - shares = tg->shares;
> + atomic_add(load_avg, &tg->load_weight);
> + cfs_rq->load_contribution += load_avg;
>
> - for_each_cpu(i, sched_domain_span(sd))
> - update_group_shares_cpu(tg, i, shares, rq_weight, usd_rq_weight);
> + /*
> + * We need to update shares after updating tg->load_weight in
> + * order to adjust the weight of groups with long running tasks.
> + */
> + update_cfs_shares(cfs_rq);
>
> - local_irq_restore(flags);
> + raw_spin_unlock_irqrestore(&rq->lock, flags);
>
> return 0;
> }
> @@ -1648,7 +1585,7 @@ static int tg_load_down(struct task_grou
> load = cpu_rq(cpu)->load.weight;
> } else {
> load = tg->parent->cfs_rq[cpu]->h_load;
> - load *= tg->cfs_rq[cpu]->shares;
> + load *= tg->se[cpu]->load.weight;
> load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
> }
>
> @@ -1657,21 +1594,16 @@ static int tg_load_down(struct task_grou
> return 0;
> }
>
> -static void update_shares(struct sched_domain *sd)
> +static void update_shares(long cpu)
> {
> - s64 elapsed;
> - u64 now;
> -
> if (root_task_group_empty())
> return;
>
> - now = local_clock();
> - elapsed = now - sd->last_update;
> + /*
> + * XXX: replace with an on-demand list
> + */
>
> - if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
> - sd->last_update = now;
> - walk_tg_tree(tg_nop, tg_shares_up, sd);
> - }
> + walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
> }
>
> static void update_h_load(long cpu)
> @@ -1681,7 +1613,7 @@ static void update_h_load(long cpu)
>
> #else
>
> -static inline void update_shares(struct sched_domain *sd)
> +static inline void update_shares(int cpu)
> {
> }
>
> @@ -1806,15 +1738,6 @@ static void double_rq_unlock(struct rq *
>
> #endif
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> -static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
> -{
> -#ifdef CONFIG_SMP
> - cfs_rq->shares = shares;
> -#endif
> -}
> -#endif
> -
> static void calc_load_account_idle(struct rq *this_rq);
> static void update_sysctl(void);
> static int get_update_sysctl_factor(void);
> @@ -5400,7 +5323,6 @@ static void update_sysctl(void)
> SET_SYSCTL(sched_min_granularity);
> SET_SYSCTL(sched_latency);
> SET_SYSCTL(sched_wakeup_granularity);
> - SET_SYSCTL(sched_shares_ratelimit);
> #undef SET_SYSCTL
> }
>
> @@ -7652,8 +7574,7 @@ static void init_tg_cfs_entry(struct tas
> se->cfs_rq = parent->my_q;
>
> se->my_q = cfs_rq;
> - se->load.weight = tg->shares;
> - se->load.inv_weight = 0;
> + update_load_set(&se->load, tg->shares);
Given the now-instantaneous update of shares->load on enqueue/dequeue,
initializing to 0 would result in sane(r) sums across tg->se->load.
Only relevant for debug though.
> se->parent = parent;
> }
> #endif
> @@ -7746,10 +7667,6 @@ void __init sched_init(void)
>
> #endif /* CONFIG_CGROUP_SCHED */
>
> -#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
> - update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
> - __alignof__(unsigned long));
> -#endif
> for_each_possible_cpu(i) {
> struct rq *rq;
>
> @@ -8317,8 +8234,7 @@ static void __set_se_shares(struct sched
> if (on_rq)
> dequeue_entity(cfs_rq, se, 0);
>
> - se->load.weight = shares;
> - se->load.inv_weight = 0;
> + update_load_set(&se->load, shares);
>
> if (on_rq)
> enqueue_entity(cfs_rq, se, 0);
> @@ -8375,7 +8291,6 @@ int sched_group_set_shares(struct task_g
> /*
> * force a rebalance
> */
> - cfs_rq_set_shares(tg->cfs_rq[i], 0);
> set_se_shares(tg->se[i], shares);
I think an update_cfs_shares() is wanted here instead; this will
potentially over-commit everything until we hit tg_shares_up() (e.g. the
long-running-task case).
Ironically, the heavyweight full enqueue/dequeue in the
__set_se_shares() path will actually fix up the weights, ignoring the
passed weight, in the se->on_rq case.
I think both functions can be knocked out and just replaced with a
<lock> <update load> <update shares> <unlock>
Although.. for total correctness this update should probably be hierarchical.
> }
>
> Index: linux-2.6/kernel/sched_debug.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_debug.c
> +++ linux-2.6/kernel/sched_debug.c
> @@ -204,13 +204,18 @@ void print_cfs_rq(struct seq_file *m, in
> SPLIT_NS(spread0));
> SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
> SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
> + SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_avg",
> + SPLIT_NS(cfs_rq->load_avg));
> + SEQ_printf(m, " .%-30s: %Ld.%06ld\n", "load_period",
> + SPLIT_NS(cfs_rq->load_period));
> + SEQ_printf(m, " .%-30s: %ld\n", "load_contrib",
> + cfs_rq->load_contribution);
> + SEQ_printf(m, " .%-30s: %d\n", "load_tg",
> + atomic_read(&tg->load_weight));
>
> SEQ_printf(m, " .%-30s: %d\n", "nr_spread_over",
> cfs_rq->nr_spread_over);
> #ifdef CONFIG_FAIR_GROUP_SCHED
> -#ifdef CONFIG_SMP
> - SEQ_printf(m, " .%-30s: %lu\n", "shares", cfs_rq->shares);
> -#endif
> print_cfs_group_stats(m, cpu, cfs_rq->tg);
> #endif
> }
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
> WRT_SYSCTL(sched_min_granularity);
> WRT_SYSCTL(sched_latency);
> WRT_SYSCTL(sched_wakeup_granularity);
> - WRT_SYSCTL(sched_shares_ratelimit);
> #undef WRT_SYSCTL
>
> return 0;
> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
> list_add(&se->group_node, &cfs_rq->tasks);
> }
> cfs_rq->nr_running++;
> - se->on_rq = 1;
> }
>
> static void
> @@ -647,9 +645,84 @@ account_entity_dequeue(struct cfs_rq *cf
> list_del_init(&se->group_node);
> }
> cfs_rq->nr_running--;
> - se->on_rq = 0;
> }
>
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static void update_cfs_load(struct cfs_rq *cfs_rq)
> +{
> + u64 period = sched_avg_period();
This is a pretty large history window; while it should overlap the
update period for obvious reasons, intuition suggests a smaller window
(e.g. 2 x sched_latency) would probably be preferable here in terms of
reducing over-commit and reducing convergence time.
I'll run some benchmarks and see how it impacts fairness.
> + u64 now = rq_of(cfs_rq)->clock;
> + u64 delta = now - cfs_rq->load_stamp;
> +
Is it meaningful/useful to maintain cfs_rq->load for the rq->cfs_rq case?
> + cfs_rq->load_stamp = now;
> + cfs_rq->load_period += delta;
> + cfs_rq->load_avg += delta * cfs_rq->load.weight;
> +
> + while (cfs_rq->load_period > period) {
> + /*
> + * Inline assembly required to prevent the compiler
> + * optimising this loop into a divmod call.
> + * See __iter_div_u64_rem() for another example of this.
> + */
> + asm("" : "+rm" (cfs_rq->load_period));
> + cfs_rq->load_period /= 2;
> + cfs_rq->load_avg /= 2;
> + }
> +}
> +
> +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> + unsigned long weight)
> +{
> + if (se->on_rq)
> + account_entity_dequeue(cfs_rq, se);
> +
> + update_load_set(&se->load, weight);
> +
> + if (se->on_rq)
> + account_entity_enqueue(cfs_rq, se);
> +}
> +
> +static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +{
> + struct task_group *tg;
> + struct sched_entity *se;
> + unsigned long load_weight, load, shares;
> +
> + if (!cfs_rq)
> + return;
> +
> + tg = cfs_rq->tg;
> + se = tg->se[cpu_of(rq_of(cfs_rq))];
> + if (!se)
> + return;
> +
> + load = cfs_rq->load.weight;
> +
> + load_weight = atomic_read(&tg->load_weight);
> + load_weight -= cfs_rq->load_contribution;
> + load_weight += load;
> +
> + shares = (tg->shares * load);
> + if (load_weight)
> + shares /= load_weight;
> +
> + if (shares < MIN_SHARES)
> + shares = MIN_SHARES;
> + if (shares > tg->shares)
> + shares = tg->shares;
> +
> + reweight_entity(cfs_rq_of(se), se, shares);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline void update_cfs_load(struct cfs_rq *cfs_rq)
> +{
> +}
> +
> +static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
> +{
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
> static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> #ifdef CONFIG_SCHEDSTATS
> @@ -771,7 +844,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> + update_cfs_load(cfs_rq);
> account_entity_enqueue(cfs_rq, se);
> + update_cfs_shares(group_cfs_rq(se));
Don't we want to be updating the queuing cfs_rq's shares here?
The owned cfs_rq's share proportion isn't going to change as a result
of being enqueued -- and is guaranteed to be hit by a previous queuing
cfs_rq update in the initial enqueue case.
>
> if (flags & ENQUEUE_WAKEUP) {
> place_entity(cfs_rq, se, 0);
> @@ -782,6 +857,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> check_spread(cfs_rq, se);
> if (se != cfs_rq->curr)
> __enqueue_entity(cfs_rq, se);
> + se->on_rq = 1;
> }
>
> static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -825,8 +901,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>
> if (se != cfs_rq->curr)
> __dequeue_entity(cfs_rq, se);
> + se->on_rq = 0;
> + update_cfs_load(cfs_rq);
> account_entity_dequeue(cfs_rq, se);
> update_min_vruntime(cfs_rq);
> + update_cfs_shares(group_cfs_rq(se));
>
> /*
> * Normalize the entity after updating the min_vruntime because the
> @@ -1055,6 +1134,9 @@ enqueue_task_fair(struct rq *rq, struct
> flags = ENQUEUE_WAKEUP;
> }
>
> + for_each_sched_entity(se)
> + update_cfs_shares(group_cfs_rq(se));
If the queuing cfs_rq is used above, then group_cfs_rq() is redundant
here; cfs_rq_of() can be used.
Also, the respective load should be updated here.
> +
> hrtick_update(rq);
> }
>
> @@ -1071,12 +1153,16 @@ static void dequeue_task_fair(struct rq
> for_each_sched_entity(se) {
> cfs_rq = cfs_rq_of(se);
> dequeue_entity(cfs_rq, se, flags);
> +
> /* Don't dequeue parent if it has other entities besides us */
> if (cfs_rq->load.weight)
> break;
> flags |= DEQUEUE_SLEEP;
> }
>
> + for_each_sched_entity(se)
> + update_cfs_shares(group_cfs_rq(se));
> +
> hrtick_update(rq);
> }
>
> @@ -1143,51 +1229,20 @@ static void task_waking_fair(struct rq *
> * Adding load to a group doesn't make a group heavier, but can cause movement
> * of group shares between cpus. Assuming the shares were perfectly aligned one
> * can calculate the shift in shares.
> - *
> - * The problem is that perfectly aligning the shares is rather expensive, hence
> - * we try to avoid doing that too often - see update_shares(), which ratelimits
> - * this change.
> - *
> - * We compensate this by not only taking the current delta into account, but
> - * also considering the delta between when the shares were last adjusted and
> - * now.
> - *
> - * We still saw a performance dip, some tracing learned us that between
> - * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
> - * significantly. Therefore try to bias the error in direction of failing
> - * the affine wakeup.
> - *
> */
> -static long effective_load(struct task_group *tg, int cpu,
> - long wl, long wg)
> +static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> {
> struct sched_entity *se = tg->se[cpu];
>
> if (!tg->parent)
> return wl;
>
> - /*
> - * By not taking the decrease of shares on the other cpu into
> - * account our error leans towards reducing the affine wakeups.
> - */
> - if (!wl && sched_feat(ASYM_EFF_LOAD))
> - return wl;
> -
> for_each_sched_entity(se) {
> long S, rw, s, a, b;
> - long more_w;
>
> - /*
> - * Instead of using this increment, also add the difference
> - * between when the shares were last updated and now.
> - */
> - more_w = se->my_q->load.weight - se->my_q->rq_weight;
> - wl += more_w;
> - wg += more_w;
> -
> - S = se->my_q->tg->shares;
> - s = se->my_q->shares;
> - rw = se->my_q->rq_weight;
> + S = tg->shares;
> + s = se->load.weight;
> + rw = se->my_q->load.weight;
>
> a = S*(rw + wl);
> b = S*rw + s*wg;
> @@ -1509,23 +1564,6 @@ select_task_rq_fair(struct rq *rq, struc
> sd = tmp;
> }
>
> -#ifdef CONFIG_FAIR_GROUP_SCHED
> - if (sched_feat(LB_SHARES_UPDATE)) {
> - /*
> - * Pick the largest domain to update shares over
> - */
> - tmp = sd;
> - if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
> - tmp = affine_sd;
> -
> - if (tmp) {
> - raw_spin_unlock(&rq->lock);
> - update_shares(tmp);
> - raw_spin_lock(&rq->lock);
> - }
> - }
> -#endif
> -
> if (affine_sd) {
> if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
> return select_idle_sibling(p, cpu);
> @@ -2980,7 +3018,6 @@ static int load_balance(int this_cpu, st
> schedstat_inc(sd, lb_count[idle]);
>
> redo:
> - update_shares(sd);
> group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
> cpus, balance);
>
> @@ -3115,8 +3152,6 @@ out_one_pinned:
> else
> ld_moved = 0;
> out:
> - if (ld_moved)
> - update_shares(sd);
> return ld_moved;
> }
>
> @@ -3510,6 +3545,8 @@ static void rebalance_domains(int cpu, e
> int update_next_balance = 0;
> int need_serialize;
>
> + update_shares(cpu);
> +
This may not be frequent enough, especially in the dilated cpus-busy case
> for_each_domain(cpu, sd) {
> if (!(sd->flags & SD_LOAD_BALANCE))
> continue;
> Index: linux-2.6/kernel/sched_features.h
> ===================================================================
> --- linux-2.6.orig/kernel/sched_features.h
> +++ linux-2.6/kernel/sched_features.h
> @@ -52,8 +52,6 @@ SCHED_FEAT(ARCH_POWER, 0)
> SCHED_FEAT(HRTICK, 0)
> SCHED_FEAT(DOUBLE_TICK, 0)
> SCHED_FEAT(LB_BIAS, 1)
> -SCHED_FEAT(LB_SHARES_UPDATE, 1)
> -SCHED_FEAT(ASYM_EFF_LOAD, 1)
>
> /*
> * Spin-wait on mutex acquisition when the mutex owner is running on
> Index: linux-2.6/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.orig/kernel/sysctl.c
> +++ linux-2.6/kernel/sysctl.c
> @@ -307,15 +307,6 @@ static struct ctl_table kern_table[] = {
> .extra2 = &max_wakeup_granularity_ns,
> },
> {
> - .procname = "sched_shares_ratelimit",
> - .data = &sysctl_sched_shares_ratelimit,
> - .maxlen = sizeof(unsigned int),
> - .mode = 0644,
> - .proc_handler = sched_proc_update_handler,
> - .extra1 = &min_sched_shares_ratelimit,
> - .extra2 = &max_sched_shares_ratelimit,
> - },
> - {
> .procname = "sched_tunable_scaling",
> .data = &sysctl_sched_tunable_scaling,
> .maxlen = sizeof(enum sched_tunable_scaling),
> @@ -325,14 +316,6 @@ static struct ctl_table kern_table[] = {
> .extra2 = &max_sched_tunable_scaling,
> },
> {
> - .procname = "sched_shares_thresh",
> - .data = &sysctl_sched_shares_thresh,
> - .maxlen = sizeof(unsigned int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec_minmax,
> - .extra1 = &zero,
> - },
> - {
> .procname = "sched_migration_cost",
> .data = &sysctl_sched_migration_cost,
> .maxlen = sizeof(unsigned int),
>
>
>
* Re: [RFC][PATCH 2/3] sched: On-demand cfs_rq list
2010-08-28 22:30 ` [RFC][PATCH 2/3] sched: On-demand cfs_rq list Peter Zijlstra
@ 2010-09-03 3:33 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Paul Turner @ 2010-09-03 3:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Sat, Aug 28, 2010 at 3:30 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Make certain load-balance actions scale per number of active cgroups
> instead of the number of existing cgroups.
>
> This makes wakeup/sleep paths more expensive, but is a win for systems
> where the vast majority of existing cgroups are idle.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> kernel/sched.c | 86 +++++++++-------------------------------------------
> kernel/sched_fair.c | 41 ++++++++++++++++++++++--
> kernel/sched_rt.c | 24 ++++++++++++++
> 3 files changed, 76 insertions(+), 75 deletions(-)
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -344,6 +344,7 @@ struct cfs_rq {
> * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
> * list is used during load balance.
> */
> + int on_list;
> struct list_head leaf_cfs_rq_list;
> struct task_group *tg; /* group that "owns" this runqueue */
>
> @@ -1529,7 +1530,7 @@ static unsigned long cpu_avg_load_per_ta
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>
> -static void update_cfs_load(struct cfs_rq *cfs_rq);
> +static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
> static void update_cfs_shares(struct cfs_rq *cfs_rq);
>
> /*
> @@ -1552,7 +1553,7 @@ static int tg_shares_up(struct task_grou
> raw_spin_lock_irqsave(&rq->lock, flags);
>
> update_rq_clock(rq);
> - update_cfs_load(cfs_rq);
> + update_cfs_load(cfs_rq, 1);
>
> load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
> load_avg -= cfs_rq->load_contribution;
> @@ -7553,15 +7554,13 @@ static void init_rt_rq(struct rt_rq *rt_
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
> - struct sched_entity *se, int cpu, int add,
> + struct sched_entity *se, int cpu,
> struct sched_entity *parent)
> {
> struct rq *rq = cpu_rq(cpu);
> tg->cfs_rq[cpu] = cfs_rq;
> init_cfs_rq(cfs_rq, rq);
> cfs_rq->tg = tg;
> - if (add)
> - list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
>
> tg->se[cpu] = se;
> /* se could be NULL for init_task_group */
> @@ -7581,7 +7580,7 @@ static void init_tg_cfs_entry(struct tas
>
> #ifdef CONFIG_RT_GROUP_SCHED
> static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
> - struct sched_rt_entity *rt_se, int cpu, int add,
> + struct sched_rt_entity *rt_se, int cpu,
> struct sched_rt_entity *parent)
> {
> struct rq *rq = cpu_rq(cpu);
> @@ -7590,8 +7589,6 @@ static void init_tg_rt_entry(struct task
> init_rt_rq(rt_rq, rq);
> rt_rq->tg = tg;
> rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
> - if (add)
> - list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
>
> tg->rt_se[cpu] = rt_se;
> if (!rt_se)
> @@ -7700,7 +7697,7 @@ void __init sched_init(void)
> * We achieve this by letting init_task_group's tasks sit
> * directly in rq->cfs (i.e init_task_group->se[] = NULL).
> */
> - init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
> + init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, NULL);
> #endif
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> @@ -7708,7 +7705,7 @@ void __init sched_init(void)
> #ifdef CONFIG_RT_GROUP_SCHED
> INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
> #ifdef CONFIG_CGROUP_SCHED
> - init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
> + init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, NULL);
> #endif
> #endif
>
> @@ -7984,7 +7981,7 @@ int alloc_fair_sched_group(struct task_g
> if (!se)
> goto err_free_rq;
>
> - init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
> + init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
> }
>
> return 1;
> @@ -8015,14 +8012,6 @@ int alloc_fair_sched_group(struct task_g
> {
> return 1;
> }
> -
> -static inline void register_fair_sched_group(struct task_group *tg, int cpu)
> -{
> -}
> -
> -static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
> -{
> -}
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> #ifdef CONFIG_RT_GROUP_SCHED
> @@ -8074,7 +8063,7 @@ int alloc_rt_sched_group(struct task_gro
> if (!rt_se)
> goto err_free_rq;
>
> - init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent->rt_se[i]);
> + init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
> }
>
> return 1;
> @@ -8084,17 +8073,6 @@ int alloc_rt_sched_group(struct task_gro
> err:
> return 0;
> }
> -
> -static inline void register_rt_sched_group(struct task_group *tg, int cpu)
> -{
> - list_add_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list,
> - &cpu_rq(cpu)->leaf_rt_rq_list);
> -}
> -
> -static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
> -{
> - list_del_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list);
> -}
> #else /* !CONFIG_RT_GROUP_SCHED */
> static inline void free_rt_sched_group(struct task_group *tg)
> {
> @@ -8105,14 +8083,6 @@ int alloc_rt_sched_group(struct task_gro
> {
> return 1;
> }
> -
> -static inline void register_rt_sched_group(struct task_group *tg, int cpu)
> -{
> -}
> -
> -static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
> -{
> -}
> #endif /* CONFIG_RT_GROUP_SCHED */
>
> #ifdef CONFIG_CGROUP_SCHED
> @@ -8128,7 +8098,6 @@ struct task_group *sched_create_group(st
> {
> struct task_group *tg;
> unsigned long flags;
> - int i;
>
> tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> if (!tg)
> @@ -8141,10 +8110,6 @@ struct task_group *sched_create_group(st
> goto err;
>
> spin_lock_irqsave(&task_group_lock, flags);
> - for_each_possible_cpu(i) {
> - register_fair_sched_group(tg, i);
> - register_rt_sched_group(tg, i);
> - }
> list_add_rcu(&tg->list, &task_groups);
>
> WARN_ON(!parent); /* root should already exist */
> @@ -8175,10 +8140,12 @@ void sched_destroy_group(struct task_gro
> int i;
>
> spin_lock_irqsave(&task_group_lock, flags);
> - for_each_possible_cpu(i) {
> - unregister_fair_sched_group(tg, i);
> - unregister_rt_sched_group(tg, i);
> - }
> + /*
> + * XXX should not be a race against enqueue, even without rq->lock
> + * because only empty groups can be destroyed.
> + */
> + for_each_possible_cpu(i)
> + list_del_leaf_cfs_rq(tg->cfs_rq[i]);
> list_del_rcu(&tg->list);
> list_del_rcu(&tg->siblings);
> spin_unlock_irqrestore(&task_group_lock, flags);
> @@ -8256,7 +8223,6 @@ static DEFINE_MUTEX(shares_mutex);
> int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> {
> int i;
> - unsigned long flags;
>
> /*
> * We can't change the weight of the root cgroup.
> @@ -8273,19 +8239,6 @@ int sched_group_set_shares(struct task_g
> if (tg->shares == shares)
> goto done;
>
> - spin_lock_irqsave(&task_group_lock, flags);
> - for_each_possible_cpu(i)
> - unregister_fair_sched_group(tg, i);
> - list_del_rcu(&tg->siblings);
> - spin_unlock_irqrestore(&task_group_lock, flags);
> -
> - /* wait for any ongoing reference to this group to finish */
> - synchronize_sched();
> -
> - /*
> - * Now we are free to modify the group's share on each cpu
> - * w/o tripping rebalance_share or load_balance_fair.
> - */
> tg->shares = shares;
> for_each_possible_cpu(i) {
> /*
> @@ -8294,15 +8247,6 @@ int sched_group_set_shares(struct task_g
> set_se_shares(tg->se[i], shares);
> }
>
> - /*
> - * Enable load balance activity on this group, by inserting it back on
> - * each cpu's rq->leaf_cfs_rq_list.
> - */
> - spin_lock_irqsave(&task_group_lock, flags);
> - for_each_possible_cpu(i)
> - register_fair_sched_group(tg, i);
> - list_add_rcu(&tg->siblings, &tg->parent->children);
> - spin_unlock_irqrestore(&task_group_lock, flags);
> done:
> mutex_unlock(&shares_mutex);
> return 0;
> Index: linux-2.6/kernel/sched_fair.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_fair.c
> +++ linux-2.6/kernel/sched_fair.c
> @@ -143,6 +143,23 @@ static inline struct cfs_rq *cpu_cfs_rq(
> return cfs_rq->tg->cfs_rq[this_cpu];
> }
>
> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + if (!cfs_rq->on_list) {
> + list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
> + &rq_of(cfs_rq)->leaf_cfs_rq_list);
> + cfs_rq->on_list = 1;
> + }
> +}
> +
> +static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> + if (cfs_rq->on_list) {
> + list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> + cfs_rq->on_list = 0;
> + }
> +}
> +
> /* Iterate thr' all leaf cfs_rq's on a runqueue */
> #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> @@ -246,6 +263,14 @@ static inline struct cfs_rq *cpu_cfs_rq(
> return &cpu_rq(this_cpu)->cfs;
> }
>
> +static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +}
> +
> +static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +}
> +
> #define for_each_leaf_cfs_rq(rq, cfs_rq) \
> for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
> @@ -648,7 +673,7 @@ account_entity_dequeue(struct cfs_rq *cf
> }
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
> -static void update_cfs_load(struct cfs_rq *cfs_rq)
> +static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
> {
> u64 period = sched_avg_period();
> u64 now = rq_of(cfs_rq)->clock;
> @@ -668,6 +693,11 @@ static void update_cfs_load(struct cfs_r
> cfs_rq->load_period /= 2;
> cfs_rq->load_avg /= 2;
> }
> +
> + if (lb && !cfs_rq->nr_running) {
> + if (cfs_rq->load_period < (period / 8))
> + list_del_leaf_cfs_rq(cfs_rq);
> + }
> }
>
In the case of zero load, load_avg above will decay; however,
load_period will remain in the range [period/2, period], so any entity
that has passed period/8 time will remain on the leaf rq list.
The lb condition could also be relaxed in the dequeue_entity case.
> static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> @@ -729,7 +759,7 @@ static void update_cfs_shares(struct cfs
> reweight_entity(cfs_rq_of(se), se, shares);
> }
> #else /* CONFIG_FAIR_GROUP_SCHED */
> -static inline void update_cfs_load(struct cfs_rq *cfs_rq)
> +static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
> {
> }
>
> @@ -859,7 +889,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> * Update run-time statistics of the 'current'.
> */
> update_curr(cfs_rq);
> - update_cfs_load(cfs_rq);
> + update_cfs_load(cfs_rq, 0);
> account_entity_enqueue(cfs_rq, se);
> update_cfs_shares(group_cfs_rq(se));
>
> @@ -873,6 +903,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> if (se != cfs_rq->curr)
> __enqueue_entity(cfs_rq, se);
> se->on_rq = 1;
> +
> + if (cfs_rq->nr_running == 1)
> + list_add_leaf_cfs_rq(cfs_rq);
> }
>
> static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -917,7 +950,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> if (se != cfs_rq->curr)
> __dequeue_entity(cfs_rq, se);
> se->on_rq = 0;
> - update_cfs_load(cfs_rq);
> + update_cfs_load(cfs_rq, 0);
> account_entity_dequeue(cfs_rq, se);
> update_min_vruntime(cfs_rq);
> update_cfs_shares(group_cfs_rq(se));
> Index: linux-2.6/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_rt.c
> +++ linux-2.6/kernel/sched_rt.c
> @@ -183,6 +183,17 @@ static inline u64 sched_rt_period(struct
> return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
> }
>
> +static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
> +{
> + list_add_rcu(&rt_rq->leaf_rt_rq_list,
> + &rq_of_rt_rq(rt_rq)->leaf_rt_rq_list);
> +}
> +
> +static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
> +{
> + list_del_rcu(&rt_rq->leaf_rt_rq_list);
> +}
> +
> #define for_each_leaf_rt_rq(rt_rq, rq) \
> list_for_each_entry_rcu(rt_rq, &rq->leaf_rt_rq_list, leaf_rt_rq_list)
>
> @@ -276,6 +287,14 @@ static inline u64 sched_rt_period(struct
> return ktime_to_ns(def_rt_bandwidth.rt_period);
> }
>
> +static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
> +{
> +}
> +
> +static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
> +{
> +}
> +
> #define for_each_leaf_rt_rq(rt_rq, rq) \
> for (rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
>
> @@ -825,6 +844,9 @@ static void __enqueue_rt_entity(struct s
> if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
> return;
>
> + if (!rt_rq->rt_nr_running)
> + list_add_leaf_rt_rq(rt_rq);
> +
> if (head)
> list_add(&rt_se->run_list, queue);
> else
> @@ -844,6 +866,8 @@ static void __dequeue_rt_entity(struct s
> __clear_bit(rt_se_prio(rt_se), array->bitmap);
>
> dec_rt_tasks(rt_se, rt_rq);
> + if (!rt_rq->rt_nr_running)
> + list_del_leaf_rt_rq(rt_rq);
> }
>
> /*
>
>
>
* Re: [RFC][PATCH 2/3] sched: On-demand cfs_rq list
2010-09-03 3:33 ` Paul Turner
@ 2010-09-03 7:59 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-03 7:59 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Fri, 2010-09-03 at 04:33 +0100, Paul Turner wrote:
> > @@ -668,6 +693,11 @@ static void update_cfs_load(struct cfs_r
> > cfs_rq->load_period /= 2;
> > cfs_rq->load_avg /= 2;
> > }
> > +
> > + if (lb && !cfs_rq->nr_running) {
> > + if (cfs_rq->load_period < (period / 8))
> > + list_del_leaf_cfs_rq(cfs_rq);
> > + }
> > }
> >
>
> In the case of zero load, load_avg above will decay; however,
> load_period will remain in the range [period/2, period], so any entity
> that has passed period/8 time will remain on the leaf rq list.
Right, I already changed that to cfs_rq->load_avg < (period / 8) after
you pointed that out on IRC.
> The lb condition could also be relaxed in the dequeue_entity case.
>
Possibly yeah, we'll have to see if that makes much difference.
* Re: [RFC][PATCH 1/3] sched: Rewrite tg_shares_up
2010-09-03 3:09 ` Paul Turner
@ 2010-09-03 7:59 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-03 7:59 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Fri, 2010-09-03 at 04:09 +0100, Paul Turner wrote:
> > @@ -7652,8 +7574,7 @@ static void init_tg_cfs_entry(struct tas
> > se->cfs_rq = parent->my_q;
> >
> > se->my_q = cfs_rq;
> > - se->load.weight = tg->shares;
> > - se->load.inv_weight = 0;
> > + update_load_set(&se->load, tg->shares);
>
> Given the now-instantaneous update of shares->load on enqueue/dequeue,
> initialization to 0 would result in sane(r) sums across tg->se->load.
> Only relevant for debug though.
Ah, indeed.
> > @@ -8375,7 +8291,6 @@ int sched_group_set_shares(struct task_g
> > /*
> > * force a rebalance
> > */
> > - cfs_rq_set_shares(tg->cfs_rq[i], 0);
> > set_se_shares(tg->se[i], shares);
>
> I think a update_cfs_shares is wanted instead here, this will
> potentially over-commit everything until we hit tg_shares_up (e.g.
> long running task case).
>
> Ironically, the heavy weight full enqueue/dequeue in the
> __set_se_shares path will actually fix up the weights ignoring the
> passed weight for the se->on_rq case.
>
> I think both functions can be knocked out and just replaced with a
> <lock> <update load> <update shares> <unlock>
>
> Although.. for total correctness this update should probably be hierarchical.
Right, I just didn't want to bother too much with this code yet, getting
it to more or less not explode when changing weights was good 'nuff.
> > +#ifdef CONFIG_FAIR_GROUP_SCHED
> > +static void update_cfs_load(struct cfs_rq *cfs_rq)
> > +{
> > + u64 period = sched_avg_period();
>
> This is a pretty large history window; while it should overlap the
> update period for obvious reasons, intuition suggests a smaller window
> (e.g. 2 x sched_latency) would probably be preferable here in terms of
> reducing over-commit and reducing convergence time.
>
> I'll run some benchmarks and see how it impacts fairness.
Agreed, maybe even as small as 2*TICK_NSEC; it's certainly something we
want to play with, which is basically why I picked the variable that
already had a sysctl knob ;-)
> > + u64 now = rq_of(cfs_rq)->clock;
> > + u64 delta = now - cfs_rq->load_stamp;
> > +
>
> Is it meaningful/useful to maintain cfs_rq->load for the rq->cfs_rq case?
Probably not... I had ideas of maybe using this load_avg for other
things, but then, maybe not.
> > @@ -771,7 +844,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> > * Update run-time statistics of the 'current'.
> > */
> > update_curr(cfs_rq);
> > + update_cfs_load(cfs_rq);
> > account_entity_enqueue(cfs_rq, se);
> > + update_cfs_shares(group_cfs_rq(se));
>
> Don't we want to be updating the queuing cfs_rq's shares here?
>
> The owned cfs_rq's share proportion isn't going to change as a result
> of being enqueued -- and is guaranteed to be hit by a previous queuing
> cfs_rq update in the initial enqueue case.
Right, I had that; it didn't work because... uhm... /me scratches
head... Ah! Yes, you need the queueing cfs_rq's group to be already
enqueued. So instead of updating ahead, we update backwards.
> > @@ -1055,6 +1134,9 @@ enqueue_task_fair(struct rq *rq, struct
> > flags = ENQUEUE_WAKEUP;
> > }
> >
> > + for_each_sched_entity(se)
> > + update_cfs_shares(group_cfs_rq(se));
>
> If the queuing cfs_rq is used above, then group_cfs_rq() is redundant
> here; cfs_rq_of() can be used.
>
> Also, the respective load should be updated here.
Ah, indeed, that wants an update_cfs_load() as well. /me does
> > @@ -3510,6 +3545,8 @@ static void rebalance_domains(int cpu, e
> > int update_next_balance = 0;
> > int need_serialize;
> >
> > + update_shares(cpu);
> > +
>
> This may not be frequent enough, especially in the dilated cpus-busy case
Not exactly sure what you mean, but if there's wakeup/sleep activity,
that activity will already rebalance for us; it's only the purely
long-running jobs that remain, so once a tick should suffice, no?
* Re: [RFC][PATCH 3/3] sched: On-demand tg_shares_up()
2010-09-03 1:52 ` Paul Turner
@ 2010-09-03 7:59 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2010-09-03 7:59 UTC (permalink / raw)
To: Paul Turner
Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
Vaidyanathan Srinivasan, Pierre Bourdon
On Fri, 2010-09-03 at 02:52 +0100, Paul Turner wrote:
> > +static void update_shares(int cpu)
> > +{
> > + struct cfs_rq *cfs_rq;
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + rcu_read_lock();
> > + for_each_leaf_cfs_rq(rq, cfs_rq) {
> > + struct task_group *tg = cfs_rq->tg;
> > +
> > + do {
> > + tg_shares_up(tg, cpu);
> > + tg = tg->parent;
> > + } while (tg);
> > + }
>
> This will visit taskgroups multiple times:
>
> In the case of a-b-task, both {a} and {b} will be on the leaf cfs_rq
> list. Resulting in a being visited both as b's parent and as a leaf
> entity.
>
> Rather than juggling to avoid this, it seems easier to maintain an
> ordering on rq->leaf_cfs_rq_list.
>
> That is:
>
> When an entity is added:
> a) If it has a parent, insert it immediately before the parent in the list.
> b) Otherwise it is attached to the root, attach it to the tail of
> rq->leaf_cfs_rq_list
>
> This can actually be simplified to: if no parent, insert at the tail;
> otherwise insert at the head, since we know the parent will always
> have been processed prior to the child.
Except we enqueue things bottom up, so: A-B-C-task would end up as:
add(C) - has parent on head: C
add(B) - has parent on head: B-C
add(A) - has no parent, on tail: B-C-A
Whereas we'd wanted: C-B-A
Which also means your rule (a) doesn't work: you can't enqueue it before
the parent if the parent itself isn't yet enqueued.
> Traversing the list in order should then ensure that all child
> entities have been processed for the 'next' entity at any given time
> and that its update is coherent.
Right. I'm sure we can get something like this to work; I just need to
come up with something that actually works, and a cold has currently
stopped everything but the very basic brain functions :/
end of thread, newest: 2010-09-03 8:00 UTC
Thread overview: 12+ messages
2010-08-28 22:30 [RFC][PATCH 0/3] Try and make cpu-cgroup suck less Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 1/3] sched: Rewrite tg_shares_up Peter Zijlstra
2010-08-30 17:20 ` Srivatsa Vaddagiri
2010-08-30 17:53 ` Peter Zijlstra
2010-09-03 3:09 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 2/3] sched: On-demand cfs_rq list Peter Zijlstra
2010-09-03 3:33 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra
2010-08-28 22:30 ` [RFC][PATCH 3/3] sched: On-demand tg_shares_up() Peter Zijlstra
2010-09-03 1:52 ` Paul Turner
2010-09-03 7:59 ` Peter Zijlstra