public inbox for linux-kernel@vger.kernel.org
* [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
@ 2010-10-16  4:43 pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
                   ` (14 more replies)
  0 siblings, 15 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

Hi all,

Peter previously posted a patchset attacking the problem of task_group share
distribution, which has long been a pain point for group scheduling.  The
existing algorithm computes the distribution on a per-cpu, per-domain basis
and carries a fairly high update overhead, especially on larger machines.

I had previously been looking at improving this with Fenwick trees, to allow
a single summation without the exorbitant cost, but Peter's approach above
turned out to be better :).

The kernel of the idea is that by monitoring the average contribution to load
on a per-cpu, per-task_group basis we can give each cpu the portion of the
group's weight it is expected to consume.
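
Concretely, each cpu can then compute its slice of tg->shares from its own
load versus the group-wide sum, roughly (clamping and the load averaging
itself omitted; see update_cfs_shares() in patch 01):

    tg_weight  = atomic_read(&tg->load_weight);   /* sum of per-cpu contributions */
    tg_weight += cfs_rq->load.weight - cfs_rq->load_contribution;
    shares     = tg->shares * cfs_rq->load.weight / tg_weight;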

This set extends the original posting with a focus on increased fairness and
a reduced time to converge (to the true average).  In particular, heavy
over-commit combined with a distributed wake-up was a concern and is now
fairly well addressed.

Obviously everything's experimental but it should be stable/fair.

Some motivation:

24 thread intel box, 150 active cgroups, multiple threads/group, load at ~90% (10 second sample):
tip:
     2.64%  [k] tg_shares_up <!>
     0.15%  [k] __set_se_shares

patched:
     0.02%  [k] update_cfs_load
     0.01%  [k] update_cpu_load
     0.00%  [k] update_cfs_shares

Some fairness coverage for the above at: http://rs5.risingnet.net/~pjt/patches/shares_data_v1.txt

Note: The last patch is fairly obviously a temporary debug patch; I only
include it because it interfaces with some analysis scripts I'm simultaneously
trying to publish for validating this series.  Since this approach estimates
the share distribution, the spread between issued shares and the target is an
important factor to watch until people are happy with the patchset.

Paul

TODO:
- Validate any RT interaction
- Continue collecting/analyzing performance and fairness data
- Should the shares period just be the sched_latency?





* [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-21  6:04   ` Bharata B Rao
                     ` (2 more replies)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 02/12] sched: on-demand (active) cfs_rq list pjt
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-rework-tg_shares_up.patch --]
[-- Type: text/plain, Size: 19077 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

By tracking a per-cpu load-avg for each cfs_rq and folding it into a
global task_group load on each tick we can rework tg_shares_up to be
strictly per-cpu.

This should improve cpu-cgroup performance for smp systems
significantly.
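
In outline, the per-cpu fold becomes (a sketch of tg_shares_up() below;
locking and the rq clock update elided):

    load_avg = cfs_rq->load_avg / (cfs_rq->load_period + 1) - cfs_rq->load_contribution;
    atomic_add(load_avg, &tg->load_weight);
    cfs_rq->load_contribution += load_avg;
    update_cfs_shares(cfs_rq);    /* re-weight this cpu's group se from the new total */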

[ Paul: changed to use queueing cfs_rq ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h   |    2 
 kernel/sched.c          |  173 ++++++++++++------------------------------------
 kernel/sched_debug.c    |   15 +++-
 kernel/sched_fair.c     |  166 +++++++++++++++++++++++++++++-----------------
 kernel/sched_features.h |    2 
 kernel/sysctl.c         |   17 ----
 6 files changed, 163 insertions(+), 212 deletions(-)

Index: include/linux/sched.h
===================================================================
--- include/linux/sched.h.orig
+++ include/linux/sched.h
@@ -1868,8 +1868,6 @@ static inline void wake_up_idle_cpu(int 
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
-extern unsigned int sysctl_sched_shares_ratelimit;
-extern unsigned int sysctl_sched_shares_thresh;
 extern unsigned int sysctl_sched_child_runs_first;
 
 enum sched_tunable_scaling {
Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -253,6 +253,8 @@ struct task_group {
 	/* runqueue "owned" by this group on each cpu */
 	struct cfs_rq **cfs_rq;
 	unsigned long shares;
+
+	atomic_t load_weight;
 #endif
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -359,15 +361,11 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
-	/*
-	 * this cpu's part of tg->shares
-	 */
-	unsigned long shares;
+	u64 load_avg;
+	u64 load_period;
+	u64 load_stamp;
 
-	/*
-	 * load.weight at the time we set shares
-	 */
-	unsigned long rq_weight;
+	unsigned long load_contribution;
 #endif
 #endif
 };
@@ -790,20 +788,6 @@ late_initcall(sched_init_debug);
 const_debug unsigned int sysctl_sched_nr_migrate = 32;
 
 /*
- * ratelimit for updating the group shares.
- * default: 0.25ms
- */
-unsigned int sysctl_sched_shares_ratelimit = 250000;
-unsigned int normalized_sysctl_sched_shares_ratelimit = 250000;
-
-/*
- * Inject some fuzzyness into changing the per-cpu group shares
- * this avoids remote rq-locks at the expense of fairness.
- * default: 4
- */
-unsigned int sysctl_sched_shares_thresh = 4;
-
-/*
  * period over which we average the RT time consumption, measured
  * in ms.
  *
@@ -1352,6 +1336,12 @@ static inline void update_load_sub(struc
 	lw->inv_weight = 0;
 }
 
+static inline void update_load_set(struct load_weight *lw, unsigned long w)
+{
+	lw->weight = w;
+	lw->inv_weight = 0;
+}
+
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
  * of tasks with abnormal "nice" values across CPUs the contribution that
@@ -1540,97 +1530,44 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static __read_mostly unsigned long __percpu *update_shares_data;
-
-static void __set_se_shares(struct sched_entity *se, unsigned long shares);
-
-/*
- * Calculate and set the cpu's group shares.
- */
-static void update_group_shares_cpu(struct task_group *tg, int cpu,
-				    unsigned long sd_shares,
-				    unsigned long sd_rq_weight,
-				    unsigned long *usd_rq_weight)
-{
-	unsigned long shares, rq_weight;
-	int boost = 0;
-
-	rq_weight = usd_rq_weight[cpu];
-	if (!rq_weight) {
-		boost = 1;
-		rq_weight = NICE_0_LOAD;
-	}
-
-	/*
-	 *             \Sum_j shares_j * rq_weight_i
-	 * shares_i =  -----------------------------
-	 *                  \Sum_j rq_weight_j
-	 */
-	shares = (sd_shares * rq_weight) / sd_rq_weight;
-	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
-
-	if (abs(shares - tg->se[cpu]->load.weight) >
-			sysctl_sched_shares_thresh) {
-		struct rq *rq = cpu_rq(cpu);
-		unsigned long flags;
-
-		raw_spin_lock_irqsave(&rq->lock, flags);
-		tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
-		tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
-		__set_se_shares(tg->se[cpu], shares);
-		raw_spin_unlock_irqrestore(&rq->lock, flags);
-	}
-}
+static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_shares(struct cfs_rq *cfs_rq);
 
 /*
- * Re-compute the task group their per cpu shares over the given domain.
- * This needs to be done in a bottom-up fashion because the rq weight of a
- * parent group depends on the shares of its child groups.
+ * update tg->load_weight by folding this cpu's load_avg
  */
 static int tg_shares_up(struct task_group *tg, void *data)
 {
-	unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
-	unsigned long *usd_rq_weight;
-	struct sched_domain *sd = data;
+	long load_avg;
+	struct cfs_rq *cfs_rq;
 	unsigned long flags;
-	int i;
+	int cpu = (long)data;
+	struct rq *rq;
 
-	if (!tg->se[0])
+	if (!tg->se[cpu])
 		return 0;
 
-	local_irq_save(flags);
-	usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
-
-	for_each_cpu(i, sched_domain_span(sd)) {
-		weight = tg->cfs_rq[i]->load.weight;
-		usd_rq_weight[i] = weight;
-
-		rq_weight += weight;
-		/*
-		 * If there are currently no tasks on the cpu pretend there
-		 * is one of average load so that when a new task gets to
-		 * run here it will not get delayed by group starvation.
-		 */
-		if (!weight)
-			weight = NICE_0_LOAD;
+	rq = cpu_rq(cpu);
+	cfs_rq = tg->cfs_rq[cpu];
 
-		sum_weight += weight;
-		shares += tg->cfs_rq[i]->shares;
-	}
+	raw_spin_lock_irqsave(&rq->lock, flags);
 
-	if (!rq_weight)
-		rq_weight = sum_weight;
+	update_rq_clock(rq);
+	update_cfs_load(cfs_rq);
 
-	if ((!shares && rq_weight) || shares > tg->shares)
-		shares = tg->shares;
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
 
-	if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
-		shares = tg->shares;
+	atomic_add(load_avg, &tg->load_weight);
+	cfs_rq->load_contribution += load_avg;
 
-	for_each_cpu(i, sched_domain_span(sd))
-		update_group_shares_cpu(tg, i, shares, rq_weight, usd_rq_weight);
+	/*
+	 * We need to update shares after updating tg->load_weight in
+	 * order to adjust the weight of groups with long running tasks.
+	 */
+	update_cfs_shares(cfs_rq);
 
-	local_irq_restore(flags);
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
 
 	return 0;
 }
@@ -1649,7 +1586,7 @@ static int tg_load_down(struct task_grou
 		load = cpu_rq(cpu)->load.weight;
 	} else {
 		load = tg->parent->cfs_rq[cpu]->h_load;
-		load *= tg->cfs_rq[cpu]->shares;
+		load *= tg->se[cpu]->load.weight;
 		load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
 	}
 
@@ -1658,21 +1595,16 @@ static int tg_load_down(struct task_grou
 	return 0;
 }
 
-static void update_shares(struct sched_domain *sd)
+static void update_shares(long cpu)
 {
-	s64 elapsed;
-	u64 now;
-
 	if (root_task_group_empty())
 		return;
 
-	now = local_clock();
-	elapsed = now - sd->last_update;
+	/*
+	 * XXX: replace with an on-demand list
+	 */
 
-	if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
-		sd->last_update = now;
-		walk_tg_tree(tg_nop, tg_shares_up, sd);
-	}
+	walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
 }
 
 static void update_h_load(long cpu)
@@ -1682,7 +1614,7 @@ static void update_h_load(long cpu)
 
 #else
 
-static inline void update_shares(struct sched_domain *sd)
+static inline void update_shares(int cpu)
 {
 }
 
@@ -1807,15 +1739,6 @@ static void double_rq_unlock(struct rq *
 
 #endif
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
-{
-#ifdef CONFIG_SMP
-	cfs_rq->shares = shares;
-#endif
-}
-#endif
-
 static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
@@ -5404,7 +5327,6 @@ static void update_sysctl(void)
 	SET_SYSCTL(sched_min_granularity);
 	SET_SYSCTL(sched_latency);
 	SET_SYSCTL(sched_wakeup_granularity);
-	SET_SYSCTL(sched_shares_ratelimit);
 #undef SET_SYSCTL
 }
 
@@ -7721,8 +7643,7 @@ static void init_tg_cfs_entry(struct tas
 		se->cfs_rq = parent->my_q;
 
 	se->my_q = cfs_rq;
-	se->load.weight = tg->shares;
-	se->load.inv_weight = 0;
+	update_load_set(&se->load, tg->shares);
 	se->parent = parent;
 }
 #endif
@@ -7815,10 +7736,6 @@ void __init sched_init(void)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
-#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
-	update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
-					    __alignof__(unsigned long));
-#endif
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -8386,8 +8303,7 @@ static void __set_se_shares(struct sched
 	if (on_rq)
 		dequeue_entity(cfs_rq, se, 0);
 
-	se->load.weight = shares;
-	se->load.inv_weight = 0;
+	update_load_set(&se->load, shares);
 
 	if (on_rq)
 		enqueue_entity(cfs_rq, se, 0);
@@ -8444,7 +8360,6 @@ int sched_group_set_shares(struct task_g
 		/*
 		 * force a rebalance
 		 */
-		cfs_rq_set_shares(tg->cfs_rq[i], 0);
 		set_se_shares(tg->se[i], shares);
 	}
 
Index: kernel/sched_debug.c
===================================================================
--- kernel/sched_debug.c.orig
+++ kernel/sched_debug.c
@@ -202,15 +202,22 @@ void print_cfs_rq(struct seq_file *m, in
 	spread0 = min_vruntime - rq0_min_vruntime;
 	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
 			SPLIT_NS(spread0));
-	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
-	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
-
 	SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
 			cfs_rq->nr_spread_over);
+	SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
+	SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #ifdef CONFIG_SMP
-	SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
+			SPLIT_NS(cfs_rq->load_avg));
+	SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
+			SPLIT_NS(cfs_rq->load_period));
+	SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
+			cfs_rq->load_contribution);
+	SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
+			atomic_read(&tg->load_weight));
 #endif
+
 	print_cfs_group_stats(m, cpu, cfs_rq->tg);
 #endif
 }
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
 	WRT_SYSCTL(sched_min_granularity);
 	WRT_SYSCTL(sched_latency);
 	WRT_SYSCTL(sched_wakeup_granularity);
-	WRT_SYSCTL(sched_shares_ratelimit);
 #undef WRT_SYSCTL
 
 	return 0;
@@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
 		list_add(&se->group_node, &cfs_rq->tasks);
 	}
 	cfs_rq->nr_running++;
-	se->on_rq = 1;
 }
 
 static void
@@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
 		list_del_init(&se->group_node);
 	}
 	cfs_rq->nr_running--;
-	se->on_rq = 0;
 }
 
+#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
+static void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+	u64 period = sched_avg_period();
+	u64 now, delta;
+
+	if (!cfs_rq)
+		return;
+
+	now = rq_of(cfs_rq)->clock;
+	delta = now - cfs_rq->load_stamp;
+
+	cfs_rq->load_stamp = now;
+	cfs_rq->load_period += delta;
+	cfs_rq->load_avg += delta * cfs_rq->load.weight;
+
+	while (cfs_rq->load_period > period) {
+		/*
+		 * Inline assembly required to prevent the compiler
+		 * optimising this loop into a divmod call.
+		 * See __iter_div_u64_rem() for another example of this.
+		 */
+		asm("" : "+rm" (cfs_rq->load_period));
+		cfs_rq->load_period /= 2;
+		cfs_rq->load_avg /= 2;
+	}
+}
+
+static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
+			    unsigned long weight)
+{
+	if (se->on_rq)
+		account_entity_dequeue(cfs_rq, se);
+
+	update_load_set(&se->load, weight);
+
+	if (se->on_rq)
+		account_entity_enqueue(cfs_rq, se);
+}
+
+static void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+	struct task_group *tg;
+	struct sched_entity *se;
+	long load_weight, load, shares;
+
+	if (!cfs_rq)
+		return;
+
+	tg = cfs_rq->tg;
+	se = tg->se[cpu_of(rq_of(cfs_rq))];
+	if (!se)
+		return;
+
+	load = cfs_rq->load.weight;
+
+	load_weight = atomic_read(&tg->load_weight);
+	load_weight -= cfs_rq->load_contribution;
+	load_weight += load;
+
+	shares = (tg->shares * load);
+	if (load_weight)
+		shares /= load_weight;
+
+	if (shares < MIN_SHARES)
+		shares = MIN_SHARES;
+	if (shares > tg->shares)
+		shares = tg->shares;
+
+	reweight_entity(cfs_rq_of(se), se, shares);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	update_cfs_load(cfs_rq);
 	account_entity_enqueue(cfs_rq, se);
+	update_cfs_shares(cfs_rq_of(se));
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -782,6 +862,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	check_spread(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
+	se->on_rq = 1;
 }
 
 static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -825,8 +906,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
+	se->on_rq = 0;
+	update_cfs_load(cfs_rq);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
+	update_cfs_shares(cfs_rq_of(se));
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1055,6 +1139,13 @@ enqueue_task_fair(struct rq *rq, struct 
 		flags = ENQUEUE_WAKEUP;
 	}
 
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq);
+	}
+
 	hrtick_update(rq);
 }
 
@@ -1071,12 +1162,20 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight)
 			break;
 		flags |= DEQUEUE_SLEEP;
 	}
 
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq);
+	}
+
 	hrtick_update(rq);
 }
 
@@ -1143,51 +1242,20 @@ static void task_waking_fair(struct rq *
  * Adding load to a group doesn't make a group heavier, but can cause movement
  * of group shares between cpus. Assuming the shares were perfectly aligned one
  * can calculate the shift in shares.
- *
- * The problem is that perfectly aligning the shares is rather expensive, hence
- * we try to avoid doing that too often - see update_shares(), which ratelimits
- * this change.
- *
- * We compensate this by not only taking the current delta into account, but
- * also considering the delta between when the shares were last adjusted and
- * now.
- *
- * We still saw a performance dip, some tracing learned us that between
- * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
- * significantly. Therefore try to bias the error in direction of failing
- * the affine wakeup.
- *
  */
-static long effective_load(struct task_group *tg, int cpu,
-		long wl, long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
 	if (!tg->parent)
 		return wl;
 
-	/*
-	 * By not taking the decrease of shares on the other cpu into
-	 * account our error leans towards reducing the affine wakeups.
-	 */
-	if (!wl && sched_feat(ASYM_EFF_LOAD))
-		return wl;
-
 	for_each_sched_entity(se) {
 		long S, rw, s, a, b;
-		long more_w;
-
-		/*
-		 * Instead of using this increment, also add the difference
-		 * between when the shares were last updated and now.
-		 */
-		more_w = se->my_q->load.weight - se->my_q->rq_weight;
-		wl += more_w;
-		wg += more_w;
 
-		S = se->my_q->tg->shares;
-		s = se->my_q->shares;
-		rw = se->my_q->rq_weight;
+		S = tg->shares;
+		s = se->load.weight;
+		rw = se->my_q->load.weight;
 
 		a = S*(rw + wl);
 		b = S*rw + s*wg;
@@ -1508,23 +1576,6 @@ select_task_rq_fair(struct rq *rq, struc
 			sd = tmp;
 	}
 
-#ifdef CONFIG_FAIR_GROUP_SCHED
-	if (sched_feat(LB_SHARES_UPDATE)) {
-		/*
-		 * Pick the largest domain to update shares over
-		 */
-		tmp = sd;
-		if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
-			tmp = affine_sd;
-
-		if (tmp) {
-			raw_spin_unlock(&rq->lock);
-			update_shares(tmp);
-			raw_spin_lock(&rq->lock);
-		}
-	}
-#endif
-
 	if (affine_sd) {
 		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
 			return select_idle_sibling(p, cpu);
@@ -2977,7 +3028,6 @@ static int load_balance(int this_cpu, st
 	schedstat_inc(sd, lb_count[idle]);
 
 redo:
-	update_shares(sd);
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   cpus, balance);
 
@@ -3119,8 +3169,6 @@ out_one_pinned:
 	else
 		ld_moved = 0;
 out:
-	if (ld_moved)
-		update_shares(sd);
 	return ld_moved;
 }
 
@@ -3514,6 +3562,8 @@ static void rebalance_domains(int cpu, e
 	int update_next_balance = 0;
 	int need_serialize;
 
+	update_shares(cpu);
+
 	for_each_domain(cpu, sd) {
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
Index: kernel/sched_features.h
===================================================================
--- kernel/sched_features.h.orig
+++ kernel/sched_features.h
@@ -52,8 +52,6 @@ SCHED_FEAT(ARCH_POWER, 0)
 SCHED_FEAT(HRTICK, 0)
 SCHED_FEAT(DOUBLE_TICK, 0)
 SCHED_FEAT(LB_BIAS, 1)
-SCHED_FEAT(LB_SHARES_UPDATE, 1)
-SCHED_FEAT(ASYM_EFF_LOAD, 1)
 
 /*
  * Spin-wait on mutex acquisition when the mutex owner is running on
Index: kernel/sysctl.c
===================================================================
--- kernel/sysctl.c.orig
+++ kernel/sysctl.c
@@ -307,15 +307,6 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &max_wakeup_granularity_ns,
 	},
 	{
-		.procname	= "sched_shares_ratelimit",
-		.data		= &sysctl_sched_shares_ratelimit,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= sched_proc_update_handler,
-		.extra1		= &min_sched_shares_ratelimit,
-		.extra2		= &max_sched_shares_ratelimit,
-	},
-	{
 		.procname	= "sched_tunable_scaling",
 		.data		= &sysctl_sched_tunable_scaling,
 		.maxlen		= sizeof(enum sched_tunable_scaling),
@@ -325,14 +316,6 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &max_sched_tunable_scaling,
 	},
 	{
-		.procname	= "sched_shares_thresh",
-		.data		= &sysctl_sched_shares_thresh,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero,
-	},
-	{
 		.procname	= "sched_migration_cost",
 		.data		= &sysctl_sched_migration_cost,
 		.maxlen		= sizeof(unsigned int),

-- 



* [RFC tg_shares_up improvements - v1 02/12] sched: on-demand (active) cfs_rq list
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 03/12] sched: make tg_shares_up() walk on-demand pjt
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-ondemand.patch --]
[-- Type: text/plain, Size: 12195 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make certain load-balance actions scale with the number of active cgroups
instead of the number of existing cgroups.

This makes wakeup/sleep paths more expensive, but is a win for systems
where the vast majority of existing cgroups are idle.
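
The gist: a group's cfs_rq only sits on rq->leaf_cfs_rq_list while it is
active, roughly (see the enqueue_entity()/update_cfs_load() hooks below):

    if (cfs_rq->nr_running == 1)        /* first entity queued on this cpu */
        list_add_leaf_cfs_rq(cfs_rq);
    ...
    if (lb && !cfs_rq->nr_running &&    /* idle and load_avg decayed away */
        cfs_rq->load_avg < (period / 8))
        list_del_leaf_cfs_rq(cfs_rq);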

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   86 +++++++++-------------------------------------------
 kernel/sched_fair.c |   46 ++++++++++++++++++++++++---
 kernel/sched_rt.c   |   24 ++++++++++++++
 3 files changed, 79 insertions(+), 77 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -344,6 +344,7 @@ struct cfs_rq {
 	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
 	 * list is used during load balance.
 	 */
+	int on_list;
 	struct list_head leaf_cfs_rq_list;
 	struct task_group *tg;	/* group that "owns" this runqueue */
 
@@ -1530,7 +1531,7 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
 static void update_cfs_shares(struct cfs_rq *cfs_rq);
 
 /*
@@ -1553,7 +1554,7 @@ static int tg_shares_up(struct task_grou
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 1);
 
 	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
@@ -7622,15 +7623,13 @@ static void init_rt_rq(struct rt_rq *rt_
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
-				struct sched_entity *se, int cpu, int add,
+				struct sched_entity *se, int cpu,
 				struct sched_entity *parent)
 {
 	struct rq *rq = cpu_rq(cpu);
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
-	if (add)
-		list_add(&cfs_rq->leaf_cfs_rq_list, &rq->leaf_cfs_rq_list);
 
 	tg->se[cpu] = se;
 	/* se could be NULL for init_task_group */
@@ -7650,7 +7649,7 @@ static void init_tg_cfs_entry(struct tas
 
 #ifdef CONFIG_RT_GROUP_SCHED
 static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
-		struct sched_rt_entity *rt_se, int cpu, int add,
+		struct sched_rt_entity *rt_se, int cpu,
 		struct sched_rt_entity *parent)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -7659,8 +7658,6 @@ static void init_tg_rt_entry(struct task
 	init_rt_rq(rt_rq, rq);
 	rt_rq->tg = tg;
 	rt_rq->rt_runtime = tg->rt_bandwidth.rt_runtime;
-	if (add)
-		list_add(&rt_rq->leaf_rt_rq_list, &rq->leaf_rt_rq_list);
 
 	tg->rt_se[cpu] = rt_se;
 	if (!rt_se)
@@ -7769,7 +7766,7 @@ void __init sched_init(void)
 		 * We achieve this by letting init_task_group's tasks sit
 		 * directly in rq->cfs (i.e init_task_group->se[] = NULL).
 		 */
-		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, 1, NULL);
+		init_tg_cfs_entry(&init_task_group, &rq->cfs, NULL, i, NULL);
 #endif
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -7777,7 +7774,7 @@ void __init sched_init(void)
 #ifdef CONFIG_RT_GROUP_SCHED
 		INIT_LIST_HEAD(&rq->leaf_rt_rq_list);
 #ifdef CONFIG_CGROUP_SCHED
-		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, 1, NULL);
+		init_tg_rt_entry(&init_task_group, &rq->rt, NULL, i, NULL);
 #endif
 #endif
 
@@ -8053,7 +8050,7 @@ int alloc_fair_sched_group(struct task_g
 		if (!se)
 			goto err_free_rq;
 
-		init_tg_cfs_entry(tg, cfs_rq, se, i, 0, parent->se[i]);
+		init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
 	}
 
 	return 1;
@@ -8084,14 +8081,6 @@ int alloc_fair_sched_group(struct task_g
 {
 	return 1;
 }
-
-static inline void register_fair_sched_group(struct task_group *tg, int cpu)
-{
-}
-
-static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
-{
-}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8143,7 +8132,7 @@ int alloc_rt_sched_group(struct task_gro
 		if (!rt_se)
 			goto err_free_rq;
 
-		init_tg_rt_entry(tg, rt_rq, rt_se, i, 0, parent->rt_se[i]);
+		init_tg_rt_entry(tg, rt_rq, rt_se, i, parent->rt_se[i]);
 	}
 
 	return 1;
@@ -8153,17 +8142,6 @@ int alloc_rt_sched_group(struct task_gro
  err:
 	return 0;
 }
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
-	list_add_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list,
-			&cpu_rq(cpu)->leaf_rt_rq_list);
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
-	list_del_rcu(&tg->rt_rq[cpu]->leaf_rt_rq_list);
-}
 #else /* !CONFIG_RT_GROUP_SCHED */
 static inline void free_rt_sched_group(struct task_group *tg)
 {
@@ -8174,14 +8152,6 @@ int alloc_rt_sched_group(struct task_gro
 {
 	return 1;
 }
-
-static inline void register_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
-
-static inline void unregister_rt_sched_group(struct task_group *tg, int cpu)
-{
-}
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
@@ -8197,7 +8167,6 @@ struct task_group *sched_create_group(st
 {
 	struct task_group *tg;
 	unsigned long flags;
-	int i;
 
 	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 	if (!tg)
@@ -8210,10 +8179,6 @@ struct task_group *sched_create_group(st
 		goto err;
 
 	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i) {
-		register_fair_sched_group(tg, i);
-		register_rt_sched_group(tg, i);
-	}
 	list_add_rcu(&tg->list, &task_groups);
 
 	WARN_ON(!parent); /* root should already exist */
@@ -8244,10 +8209,12 @@ void sched_destroy_group(struct task_gro
 	int i;
 
 	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i) {
-		unregister_fair_sched_group(tg, i);
-		unregister_rt_sched_group(tg, i);
-	}
+	/*
+	 * XXX should not be a race against enqueue, even without rq->lock
+	 * because only empty groups can be destroyed.
+	 */
+	for_each_possible_cpu(i)
+		list_del_leaf_cfs_rq(tg->cfs_rq[i]);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
@@ -8325,7 +8292,6 @@ static DEFINE_MUTEX(shares_mutex);
 int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
-	unsigned long flags;
 
 	/*
 	 * We can't change the weight of the root cgroup.
@@ -8342,19 +8308,6 @@ int sched_group_set_shares(struct task_g
 	if (tg->shares == shares)
 		goto done;
 
-	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i)
-		unregister_fair_sched_group(tg, i);
-	list_del_rcu(&tg->siblings);
-	spin_unlock_irqrestore(&task_group_lock, flags);
-
-	/* wait for any ongoing reference to this group to finish */
-	synchronize_sched();
-
-	/*
-	 * Now we are free to modify the group's share on each cpu
-	 * w/o tripping rebalance_share or load_balance_fair.
-	 */
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
 		/*
@@ -8363,15 +8316,6 @@ int sched_group_set_shares(struct task_g
 		set_se_shares(tg->se[i], shares);
 	}
 
-	/*
-	 * Enable load balance activity on this group, by inserting it back on
-	 * each cpu's rq->leaf_cfs_rq_list.
-	 */
-	spin_lock_irqsave(&task_group_lock, flags);
-	for_each_possible_cpu(i)
-		register_fair_sched_group(tg, i);
-	list_add_rcu(&tg->siblings, &tg->parent->children);
-	spin_unlock_irqrestore(&task_group_lock, flags);
 done:
 	mutex_unlock(&shares_mutex);
 	return 0;
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -143,6 +143,24 @@ static inline struct cfs_rq *cpu_cfs_rq(
 	return cfs_rq->tg->cfs_rq[this_cpu];
 }
 
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	if (!cfs_rq->on_list) {
+		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+
+		cfs_rq->on_list = 1;
+	}
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->on_list) {
+		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
+		cfs_rq->on_list = 0;
+	}
+}
+
 /* Iterate thr' all leaf cfs_rq's on a runqueue */
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
@@ -246,6 +264,14 @@ static inline struct cfs_rq *cpu_cfs_rq(
 	return &cpu_rq(this_cpu)->cfs;
 }
 
+static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
+static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
+{
+}
+
 #define for_each_leaf_cfs_rq(rq, cfs_rq) \
 		for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
 
@@ -648,7 +674,7 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
 {
 	u64 period = sched_avg_period();
 	u64 now, delta;
@@ -673,6 +699,11 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_period /= 2;
 		cfs_rq->load_avg /= 2;
 	}
+
+	if (lb && !cfs_rq->nr_running) {
+		if (cfs_rq->load_avg < (period / 8))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 }
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
@@ -719,7 +750,7 @@ static void update_cfs_shares(struct cfs
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq)
+static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
 {
 }
 
@@ -849,7 +880,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(group_cfs_rq(se));
 
@@ -863,6 +894,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__enqueue_entity(cfs_rq, se);
 	se->on_rq = 1;
+
+	if (cfs_rq->nr_running == 1)
+		list_add_leaf_cfs_rq(cfs_rq);
 }
 
 static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -907,7 +941,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(group_cfs_rq(se));
@@ -1142,7 +1176,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = group_cfs_rq(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
@@ -1172,7 +1206,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = group_cfs_rq(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq);
 	}
 
Index: kernel/sched_rt.c
===================================================================
--- kernel/sched_rt.c.orig
+++ kernel/sched_rt.c
@@ -183,6 +183,17 @@ static inline u64 sched_rt_period(struct
 	return ktime_to_ns(rt_rq->tg->rt_bandwidth.rt_period);
 }
 
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+	list_add_rcu(&rt_rq->leaf_rt_rq_list,
+			&rq_of_rt_rq(rt_rq)->leaf_rt_rq_list);
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+	list_del_rcu(&rt_rq->leaf_rt_rq_list);
+}
+
 #define for_each_leaf_rt_rq(rt_rq, rq) \
 	list_for_each_entry_rcu(rt_rq, &rq->leaf_rt_rq_list, leaf_rt_rq_list)
 
@@ -276,6 +287,14 @@ static inline u64 sched_rt_period(struct
 	return ktime_to_ns(def_rt_bandwidth.rt_period);
 }
 
+static inline void list_add_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
+static inline void list_del_leaf_rt_rq(struct rt_rq *rt_rq)
+{
+}
+
 #define for_each_leaf_rt_rq(rt_rq, rq) \
 	for (rt_rq = &rq->rt; rt_rq; rt_rq = NULL)
 
@@ -825,6 +844,9 @@ static void __enqueue_rt_entity(struct s
 	if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
 		return;
 
+	if (!rt_rq->rt_nr_running)
+		list_add_leaf_rt_rq(rt_rq);
+
 	if (head)
 		list_add(&rt_se->run_list, queue);
 	else
@@ -844,6 +866,8 @@ static void __dequeue_rt_entity(struct s
 		__clear_bit(rt_se_prio(rt_se), array->bitmap);
 
 	dec_rt_tasks(rt_se, rt_rq);
+	if (!rt_rq->rt_nr_running)
+		list_del_leaf_rt_rq(rt_rq);
 }
 
 /*

-- 



* [RFC tg_shares_up improvements - v1 03/12] sched: make tg_shares_up() walk on-demand
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 02/12] sched: on-demand (active) cfs_rq list pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 04/12] sched: fix load corruption from update_cfs_shares pjt
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-more-ondemand.patch --]
[-- Type: text/plain, Size: 4421 bytes --]

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

Make tg_shares_up() use the active cgroup list.  This means we cannot do a
strict bottom-up walk of the hierarchy, but assuming it's a very wide tree
with a small number of active groups it should be a win.
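
update_shares() then becomes a walk of the cpu's active list, chasing parents
explicitly since the list itself is not yet hierarchically ordered (patch 06
later removes the explicit parent walk by ordering the list):

    for_each_leaf_cfs_rq(rq, cfs_rq) {
        struct task_group *tg = cfs_rq->tg;

        do {
            tg_shares_up(tg, cpu);
            tg = tg->parent;
        } while (tg);
    }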

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |   67 ----------------------------------------------------
 kernel/sched_fair.c |   58 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+), 67 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -281,13 +281,6 @@ static DEFINE_SPINLOCK(task_group_lock);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-#ifdef CONFIG_SMP
-static int root_task_group_empty(void)
-{
-	return list_empty(&root_task_group.children);
-}
-#endif
-
 # define INIT_TASK_GROUP_LOAD	NICE_0_LOAD
 
 /*
@@ -1531,48 +1524,6 @@ static unsigned long cpu_avg_load_per_ta
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
-static void update_cfs_load(struct cfs_rq *cfs_rq, int lb);
-static void update_cfs_shares(struct cfs_rq *cfs_rq);
-
-/*
- * update tg->load_weight by folding this cpu's load_avg
- */
-static int tg_shares_up(struct task_group *tg, void *data)
-{
-	long load_avg;
-	struct cfs_rq *cfs_rq;
-	unsigned long flags;
-	int cpu = (long)data;
-	struct rq *rq;
-
-	if (!tg->se[cpu])
-		return 0;
-
-	rq = cpu_rq(cpu);
-	cfs_rq = tg->cfs_rq[cpu];
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-
-	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-
-	atomic_add(load_avg, &tg->load_weight);
-	cfs_rq->load_contribution += load_avg;
-
-	/*
-	 * We need to update shares after updating tg->load_weight in
-	 * order to adjust the weight of groups with long running tasks.
-	 */
-	update_cfs_shares(cfs_rq);
-
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-
-	return 0;
-}
-
 /*
  * Compute the cpu's hierarchical load factor for each task group.
  * This needs to be done in a top-down fashion because the load of a child
@@ -1596,29 +1547,11 @@ static int tg_load_down(struct task_grou
 	return 0;
 }
 
-static void update_shares(long cpu)
-{
-	if (root_task_group_empty())
-		return;
-
-	/*
-	 * XXX: replace with an on-demand list
-	 */
-
-	walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
-}
-
 static void update_h_load(long cpu)
 {
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 }
 
-#else
-
-static inline void update_shares(int cpu)
-{
-}
-
 #endif
 
 #ifdef CONFIG_PREEMPT
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -2000,6 +2000,60 @@ out:
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * update tg->load_weight by folding this cpu's load_avg
+ */
+static int tg_shares_up(struct task_group *tg, int cpu)
+{
+	struct cfs_rq *cfs_rq;
+	unsigned long flags;
+	struct rq *rq;
+	long load_avg;
+
+	if (!tg->se[cpu])
+		return 0;
+
+	rq = cpu_rq(cpu);
+	cfs_rq = tg->cfs_rq[cpu];
+
+	raw_spin_lock_irqsave(&rq->lock, flags);
+
+	update_rq_clock(rq);
+	update_cfs_load(cfs_rq, 1);
+
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
+	atomic_add(load_avg, &tg->load_weight);
+	cfs_rq->load_contribution += load_avg;
+
+	/*
+	 * We need to update shares after updating tg->load_weight in
+	 * order to adjust the weight of groups with long running tasks.
+	 */
+	update_cfs_shares(cfs_rq);
+
+	raw_spin_unlock_irqrestore(&rq->lock, flags);
+
+	return 0;
+}
+
+static void update_shares(int cpu)
+{
+	struct cfs_rq *cfs_rq;
+	struct rq *rq = cpu_rq(cpu);
+
+	rcu_read_lock();
+	for_each_leaf_cfs_rq(rq, cfs_rq) {
+		struct task_group *tg = cfs_rq->tg;
+
+		do {
+			tg_shares_up(tg, cpu);
+			tg = tg->parent;
+		} while (tg);
+	}
+	rcu_read_unlock();
+}
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,
@@ -2047,6 +2101,10 @@ load_balance_fair(struct rq *this_rq, in
 	return max_load_move - rem_load_move;
 }
 #else
+static inline void update_shares(int cpu)
+{
+}
+
 static unsigned long
 load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
 		  unsigned long max_load_move,

-- 



* [RFC tg_shares_up improvements - v1 04/12] sched: fix load corruption from update_cfs_shares
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (2 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 03/12] sched: make tg_shares_up() walk on-demand pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization pjt
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-fix-update_cfs_shares.patch --]
[-- Type: text/plain, Size: 3251 bytes --]

As part of enqueue_entity() both a new entity's weight and its contribution
to the queueing cfs_rq/rq are updated.  Since update_cfs_shares() will only
update the queueing weights when the entity is on_rq (which in this case it is
not yet), there's a dependency loop here:

  update_cfs_shares() needs account_entity_enqueue() to update cfs_rq->load.weight
  account_entity_enqueue() needs the updated weight for the queueing cfs_rq's load[*]

Fix this, and avoid spurious dequeues/enqueues, by issuing update_cfs_shares()
as if we had already accounted the enqueue.

This was also previously resulting in rq->load corruption.
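
So enqueue_entity() ends up ordered as below, passing the pending weight so
the share computation behaves as if the enqueue were already accounted
(sketch):

    update_curr(cfs_rq);
    update_cfs_load(cfs_rq, 0);
    update_cfs_shares(cfs_rq_of(se), se->load.weight);    /* pre-accounting */
    account_entity_enqueue(cfs_rq, se);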

[*]: this dependency also exists when using the group cfs_rq with
update_cfs_shares(), as the weight of the enqueued entity changes without the
load being updated.

Signed-off-by: Paul Turner <pjt@google.com>
---
 kernel/sched_fair.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -718,7 +718,7 @@ static void reweight_entity(struct cfs_r
 		account_entity_enqueue(cfs_rq, se);
 }
 
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta)
 {
 	struct task_group *tg;
 	struct sched_entity *se;
@@ -732,7 +732,7 @@ static void update_cfs_shares(struct cfs
 	if (!se)
 		return;
 
-	load = cfs_rq->load.weight;
+	load = cfs_rq->load.weight + weight_delta;
 
 	load_weight = atomic_read(&tg->load_weight);
 	load_weight -= cfs_rq->load_contribution;
@@ -754,7 +754,7 @@ static inline void update_cfs_load(struc
 {
 }
 
-static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+static inline void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -881,8 +881,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_curr(cfs_rq);
 	update_cfs_load(cfs_rq, 0);
+	update_cfs_shares(cfs_rq_of(se), se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq_of(se));
 
 	if (flags & ENQUEUE_WAKEUP) {
 		place_entity(cfs_rq, se, 0);
@@ -944,7 +944,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
-	update_cfs_shares(cfs_rq_of(se));
+	update_cfs_shares(cfs_rq_of(se), 0);
 
 	/*
 	 * Normalize the entity after updating the min_vruntime because the
@@ -1177,7 +1177,7 @@ enqueue_task_fair(struct rq *rq, struct 
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
 	}
 
 	hrtick_update(rq);
@@ -1207,7 +1207,7 @@ static void dequeue_task_fair(struct rq 
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		update_cfs_load(cfs_rq, 0);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
 	}
 
 	hrtick_update(rq);
@@ -2030,7 +2030,7 @@ static int tg_shares_up(struct task_grou
 	 * We need to update shares after updating tg->load_weight in
 	 * order to adjust the weight of groups with long running tasks.
 	 */
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(cfs_rq, 0);
 
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 

-- 



* [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (3 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 04/12] sched: fix load corruption from update_cfs_shares pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-21  9:52   ` Bharata B Rao
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 06/12] sched: hierarchical order on shares update list pjt
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-fix-update_cfs_load.patch --]
[-- Type: text/plain, Size: 4089 bytes --]

Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
the put path since nr_running accounting occurs at deactivation.

It's also not safe to make the removal decision based on load_avg alone, as
this fails with both long averaging periods and low shares.  Resolve this by
truncating the load history once the cfs_rq has been idle for 4 full periods.
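
With the history clipped, the list removal can then key off load_avg directly
(a sketch of the update_cfs_load() bits below):

    /* forget decayed history once the cfs_rq has sat idle for 4 full periods */
    if (cfs_rq->load_stamp > cfs_rq->load_last &&
        now - cfs_rq->load_last > 4 * period) {
        cfs_rq->load_period = 0;
        cfs_rq->load_avg = 0;
    }
    ...
    if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
        list_del_leaf_cfs_rq(cfs_rq);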

Note: the above removal will always occur via update_shares(), since in the
last-task-sleep case that task will still be cfs_rq->curr when
update_cfs_load() is called.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c      |    2 +-
 kernel/sched_fair.c |   33 +++++++++++++++++++++------------
 2 files changed, 22 insertions(+), 13 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -674,10 +674,11 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
+static void update_cfs_load(struct cfs_rq *cfs_rq)
 {
 	u64 period = sched_avg_period();
 	u64 now, delta;
+	unsigned long load = cfs_rq->load.weight;
 
 	if (!cfs_rq)
 		return;
@@ -685,9 +686,19 @@ static void update_cfs_load(struct cfs_r
 	now = rq_of(cfs_rq)->clock;
 	delta = now - cfs_rq->load_stamp;
 
+	/* truncate load history at 4 idle periods */
+	if (cfs_rq->load_stamp > cfs_rq->load_last &&
+	    now - cfs_rq->load_last > 4 * period) {
+		cfs_rq->load_period = 0;
+		cfs_rq->load_avg = 0;
+	}
+
 	cfs_rq->load_stamp = now;
 	cfs_rq->load_period += delta;
-	cfs_rq->load_avg += delta * cfs_rq->load.weight;
+	if (load) {
+		cfs_rq->load_last = now;
+		cfs_rq->load_avg += delta * load;
+	}
 
 	while (cfs_rq->load_period > period) {
 		/*
@@ -700,10 +711,8 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_avg /= 2;
 	}
 
-	if (lb && !cfs_rq->nr_running) {
-		if (cfs_rq->load_avg < (period / 8))
-			list_del_leaf_cfs_rq(cfs_rq);
-	}
+	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
+		list_del_leaf_cfs_rq(cfs_rq);
 }
 
 static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
@@ -750,7 +759,7 @@ static void update_cfs_shares(struct cfs
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_load(struct cfs_rq *cfs_rq, int lb)
+static inline void update_cfs_load(struct cfs_rq *cfs_rq)
 {
 }
 
@@ -880,7 +889,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq);
 	update_cfs_shares(cfs_rq_of(se), se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -941,7 +950,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq, 0);
+	update_cfs_load(cfs_rq);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq_of(se), 0);
@@ -1176,7 +1185,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq, 0);
+		update_cfs_load(cfs_rq);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -1206,7 +1215,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq, 0);
+		update_cfs_load(cfs_rq);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -2019,7 +2028,7 @@ static int tg_shares_up(struct task_grou
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq, 1);
+	update_cfs_load(cfs_rq);
 
 	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -357,7 +357,7 @@ struct cfs_rq {
 
 	u64 load_avg;
 	u64 load_period;
-	u64 load_stamp;
+	u64 load_stamp, load_last;
 
 	unsigned long load_contribution;
 #endif

-- 



* [RFC tg_shares_up improvements - v1 06/12] sched: hierarchical order on shares update list
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (4 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 07/12] sched: add sysctl_sched_shares_window pjt
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-order-ondemand-list.patch --]
[-- Type: text/plain, Size: 2357 bytes --]

Avoid duplicate shares update calls by ensuring children always appear before
parents in rq->leaf_cfs_rq_list.

This allows us to do a single in-order traversal for update_shares().

Since we always enqueue in bottom-up order this reduces to 2 cases:
1) Our parent is already in the list, e.g.

   root
     \
      b
      /\
      c d* (root->b->c already enqueued)

Since d's parent is enqueued we push it to the head of the list, implicitly ahead of b.

2) Our parent does not appear in the list (or we have no parent).
In this case we enqueue at the tail of the list; if our parent is subsequently
enqueued (bottom-up) it will appear to our right by the same rule (see the
sketch below).
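
For the example above the leaf list then reads d, c, b (each child ahead of
its parent), and the add itself reduces to (sketch of list_add_leaf_cfs_rq()
below):

    if (cfs_rq->tg->parent &&
        cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list)
        /* parent already listed: insert at head, in front of it */
        list_add_rcu(&cfs_rq->leaf_cfs_rq_list, &rq_of(cfs_rq)->leaf_cfs_rq_list);
    else
        /* no parent listed (yet): tail, any parent later lands behind us */
        list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list, &rq_of(cfs_rq)->leaf_cfs_rq_list);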

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |   26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -146,8 +146,20 @@ static inline struct cfs_rq *cpu_cfs_rq(
 static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_rq->on_list) {
-		list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+		/*
+		 * Ensure we either appear before our parent (if already
+		 * enqueued) or force our parent to appear after us when it is
+		 * enqueued.  The fact that we always enqueue bottom-up
+		 * reduces this to two cases.
+		 */
+		if (cfs_rq->tg->parent &&
+		    cfs_rq->tg->parent->cfs_rq[cpu_of(rq_of(cfs_rq))]->on_list) {
+			list_add_rcu(&cfs_rq->leaf_cfs_rq_list,
+				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+		} else {
+			list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
 				&rq_of(cfs_rq)->leaf_cfs_rq_list);
+		}
 
 		cfs_rq->on_list = 1;
 	}
@@ -2009,7 +2021,7 @@ out:
 /*
  * update tg->load_weight by folding this cpu's load_avg
  */
-static int tg_shares_up(struct task_group *tg, int cpu)
+static int update_shares_cpu(struct task_group *tg, int cpu)
 {
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
@@ -2049,14 +2061,8 @@ static void update_shares(int cpu)
 	struct rq *rq = cpu_rq(cpu);
 
 	rcu_read_lock();
-	for_each_leaf_cfs_rq(rq, cfs_rq) {
-		struct task_group *tg = cfs_rq->tg;
-
-		do {
-			tg_shares_up(tg, cpu);
-			tg = tg->parent;
-		} while (tg);
-	}
+	for_each_leaf_cfs_rq(rq, cfs_rq)
+		update_shares_cpu(cfs_rq->tg, cpu);
 	rcu_read_unlock();
 }
 

-- 



* [RFC tg_shares_up improvements - v1 07/12] sched: add sysctl_sched_shares_window
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (5 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 06/12] sched: hierarchical order on shares update list pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 08/12] sched: update shares on idle_balance pjt
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-add-shares_window_sysctl.patch --]
[-- Type: text/plain, Size: 2295 bytes --]

Introduce a new sysctl for the shares window and disambiguate it from
sched_time_avg.

A 10ms window appears to be a good compromise between accuracy and performance.
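
For reference, the new knob lands next to the other scheduler sysctls (value
in ns), e.g. assuming the usual proc sysctl mount:

    # cat /proc/sys/kernel/sched_shares_window
    10000000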

Signed-off-by: Paul Turner <pjt@google.com>

---
 include/linux/sched.h |    1 +
 kernel/sched_fair.c   |    8 +++++++-
 kernel/sysctl.c       |    7 +++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -89,6 +89,12 @@ unsigned int normalized_sysctl_sched_wak
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+/*
+ * The usage window over which average use is computed for shares
+ * distribution.
+ */
+unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -688,7 +694,7 @@ account_entity_dequeue(struct cfs_rq *cf
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
 static void update_cfs_load(struct cfs_rq *cfs_rq)
 {
-	u64 period = sched_avg_period();
+	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
Index: include/linux/sched.h
===================================================================
--- include/linux/sched.h.orig
+++ include/linux/sched.h
@@ -1883,6 +1883,7 @@ extern unsigned int sysctl_sched_migrati
 extern unsigned int sysctl_sched_nr_migrate;
 extern unsigned int sysctl_sched_time_avg;
 extern unsigned int sysctl_timer_migration;
+extern unsigned int sysctl_sched_shares_window;
 
 int sched_proc_update_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
Index: kernel/sysctl.c
===================================================================
--- kernel/sysctl.c.orig
+++ kernel/sysctl.c
@@ -337,6 +337,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "sched_shares_window",
+		.data		= &sysctl_sched_shares_window,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
 		.procname	= "timer_migration",
 		.data		= &sysctl_timer_migration,
 		.maxlen		= sizeof(unsigned int),

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC tg_shares_up improvements - v1 08/12] sched: update shares on idle_balance
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (6 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 07/12] sched: add sysctl_sched_shares_window pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 09/12] sched: demand based update_cfs_load() pjt
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-update_shares_on_idle_balance.patch --]
[-- Type: text/plain, Size: 675 bytes --]

Since shares updates are no longer expensive and effectively local, update them
at idle_balance().  This allows us to more quickly redistribute shares to
another cpu when our load becomes idle.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched_fair.c |    1 +
 1 file changed, 1 insertion(+)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -3301,6 +3301,7 @@ static void idle_balance(int this_cpu, s
 	 */
 	raw_spin_unlock(&this_rq->lock);
 
+	update_shares(this_cpu);
 	for_each_domain(this_cpu, sd) {
 		unsigned long interval;
 		int balance = 1;

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC tg_shares_up improvements - v1 09/12] sched: demand based update_cfs_load()
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (7 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 08/12] sched: update shares on idle_balance pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 10/12] sched: allow update_cfs_load to update global load pjt
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-demand_based_update_cfs_load.patch --]
[-- Type: text/plain, Size: 2211 bytes --]

When the system is busy, dilation of rq->next_balance makes lb->update_shares()
insufficiently frequent for threads which don't sleep (no dequeue/enqueue
updates).  Adjust for this by making demand-based updates driven by the
accumulation of enough execution time to wrap our averaging window.

Signed-off-by: Paul Turner <pjt@google.com>

---
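As a rough illustration of the trigger, here's a self-contained userspace
model (WINDOW and the helper names are mine; in the patch the counter is
reset inside update_cfs_load() itself):

#include <stdio.h>

#define WINDOW 10000000ULL      /* 10ms in ns, as sysctl_sched_shares_window */

static unsigned long long unacc_exec_time;
static unsigned int refreshes;

/* charged with each runtime delta, as __update_curr() is */
static void charge_exec(unsigned long long delta_exec)
{
        unacc_exec_time += delta_exec;
        if (unacc_exec_time > WINDOW) {
                /* stand-in for update_cfs_load()/update_cfs_shares() */
                refreshes++;
                unacc_exec_time = 0;
        }
}

int main(void)
{
        int i;

        /* a cpu-bound thread charged every 1ms for 100ms */
        for (i = 0; i < 100; i++)
                charge_exec(1000000ULL);
        printf("refreshes=%u\n", refreshes);    /* ~one per window: 9 */
        return 0;
}

This bounds the staleness of a busy group's shares to roughly one window of
execution time even when a thread never passes through enqueue/dequeue.
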
 kernel/sched.c      |    9 ++++++++-
 kernel/sched_fair.c |   10 ++++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -355,9 +355,16 @@ struct cfs_rq {
 	 */
 	unsigned long h_load;
 
+	/*
+	 * Maintaining per-cpu shares distribution for group scheduling
+	 *
+	 * load_stamp is the last time we updated the load average
+	 * load_last is the last time we updated the load average and saw load
+	 * load_unacc_exec_time is currently unaccounted execution time
+	 */
 	u64 load_avg;
 	u64 load_period;
-	u64 load_stamp, load_last;
+	u64 load_stamp, load_last, load_unacc_exec_time;
 
 	unsigned long load_contribution;
 #endif
Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -538,6 +538,9 @@ static u64 sched_vslice(struct cfs_rq *c
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
+static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta);
+
 /*
  * Update the current task's runtime statistics. Skip current tasks that
  * are not in our scheduling class.
@@ -557,6 +560,12 @@ __update_curr(struct cfs_rq *cfs_rq, str
 
 	curr->vruntime += delta_exec_weighted;
 	update_min_vruntime(cfs_rq);
+
+	cfs_rq->load_unacc_exec_time += delta_exec;
+	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
+		update_cfs_load(cfs_rq);
+		update_cfs_shares(cfs_rq, 0);
+	}
 }
 
 static void update_curr(struct cfs_rq *cfs_rq)
@@ -712,6 +721,7 @@ static void update_cfs_load(struct cfs_r
 	}
 
 	cfs_rq->load_stamp = now;
+	cfs_rq->load_unacc_exec_time = 0;
 	cfs_rq->load_period += delta;
 	if (load) {
 		cfs_rq->load_last = now;

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC tg_shares_up improvements - v1 10/12] sched: allow update_cfs_load to update global load
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (8 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 09/12] sched: demand based update_cfs_load() pjt
@ 2010-10-16  4:43 ` pjt
  2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 11/12] sched: update tg->shares after cpu.shares write pjt
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-global_update_load.patch --]
[-- Type: text/plain, Size: 4070 bytes --]

Refactor the global load updates from update_shares_cpu() so that
update_cfs_load() can update global load when it is more than ~10% out of sync.

The new global_update parameter allows us to force an update regardless of
the error factor, so that we can synchronize w/ update_shares().

Signed-off-by: Paul Turner <pjt@google.com>

---
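(As implemented, the "error factor" is the load_contribution/8 test below,
i.e. ~12.5%.)  A standalone userspace sketch of just that decision, with
illustrative values and names of my own:

#include <stdio.h>
#include <stdlib.h>

static long tg_load_weight = 800;       /* stand-in for the tg's atomic sum */

static void update_contribution(long *contribution, long load_avg, int force)
{
        long delta = load_avg - *contribution;

        /* publish only when forced or more than 1/8th out of sync */
        if (force || labs(delta) > *contribution / 8) {
                tg_load_weight += delta;
                *contribution += delta;
        }
}

int main(void)
{
        long contribution = 800;        /* assume our 800 is already published */

        update_contribution(&contribution, 850, 0);    /* ~6% off: skipped */
        printf("global=%ld local=%ld\n", tg_load_weight, contribution);

        update_contribution(&contribution, 950, 0);    /* ~19% off: published */
        printf("global=%ld local=%ld\n", tg_load_weight, contribution);
        return 0;
}

Skipped deltas aren't lost -- load_avg is recomputed from the running average
each time, so the next large enough excursion publishes the full difference.
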
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++--------------
 1 file changed, 28 insertions(+), 14 deletions(-)

Index: kernel/sched_fair.c
===================================================================
--- kernel/sched_fair.c.orig
+++ kernel/sched_fair.c
@@ -538,7 +538,7 @@ static u64 sched_vslice(struct cfs_rq *c
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-static void update_cfs_load(struct cfs_rq *cfs_rq);
+static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update);
 static void update_cfs_shares(struct cfs_rq *cfs_rq, long weight_delta);
 
 /*
@@ -563,7 +563,7 @@ __update_curr(struct cfs_rq *cfs_rq, str
 
 	cfs_rq->load_unacc_exec_time += delta_exec;
 	if (cfs_rq->load_unacc_exec_time > sysctl_sched_shares_window) {
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 }
@@ -701,7 +701,22 @@ account_entity_dequeue(struct cfs_rq *cf
 }
 
 #if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
-static void update_cfs_load(struct cfs_rq *cfs_rq)
+static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
+					    int global_update)
+{
+	struct task_group *tg = cfs_rq->tg;
+	long load_avg;
+
+	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	load_avg -= cfs_rq->load_contribution;
+
+	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
+		atomic_add(load_avg, &tg->load_weight);
+		cfs_rq->load_contribution += load_avg;
+	}
+}
+
+static void update_cfs_load(struct cfs_rq *cfs_rq, int global_update)
 {
 	u64 period = sysctl_sched_shares_window;
 	u64 now, delta;
@@ -728,6 +743,11 @@ static void update_cfs_load(struct cfs_r
 		cfs_rq->load_avg += delta * load;
 	}
 
+	/* consider updating load contribution on each fold or truncate */
+	if (global_update || cfs_rq->load_period > period
+	    || !cfs_rq->load_period)
+		update_cfs_rq_load_contribution(cfs_rq, global_update);
+
 	while (cfs_rq->load_period > period) {
 		/*
 		 * Inline assembly required to prevent the compiler
@@ -917,7 +937,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	update_cfs_shares(cfs_rq_of(se), se->load.weight);
 	account_entity_enqueue(cfs_rq, se);
 
@@ -978,7 +998,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
 	se->on_rq = 0;
-	update_cfs_load(cfs_rq);
+	update_cfs_load(cfs_rq, 0);
 	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);
 	update_cfs_shares(cfs_rq_of(se), 0);
@@ -1213,7 +1233,7 @@ enqueue_task_fair(struct rq *rq, struct 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -1243,7 +1263,7 @@ static void dequeue_task_fair(struct rq 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
-		update_cfs_load(cfs_rq);
+		update_cfs_load(cfs_rq, 0);
 		update_cfs_shares(cfs_rq, 0);
 	}
 
@@ -2045,7 +2065,6 @@ static int update_shares_cpu(struct task
 	struct cfs_rq *cfs_rq;
 	unsigned long flags;
 	struct rq *rq;
-	long load_avg;
 
 	if (!tg->se[cpu])
 		return 0;
@@ -2056,12 +2075,7 @@ static int update_shares_cpu(struct task
 	raw_spin_lock_irqsave(&rq->lock, flags);
 
 	update_rq_clock(rq);
-	update_cfs_load(cfs_rq);
-
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
-	load_avg -= cfs_rq->load_contribution;
-	atomic_add(load_avg, &tg->load_weight);
-	cfs_rq->load_contribution += load_avg;
+	update_cfs_load(cfs_rq, 1);
 
 	/*
 	 * We need to update shares after updating tg->load_weight in

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC tg_shares_up improvements - v1 11/12] sched: update tg->shares after cpu.shares write
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (9 preceding siblings ...)
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 10/12] sched: allow update_cfs_load to update global load pjt
@ 2010-10-16  4:44 ` pjt
  2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 12/12] debug: export effective shares for analysis versus specified pjt
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-fix_group_set_shares.patch --]
[-- Type: text/plain, Size: 2279 bytes --]

Formerly sched_group_set_shares would force a rebalance by overflowing domain
share sums.  Now that per-cpu averages are maintained, we can set the true value
by issuing an update_cfs_shares() following a tg->shares update.

Also initialize tg se->load to 0 for consistency since we'll now set correct
weights on enqueue.

Signed-off-by: Paul Turner <pjt@google.com>

---
 kernel/sched.c |   42 +++++++++++-------------------------------
 1 file changed, 11 insertions(+), 31 deletions(-)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -7575,7 +7575,7 @@ static void init_tg_cfs_entry(struct tas
 		se->cfs_rq = parent->my_q;
 
 	se->my_q = cfs_rq;
-	update_load_set(&se->load, tg->shares);
+	update_load_set(&se->load, 0);
 	se->parent = parent;
 }
 #endif
@@ -8194,37 +8194,12 @@ void sched_move_task(struct task_struct 
 #endif /* CONFIG_CGROUP_SCHED */
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-static void __set_se_shares(struct sched_entity *se, unsigned long shares)
-{
-	struct cfs_rq *cfs_rq = se->cfs_rq;
-	int on_rq;
-
-	on_rq = se->on_rq;
-	if (on_rq)
-		dequeue_entity(cfs_rq, se, 0);
-
-	update_load_set(&se->load, shares);
-
-	if (on_rq)
-		enqueue_entity(cfs_rq, se, 0);
-}
-
-static void set_se_shares(struct sched_entity *se, unsigned long shares)
-{
-	struct cfs_rq *cfs_rq = se->cfs_rq;
-	struct rq *rq = cfs_rq->rq;
-	unsigned long flags;
-
-	raw_spin_lock_irqsave(&rq->lock, flags);
-	__set_se_shares(se, shares);
-	raw_spin_unlock_irqrestore(&rq->lock, flags);
-}
-
 static DEFINE_MUTEX(shares_mutex);
 
 int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 {
 	int i;
+	unsigned long flags;
 
 	/*
 	 * We can't change the weight of the root cgroup.
@@ -8243,10 +8218,15 @@ int sched_group_set_shares(struct task_g
 
 	tg->shares = shares;
 	for_each_possible_cpu(i) {
-		/*
-		 * force a rebalance
-		 */
-		set_se_shares(tg->se[i], shares);
+		struct rq *rq = cpu_rq(i);
+		struct sched_entity *se;
+
+		se = tg->se[i];
+		/* Propagate contribution to hierarchy */
+		raw_spin_lock_irqsave(&rq->lock, flags);
+		for_each_sched_entity(se)
+			update_cfs_shares(group_cfs_rq(se), 0);
+		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 
 done:

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC tg_shares_up improvements - v1 12/12] debug: export effective shares for analysis versus specified
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (10 preceding siblings ...)
  2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 11/12] sched: update tg->shares after cpu.shares write pjt
@ 2010-10-16  4:44 ` pjt
  2010-10-16 19:46 ` [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: pjt @ 2010-10-16  4:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[-- Attachment #1: sched-tg-debug-add_issued_shares.patch --]
[-- Type: text/plain, Size: 922 bytes --]

---
 kernel/sched.c |   19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

Index: kernel/sched.c
===================================================================
--- kernel/sched.c.orig
+++ kernel/sched.c
@@ -8607,6 +8607,21 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) tg->shares;
 }
+
+static u64 cpu_issued_shares_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 result = 0;
+	int i;
+
+	if (!tg->se[0])
+		return 0;
+
+	for_each_online_cpu(i)
+		result += tg->se[i]->load.weight;
+
+	return result;
+}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -8640,6 +8655,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_shares_read_u64,
 		.write_u64 = cpu_shares_write_u64,
 	},
+	{
+		.name = "issued_shares",
+		.read_u64 = cpu_issued_shares_read_u64,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (11 preceding siblings ...)
  2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 12/12] debug: export effective shares for analysis versus specified pjt
@ 2010-10-16 19:46 ` Peter Zijlstra
  2010-10-21  6:36   ` Paul Turner
  2010-10-17  5:24 ` Balbir Singh
  2010-11-03 18:27 ` Karl Rister
  14 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-10-16 19:46 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao

On Fri, 2010-10-15 at 21:43 -0700, pjt@google.com wrote:
> Hi all,
> 
> Peter previously posted a patchset that attempted to improve the problem of
> task_group share distribution.  This is something that has been a long-time
> pain point for group scheduling.  The existing algorithm considers
> distributions on a per-cpu-per-domain basis and carries a fairly high update
> overhead, especially on larger machines.
> 
> I was previously looking at improving this using Fenwick trees to allow a
> single sum without the exorbitant cost but then Peter's idea above was better :).
> 
> The kernel is that by monitoring the average contribution to load on a
> per-cpu-per-taskgroup basis we can distribute the weight for which we are
> expected to consume.
> 
> This set extends the original posting with a focus on increased fairness and
> reduced convergence (to true average) time.  In particular the case of large
> over-commit in the case of a distributed wake-up is a concern which is now
> fairly well addressed.
> 
> Obviously everything's experimental but it should be stable/fair.

I like what you've done with it, my only worry is 10/12 where you allow
for extra updates to the global state -- I think they should be fairly
limited in number, and I can see the need for the update if we get too
far out of whack, but it is something to look at while testing this
stuff.

> TODO:
> - Validate any RT interaction

I don't think there's anything to worry about there, the only
interaction which there is between this and the rt scheduling classes is
the initial sharing of the load-avg window, but you 'cure' that in 7/12.

(I think that sysctl wants a _us postfix someplace and we thus want some
NSEC_PER_USEC multiplication in there).

> - Continue collecting/analyzing performance and fairness data

Yes please ;-), I'll try and run this on some machines as well.

> - Should the shares period just be the sched_latency?

Interesting idea.. lets keep it a separate sysctl for now for easy
tuning, if things settle down and we're still good in that range we can
consider merging them.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (12 preceding siblings ...)
  2010-10-16 19:46 ` [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution Peter Zijlstra
@ 2010-10-17  5:24 ` Balbir Singh
  2010-10-17  9:38   ` Peter Zijlstra
  2010-11-03 18:27 ` Karl Rister
  14 siblings, 1 reply; 28+ messages in thread
From: Balbir Singh @ 2010-10-17  5:24 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon,
	Bharata B Rao

* pjt@google.com <pjt@google.com> [2010-10-15 21:43:49]:

> Hi all,
> 
> Peter previously posted a patchset that attempted to improve the problem of
> task_group share distribution.  This is something that has been a long-time
> pain point for group scheduling.  The existing algorithm considers
> distributions on a per-cpu-per-domain basis and carries a fairly high update
> overhead, especially on larger machines.
> 
> I was previously looking at improving this using Fenwick trees to allow a
> single sum without the exorbitant cost but then Peter's idea above was better :).
>
> The kernel is that by monitoring the average contribution to load on a
> per-cpu-per-taskgroup basis we can distribute the weight for which we are
> expected to consume.
> 
> This set extends the original posting with a focus on increased fairness and
> reduced convergence (to true average) time.  In particular the case of large
> over-commit in the case of a distributed wake-up is a concern which is now
> fairly well addressed.
> 
> Obviously everything's experimental but it should be stable/fair.
> 
> Some motivation:
> 
> 24 thread intel box, 150 active cgroups, multiple threads/group, load at ~90% (10 second sample):
> tip:
>      2.64%  [k] tg_shares_up <!>
>      0.15%  [k] __set_se_shares
> 
> patched:
>      0.02%  [k] update_cfs_load
>      0.01%  [k] update_cpu_load
>      0.00%  [k] update_cfs_shares
> 
> Some fairness coverage for the above at: http://rs5.risingnet.net/~pjt/patches/shares_data_v1.txt
>

The results do sound interesting, could I recommend that you try to
get 1000 cgroups or so, with half of them as active and half inactive.
I'd also suggest tracking system and user time at the cgroup level.
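
Something along these lines (a userspace sketch, assuming the cpu controller
is mounted at /cgroup/cpu) is enough to create the groups; parking busy loops
in half of them gives the active/inactive split:

#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
        char path[64];
        int i;

        for (i = 0; i < 1000; i++) {
                snprintf(path, sizeof(path), "/cgroup/cpu/burn%d", i);
                if (mkdir(path, 0755) && errno != EEXIST) {
                        perror(path);
                        return 1;
                }
        }
        /* writing a busy task's pid to /cgroup/cpu/burn<N>/tasks for even N
         * would populate half the groups, leaving the other half idle */
        return 0;
}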

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-17  5:24 ` Balbir Singh
@ 2010-10-17  9:38   ` Peter Zijlstra
  2010-10-17 12:09     ` Balbir Singh
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-10-17  9:38 UTC (permalink / raw)
  To: balbir
  Cc: pjt, linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao

On Sun, 2010-10-17 at 10:54 +0530, Balbir Singh wrote:
> 
> The results do sound interesting, could I recommend that you try to
> get 1000 cgroups or so, with half of them as active and half inactive.
> I'd also suggest tracking system and user time at the cgroup level. 

Sure, but do keep in mind that 500 active cgroups will always utterly
suck no matter what you do, it means we do indeed need to iterate 500
cgroups for a lot of scheduler operations.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-17  9:38   ` Peter Zijlstra
@ 2010-10-17 12:09     ` Balbir Singh
  0 siblings, 0 replies; 28+ messages in thread
From: Balbir Singh @ 2010-10-17 12:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pjt, linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao

* Peter Zijlstra <peterz@infradead.org> [2010-10-17 11:38:51]:

> On Sun, 2010-10-17 at 10:54 +0530, Balbir Singh wrote:
> > 
> > The results do sound interesting, could I recommend that you try to
> > get 1000 cgroups or so, with half of them as active and half inactive.
> > I'd also suggest tracking system and user time at the cgroup level. 
> 
> Sure, but do keep in mind that 500 active cgroups will always utterly
> suck no matter what you do, it means we do indeed need to iterate 500
> cgroups for a lot of scheduler operations.
>

I agree in general, but in my case I've seen nasty things even when
the groups were empty. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
@ 2010-10-21  6:04   ` Bharata B Rao
  2010-10-21  6:28     ` Paul Turner
  2010-10-21  8:08   ` Bharata B Rao
       [not found]   ` <AANLkTi=zYAfb_izD15ROxH=C6+zPzX+XEGw7r5UUoAar@mail.gmail.com>
  2 siblings, 1 reply; 28+ messages in thread
From: Bharata B Rao @ 2010-10-21  6:04 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Fri, Oct 15, 2010 at 09:43:50PM -0700, pjt@google.com wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> By tracking a per-cpu load-avg for each cfs_rq and folding it into a
> global task_group load on each tick we can rework tg_shares_up to be
> strictly per-cpu.
> 
> This should improve cpu-cgroup performance for smp systems
> significantly.
> 
> [ Paul: changed to use queueing cfs_rq ]
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Paul Turner <pjt@google.com>
> 
> Index: kernel/sched_fair.c
> ===================================================================
> --- kernel/sched_fair.c.orig
> +++ kernel/sched_fair.c
> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
>  	WRT_SYSCTL(sched_min_granularity);
>  	WRT_SYSCTL(sched_latency);
>  	WRT_SYSCTL(sched_wakeup_granularity);
> -	WRT_SYSCTL(sched_shares_ratelimit);
>  #undef WRT_SYSCTL
> 
>  	return 0;
> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
>  		list_add(&se->group_node, &cfs_rq->tasks);
>  	}
>  	cfs_rq->nr_running++;
> -	se->on_rq = 1;
>  }
> 
>  static void
> @@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
>  		list_del_init(&se->group_node);
>  	}
>  	cfs_rq->nr_running--;
> -	se->on_rq = 0;
>  }
> 
> +#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
> +static void update_cfs_load(struct cfs_rq *cfs_rq)
> +{
> +	u64 period = sched_avg_period();
> +	u64 now, delta;
> +
> +	if (!cfs_rq)
> +		return;
> +
> +	now = rq_of(cfs_rq)->clock;
> +	delta = now - cfs_rq->load_stamp;
> +
> +	cfs_rq->load_stamp = now;
> +	cfs_rq->load_period += delta;
> +	cfs_rq->load_avg += delta * cfs_rq->load.weight;
> +
> +	while (cfs_rq->load_period > period) {
> +		/*
> +		 * Inline assembly required to prevent the compiler
> +		 * optimising this loop into a divmod call.
> +		 * See __iter_div_u64_rem() for another example of this.
> +		 */
> +		asm("" : "+rm" (cfs_rq->load_period));
> +		cfs_rq->load_period /= 2;
> +		cfs_rq->load_avg /= 2;
> +	}
> +}
> +
> +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +			    unsigned long weight)
> +{
> +	if (se->on_rq)
> +		account_entity_dequeue(cfs_rq, se);
> +
> +	update_load_set(&se->load, weight);
> +
> +	if (se->on_rq)
> +		account_entity_enqueue(cfs_rq, se);
> +}
> +
> +static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +{
> +	struct task_group *tg;
> +	struct sched_entity *se;
> +	long load_weight, load, shares;
> +
> +	if (!cfs_rq)
> +		return;
> +
> +	tg = cfs_rq->tg;
> +	se = tg->se[cpu_of(rq_of(cfs_rq))];
> +	if (!se)
> +		return;
> +
> +	load = cfs_rq->load.weight;
> +
> +	load_weight = atomic_read(&tg->load_weight);
> +	load_weight -= cfs_rq->load_contribution;
> +	load_weight += load;
> +
> +	shares = (tg->shares * load);
> +	if (load_weight)
> +		shares /= load_weight;
> +
> +	if (shares < MIN_SHARES)
> +		shares = MIN_SHARES;
> +	if (shares > tg->shares)
> +		shares = tg->shares;
> +
> +	reweight_entity(cfs_rq_of(se), se, shares);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline void update_cfs_load(struct cfs_rq *cfs_rq)
> +{
> +}
> +
> +static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
> +{
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	update_cfs_load(cfs_rq);
>  	account_entity_enqueue(cfs_rq, se);

By placing update_cfs_load() before account_entity_enqueue(), you are
updating cfs_rq->load_avg before actually taking into account the current
load increment due to enqueueing. I see the same in dequeue also. Is there a
reason for this?

> +	update_cfs_shares(cfs_rq_of(se));

Isn't cfs_rq_of(se) the same as the cfs_rq that enqueue_entity() gets
from enqueue_task_fair()?  Same for the dequeue case.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-21  6:04   ` Bharata B Rao
@ 2010-10-21  6:28     ` Paul Turner
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Turner @ 2010-10-21  6:28 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Wed, Oct 20, 2010 at 11:04 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Fri, Oct 15, 2010 at 09:43:50PM -0700, pjt@google.com wrote:
>> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>
>> By tracking a per-cpu load-avg for each cfs_rq and folding it into a
>> global task_group load on each tick we can rework tg_shares_up to be
>> strictly per-cpu.
>>
>> This should improve cpu-cgroup performance for smp systems
>> significantly.
>>
>> [ Paul: changed to use queueing cfs_rq ]
>>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Signed-off-by: Paul Turner <pjt@google.com>
>>
>> Index: kernel/sched_fair.c
>> ===================================================================
>> --- kernel/sched_fair.c.orig
>> +++ kernel/sched_fair.c
>> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
>>       WRT_SYSCTL(sched_min_granularity);
>>       WRT_SYSCTL(sched_latency);
>>       WRT_SYSCTL(sched_wakeup_granularity);
>> -     WRT_SYSCTL(sched_shares_ratelimit);
>>  #undef WRT_SYSCTL
>>
>>       return 0;
>> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
>>               list_add(&se->group_node, &cfs_rq->tasks);
>>       }
>>       cfs_rq->nr_running++;
>> -     se->on_rq = 1;
>>  }
>>
>>  static void
>> @@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
>>               list_del_init(&se->group_node);
>>       }
>>       cfs_rq->nr_running--;
>> -     se->on_rq = 0;
>>  }
>>
>> +#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
>> +static void update_cfs_load(struct cfs_rq *cfs_rq)
>> +{
>> +     u64 period = sched_avg_period();
>> +     u64 now, delta;
>> +
>> +     if (!cfs_rq)
>> +             return;
>> +
>> +     now = rq_of(cfs_rq)->clock;
>> +     delta = now - cfs_rq->load_stamp;
>> +
>> +     cfs_rq->load_stamp = now;
>> +     cfs_rq->load_period += delta;
>> +     cfs_rq->load_avg += delta * cfs_rq->load.weight;
>> +
>> +     while (cfs_rq->load_period > period) {
>> +             /*
>> +              * Inline assembly required to prevent the compiler
>> +              * optimising this loop into a divmod call.
>> +              * See __iter_div_u64_rem() for another example of this.
>> +              */
>> +             asm("" : "+rm" (cfs_rq->load_period));
>> +             cfs_rq->load_period /= 2;
>> +             cfs_rq->load_avg /= 2;
>> +     }
>> +}
>> +
>> +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> +                         unsigned long weight)
>> +{
>> +     if (se->on_rq)
>> +             account_entity_dequeue(cfs_rq, se);
>> +
>> +     update_load_set(&se->load, weight);
>> +
>> +     if (se->on_rq)
>> +             account_entity_enqueue(cfs_rq, se);
>> +}
>> +
>> +static void update_cfs_shares(struct cfs_rq *cfs_rq)
>> +{
>> +     struct task_group *tg;
>> +     struct sched_entity *se;
>> +     long load_weight, load, shares;
>> +
>> +     if (!cfs_rq)
>> +             return;
>> +
>> +     tg = cfs_rq->tg;
>> +     se = tg->se[cpu_of(rq_of(cfs_rq))];
>> +     if (!se)
>> +             return;
>> +
>> +     load = cfs_rq->load.weight;
>> +
>> +     load_weight = atomic_read(&tg->load_weight);
>> +     load_weight -= cfs_rq->load_contribution;
>> +     load_weight += load;
>> +
>> +     shares = (tg->shares * load);
>> +     if (load_weight)
>> +             shares /= load_weight;
>> +
>> +     if (shares < MIN_SHARES)
>> +             shares = MIN_SHARES;
>> +     if (shares > tg->shares)
>> +             shares = tg->shares;
>> +
>> +     reweight_entity(cfs_rq_of(se), se, shares);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline void update_cfs_load(struct cfs_rq *cfs_rq)
>> +{
>> +}
>> +
>> +static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>> +{
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  {
>>  #ifdef CONFIG_SCHEDSTATS
>> @@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +     update_cfs_load(cfs_rq);
>>       account_entity_enqueue(cfs_rq, se);
>
> By placing update_cfs_load() before account_entity_enqueue(), you are
> updating cfs_rq->load_avg before actually taking into account the current
> load increment due to enqueueing. I see the same in dequeue also. Is there a
> reason for this?

Yes -- the update covers the interval spanning the previous update
(tracked with load_stamp) and the present.  This interval occurred
prior to the above weight delta, which will only be meaningful against
the _next_ interval we account.
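
To make that concrete, here's a toy userspace model (field and function
names are mine, not the kernel's) of why the interval has to be closed at
the old weight before the delta lands:

#include <stdio.h>

struct cfs_model {
        double load_stamp;
        double load_sum;        /* integral of weight over time */
        double weight;
};

/* charge the elapsed interval at the weight that was in effect */
static void update_load(struct cfs_model *c, double now)
{
        c->load_sum += (now - c->load_stamp) * c->weight;
        c->load_stamp = now;
}

int main(void)
{
        struct cfs_model c = { 0.0, 0.0, 1024.0 };

        /* enqueue at t=5: close out [0,5) at the old weight first... */
        update_load(&c, 5.0);
        /* ...and only then apply the weight delta for the next interval */
        c.weight += 1024.0;

        update_load(&c, 10.0);
        printf("load_sum=%.0f (5*1024 + 5*2048)\n", c.load_sum);
        return 0;
}

Swapping the two enqueue steps would credit the new weight for the [0,5)
span during which it wasn't queued.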

>
>> +     update_cfs_shares(cfs_rq_of(se));
>
> Isn't cfs_rq_of(se) the same as the cfs_rq that enqueue_entity() gets
> from enqueue_task_fair()?  Same for the dequeue case.
>

Yup.. no need for it, will fix.

Thanks

> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-16 19:46 ` [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution Peter Zijlstra
@ 2010-10-21  6:36   ` Paul Turner
  2010-10-22  0:14     ` Paul Turner
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Turner @ 2010-10-21  6:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao

On Sat, Oct 16, 2010 at 12:46 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-10-15 at 21:43 -0700, pjt@google.com wrote:
>> Hi all,
>>
>> Peter previously posted a patchset that attempted to improve the problem of
>> task_group share distribution.  This is something that has been a long-time
>> pain point for group scheduling.  The existing algorithm considers
>> distributions on a per-cpu-per-domain basis and carries a fairly high update
>> overhead, especially on larger machines.
>>
>> I was previously looking at improving this using Fenwick trees to allow a
>> single sum without the exorbitant cost but then Peter's idea above was better :).
>>
>> The kernel is that by monitoring the average contribution to load on a
>> per-cpu-per-taskgroup basis we can distribute the weight for which we are
>> expected to consume.
>>
>> This set extends the original posting with a focus on increased fairness and
>> reduced convergence (to true average) time.  In particular the case of large
>> over-commit in the case of a distributed wake-up is a concern which is now
>> fairly well addressed.
>>
>> Obviously everything's experimental but it should be stable/fair.
>
> I like what you've done with it, my only worry is 10/12 where you allow
> for extra updates to the global state -- I think they should be fairly
> limited in number, and I can see the need for the update if we get too
> far out of whack, but it is something to look at while testing this
> stuff.
>

So my original answer here was to only update when there was load and
it was > n% delta, which stops 1 thread waking up and sleeping from
thrashing it, but the 2 thread case is just as obviously broken for
any n.  It needs a rate limit but I'm sort of loath to introduce
_another_ set of timestamps.  I don't suppose there's much harm in
doing so, though, and I don't think it's going to be clean to overload
one of the existing ones, so perhaps another counter is the answer.
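
To see why no n works there, a quick userspace sketch (toy values; the
real average moves more gradually, but with two threads it never returns
to zero and still swings by ~50-100% of itself on every wakeup):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        long contribution = 0, updates = 0;
        int i, n_pct = 40;      /* a generous threshold */

        for (i = 0; i < 10; i++) {
                /* two threads: one always on the rq, so the average
                 * swings between one and two tasks' worth of weight */
                long load_avg = (i & 1) ? 2048 : 1024;
                long delta = load_avg - contribution;

                /* 'update only when there is load and > n% delta' */
                if (load_avg && labs(delta) * 100 > (long)n_pct * contribution) {
                        contribution += delta;
                        updates++;
                }
        }
        printf("updates=%ld of 10\n", updates); /* prints 10 of 10 */
        return 0;
}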

I'll make sure this is addressed in v2.

>> TODO:
>> - Validate any RT interaction
>
> I don't think there's anything to worry about there, the only
> interaction which there is between this and the rt scheduling classes is
> the initial sharing of the load-avg window, but you 'cure' that in 7/12.
>
> (I think that sysctl wants a _us postfix someplace and we thus want some
> NSEC_PER_USEC multiplication in there).
>

Yes -- updated, thanks.

>> - Continue collecting/analyzing performance and fairness data
>
> Yes please ;-), I'll try and run this on some machines as well.
>
>> - Should the shares period just be the sched_latency?
>
> Interesting idea.. lets keep it a separate sysctl for now for easy
> tuning, if things settle down and we're still good in that range we can
> consider merging them.
>
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
  2010-10-21  6:04   ` Bharata B Rao
@ 2010-10-21  8:08   ` Bharata B Rao
  2010-10-21  8:38     ` Paul Turner
  2010-10-21  9:08     ` Peter Zijlstra
       [not found]   ` <AANLkTi=zYAfb_izD15ROxH=C6+zPzX+XEGw7r5UUoAar@mail.gmail.com>
  2 siblings, 2 replies; 28+ messages in thread
From: Bharata B Rao @ 2010-10-21  8:08 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Fri, Oct 15, 2010 at 09:43:50PM -0700, pjt@google.com wrote:
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> --- kernel/sched_fair.c.orig
> +++ kernel/sched_fair.c
> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
>  	WRT_SYSCTL(sched_min_granularity);
>  	WRT_SYSCTL(sched_latency);
>  	WRT_SYSCTL(sched_wakeup_granularity);
> -	WRT_SYSCTL(sched_shares_ratelimit);
>  #undef WRT_SYSCTL
> 
>  	return 0;
> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
>  		list_add(&se->group_node, &cfs_rq->tasks);
>  	}
>  	cfs_rq->nr_running++;
> -	se->on_rq = 1;
>  }
> 
>  static void
> @@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
>  		list_del_init(&se->group_node);
>  	}
>  	cfs_rq->nr_running--;
> -	se->on_rq = 0;
>  }
> 
<snip>
> @@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	update_cfs_load(cfs_rq);
>  	account_entity_enqueue(cfs_rq, se);
> +	update_cfs_shares(cfs_rq_of(se));
> 
>  	if (flags & ENQUEUE_WAKEUP) {
>  		place_entity(cfs_rq, se, 0);
> @@ -782,6 +862,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	check_spread(cfs_rq, se);
>  	if (se != cfs_rq->curr)
>  		__enqueue_entity(cfs_rq, se);
> +	se->on_rq = 1;
>  }
> 
>  static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -825,8 +906,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> 
>  	if (se != cfs_rq->curr)
>  		__dequeue_entity(cfs_rq, se);
> +	se->on_rq = 0;

Since setting and un-setting of se->on_rq is completely handled within
sched_fair.c, we can remove the redundant setting and un-setting of
se->on_rq from sched.c:[en]dequeue_task. Even w/o this patch, they were
redundant I suppose.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-21  8:08   ` Bharata B Rao
@ 2010-10-21  8:38     ` Paul Turner
  2010-10-21  9:08     ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Paul Turner @ 2010-10-21  8:38 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Thu, Oct 21, 2010 at 1:08 AM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Fri, Oct 15, 2010 at 09:43:50PM -0700, pjt@google.com wrote:
>> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> --- kernel/sched_fair.c.orig
>> +++ kernel/sched_fair.c
>> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
>>       WRT_SYSCTL(sched_min_granularity);
>>       WRT_SYSCTL(sched_latency);
>>       WRT_SYSCTL(sched_wakeup_granularity);
>> -     WRT_SYSCTL(sched_shares_ratelimit);
>>  #undef WRT_SYSCTL
>>
>>       return 0;
>> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
>>               list_add(&se->group_node, &cfs_rq->tasks);
>>       }
>>       cfs_rq->nr_running++;
>> -     se->on_rq = 1;
>>  }
>>
>>  static void
>> @@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
>>               list_del_init(&se->group_node);
>>       }
>>       cfs_rq->nr_running--;
>> -     se->on_rq = 0;
>>  }
>>
> <snip>
>> @@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +     update_cfs_load(cfs_rq);
>>       account_entity_enqueue(cfs_rq, se);
>> +     update_cfs_shares(cfs_rq_of(se));
>>
>>       if (flags & ENQUEUE_WAKEUP) {
>>               place_entity(cfs_rq, se, 0);
>> @@ -782,6 +862,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>       check_spread(cfs_rq, se);
>>       if (se != cfs_rq->curr)
>>               __enqueue_entity(cfs_rq, se);
>> +     se->on_rq = 1;
>>  }
>>
>>  static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> @@ -825,8 +906,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>>
>>       if (se != cfs_rq->curr)
>>               __dequeue_entity(cfs_rq, se);
>> +     se->on_rq = 0;
>
> Since setting and un-setting of se->on_rq is completely handled within
> sched_fair.c, we can remove the redundant setting and un-setting of
> se->on_rq from sched.c:[en]dequeue_task. Even w/o this patch, they were
> redundant I suppose.

Mmmm I think SCHED_RT (currently) depends on this; it could probably
be cleaned up in a separate patch.

>
> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
  2010-10-21  8:08   ` Bharata B Rao
  2010-10-21  8:38     ` Paul Turner
@ 2010-10-21  9:08     ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-10-21  9:08 UTC (permalink / raw)
  To: bharata
  Cc: pjt, linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon

On Thu, 2010-10-21 at 13:38 +0530, Bharata B Rao wrote:
> > +     se->on_rq = 0;
> 
> Since setting and un-setting of se->on_rq is completely handled within
> sched_fair.c, we can remove the redundant setting and un-setting of
> se->on_rq from sched.c:[en]dequeue_task. Even w/o this patch, they were
> redundant I suppose. 

sadly not, sched_rt.c also uses them. It really shouldn't live in
sched_entity but in a common set.

I've long wanted to rework all that into something like:

 struct sched_entity {
	/* common fields */
	union {
		struct sched_fair_entity fair;
		struct sched_rt_entity   rt;
	};
 };

But that never quite happened.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization
  2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization pjt
@ 2010-10-21  9:52   ` Bharata B Rao
  2010-10-21 18:25     ` Paul Turner
  0 siblings, 1 reply; 28+ messages in thread
From: Bharata B Rao @ 2010-10-21  9:52 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Fri, Oct 15, 2010 at 09:43:54PM -0700, pjt@google.com wrote:
> Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
> the put path since nr_running accounting occurs at deactivation.
> 
> It's also not safe to make the removal decision based on load_avg as this fails
> with both high periods and low shares.  Resolve this by clipping history at 8
> foldings.
> 

Did you mean 4 foldings (and not 8) above, since I see you are truncating
the load history at 4 idle periods?


> @@ -685,9 +686,19 @@ static void update_cfs_load(struct cfs_r
>  	now = rq_of(cfs_rq)->clock;
>  	delta = now - cfs_rq->load_stamp;
> 
> +	/* truncate load history at 4 idle periods */
> +	if (cfs_rq->load_stamp > cfs_rq->load_last &&
> +	    now - cfs_rq->load_last > 4 * period) {
> +		cfs_rq->load_period = 0;
> +		cfs_rq->load_avg = 0;
> +	}
> +
>  	cfs_rq->load_stamp = now;
>  	cfs_rq->load_period += delta;
> -	cfs_rq->load_avg += delta * cfs_rq->load.weight;
> +	if (load) {
> +		cfs_rq->load_last = now;
> +		cfs_rq->load_avg += delta * load;
> +	}
> 
>  	while (cfs_rq->load_period > period) {
>  		/*
> @@ -700,10 +711,8 @@ static void update_cfs_load(struct cfs_r
>  		cfs_rq->load_avg /= 2;
>  	}
> 
> -	if (lb && !cfs_rq->nr_running) {
> -		if (cfs_rq->load_avg < (period / 8))
> -			list_del_leaf_cfs_rq(cfs_rq);
> -	}
> +	if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
> +		list_del_leaf_cfs_rq(cfs_rq);
>  }
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization
  2010-10-21  9:52   ` Bharata B Rao
@ 2010-10-21 18:25     ` Paul Turner
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Turner @ 2010-10-21 18:25 UTC (permalink / raw)
  To: bharata
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon

On Thu, Oct 21, 2010 at 2:52 AM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Fri, Oct 15, 2010 at 09:43:54PM -0700, pjt@google.com wrote:
>> Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
>> the put path since nr_running accounting occurs at deactivation.
>>
>> It's also not safe to make the removal decision based on load_avg as this fails
>> with both high periods and low shares.  Resolve this by clipping history at 8
>> foldings.
>>
>
> Did you mean 4 foldings (and not 8) above, since I see you are truncating
> the load history at 4 idle periods?

We end up folding every period/2 (after the first fold) since after
that we start at period/2 and not 0.

Thus, if we were issuing 'continuous' load updates, triggering this
condition would mean that we had folded idle periods into our load_avg
8 times.

Granted, it's possible for there to be fewer folds given sporadic load
updates, so I'll just update the comment to reference periods.
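
For the archives, a throwaway userspace sketch of that cadence (doubles and
made-up constants rather than the kernel's u64 arithmetic):

#include <stdio.h>

int main(void)
{
        double period = 10.0;           /* averaging window */
        double load_period = 9.0;       /* residue when the group went idle */
        double load_avg = 1024.0;
        double t, step = period / 100.0;
        int folds = 0;

        /* 'continuous' updates across 4 idle periods */
        for (t = 0.0; t < 4.0 * period; t += step) {
                load_period += step;
                while (load_period > period) {  /* as update_cfs_load() folds */
                        load_period /= 2;
                        load_avg /= 2;
                        folds++;
                }
        }
        printf("folds=%d load_avg=%.2f\n", folds, load_avg);
        return 0;
}

The first fold lands within half a period of going idle and each subsequent
one arrives every period/2 after that, so this prints folds=8 with load_avg
decayed by a factor of 2^8.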

Thanks

>
>
>> @@ -685,9 +686,19 @@ static void update_cfs_load(struct cfs_r
>>       now = rq_of(cfs_rq)->clock;
>>       delta = now - cfs_rq->load_stamp;
>>
>> +     /* truncate load history at 4 idle periods */
>> +     if (cfs_rq->load_stamp > cfs_rq->load_last &&
>> +         now - cfs_rq->load_last > 4 * period) {
>> +             cfs_rq->load_period = 0;
>> +             cfs_rq->load_avg = 0;
>> +     }
>> +
>>       cfs_rq->load_stamp = now;
>>       cfs_rq->load_period += delta;
>> -     cfs_rq->load_avg += delta * cfs_rq->load.weight;
>> +     if (load) {
>> +             cfs_rq->load_last = now;
>> +             cfs_rq->load_avg += delta * load;
>> +     }
>>
>>       while (cfs_rq->load_period > period) {
>>               /*
>> @@ -700,10 +711,8 @@ static void update_cfs_load(struct cfs_r
>>               cfs_rq->load_avg /= 2;
>>       }
>>
>> -     if (lb && !cfs_rq->nr_running) {
>> -             if (cfs_rq->load_avg < (period / 8))
>> -                     list_del_leaf_cfs_rq(cfs_rq);
>> -     }
>> +     if (!cfs_rq->curr && !cfs_rq->nr_running && !cfs_rq->load_avg)
>> +             list_del_leaf_cfs_rq(cfs_rq);
>>  }
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-21  6:36   ` Paul Turner
@ 2010-10-22  0:14     ` Paul Turner
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Turner @ 2010-10-22  0:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Bharata B Rao

On Wed, Oct 20, 2010 at 11:36 PM, Paul Turner <pjt@google.com> wrote:
> On Sat, Oct 16, 2010 at 12:46 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, 2010-10-15 at 21:43 -0700, pjt@google.com wrote:
>>> Hi all,
>>>
>>> Peter previously posted a patchset that attempted to improve the problem of
>>> task_group share distribution.  This is something that has been a long-time
>>> pain point for group scheduling.  The existing algorithm considers
>>> distributions on a per-cpu-per-domain basis and carries a fairly high update
>>> overhead, especially on larger machines.
>>>
>>> I was previously looking at improving this using Fenwick trees to allow a
>>> single sum without the exorbitant cost but then Peter's idea above was better :).
>>>
>>> The kernel is that by monitoring the average contribution to load on a
>>> per-cpu-per-taskgroup basis we can distribute the weight for which we are
>>> expected to consume.
>>>
>>> This set extends the original posting with a focus on increased fairness and
>>> reduced convergence (to true average) time.  In particular the case of large
>>> over-commit in the case of a distributed wake-up is a concern which is now
>>> fairly well addressed.
>>>
>>> Obviously everything's experimental but it should be stable/fair.
>>
>> I like what you've done with it, my only worry is 10/12 where you allow
>> for extra updates to the global state -- I think they should be fairly
>> limited in number, and I can see the need for the update if we get too
>> far out of whack, but it is something to look at while testing this
>> stuff.
>>
>
> So my original answer here was to only update when there was load and
> it was > n% delta, which stops 1 thread waking up and sleeping from
> thrashing it, but the 2 thread case is just as obviously broken for
> any n.  It needs a rate limit but I'm sort of loath to introduce
> _another_ set of timestamps.  I don't suppose there's much harm in
> doing so, though, and I don't think it's going to be clean to overload
> one of the existing ones, so perhaps another counter is the answer.
>
> I'll make sure this is addressed in v2.

Ahh -- I remember my original reasoning here now:

These global updates are rate limited since they only occur when we
fold the averaging period, and are thus limited to at most one per
period/2 (i.e. each time we fold).

One concern is we could get a local "storm" since several cgroups
could trip this limit at the same time.. but it's also obviously not
going to be any worse than an update_shares_cpu().

While there may be a concern on *giant* cpu systems, I think we
already have that problem in both the original and (to a lesser
extent) current approaches.

>
>>> TODO:
>>> - Validate any RT interaction
>>
>> I don't think there's anything to worry about there, the only
>> interaction which there is between this and the rt scheduling classes is
>> the initial sharing of the load-avg window, but you 'cure' that in 7/12.
>>
>> (I think that sysctl wants a _us postfix someplace and we thus want some
>> NSEC_PER_USEC multiplication in there).
>>
>
> Yes -- updated, thanks.
>
>>> - Continue collecting/analyzing performance and fairness data
>>
>> Yes please ;-), I'll try and run this on some machines as well.
>>
>>> - Should the shares period just be the sched_latency?
>>
>> Interesting idea.. lets keep it a separate sysctl for now for easy
>> tuning, if things settle down and we're still good in that range we can
>> consider merging them.
>>
>>
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution
  2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
                   ` (13 preceding siblings ...)
  2010-10-17  5:24 ` Balbir Singh
@ 2010-11-03 18:27 ` Karl Rister
  14 siblings, 0 replies; 28+ messages in thread
From: Karl Rister @ 2010-11-03 18:27 UTC (permalink / raw)
  To: pjt
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri,
	Chris Friesen, Vaidyanathan Srinivasan, Pierre Bourdon,
	Paul Turner, habanero

Hi All

Here is some performance data for the previously posted patches
running a LAMP workload in a cloud-like environment, which shows promising
reductions in CPU utilization.  In this particular test, 32 groups
comprising 64 KVM guests (each group consists of an Apache server guest
and a MySQL server guest) are running a LAMP workload being driven by 
external load drivers.  When using the default values in /etc/cgconfig.conf:

mount {
         cpuset  = /cgroup/cpuset;
         cpu     = /cgroup/cpu;
         cpuacct = /cgroup/cpuacct;
         memory  = /cgroup/memory;
         devices = /cgroup/devices;
         freezer = /cgroup/freezer;
         net_cls = /cgroup/net_cls;
         blkio   = /cgroup/blkio;
}

which enable libvirt usage of cgroups the contents of /proc/cgroups 
looks like this before launching the guests:

#subsys_name    hierarchy       num_cgroups     enabled
cpuset  1       4       1
ns      0       1       1
cpu     2       4       1
cpuacct 3       4       1
memory  4       4       1
devices 5       4       1
freezer 6       4       1
net_cls 7       1       1
blkio   8       1       1

and like this after launching the guests:

#subsys_name    hierarchy       num_cgroups     enabled
cpuset  1       68      1
ns      0       1       1
cpu     2       68      1
cpuacct 3       68      1
memory  4       68      1
devices 5       68      1
freezer 6       68      1
net_cls 7       1       1
blkio   8       1       1

When running the workload, the run with the patches used significantly
less CPU:

Host CPU utilization with patches: 54.35%
Host CPU utilization without patches: 80.89%

Since the workload uses a fixed injection rate, the achieved throughput
for both test runs was the same; however, the run with the patches
applied did achieve better quality-of-service metrics.

NOTE: The runs were made using kvm.git changeset 
cec8b6b972a572b69d4902f57fb659e8a4c749af.

-- 
Karl Rister <kmr@us.ibm.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up
       [not found]   ` <AANLkTi=zYAfb_izD15ROxH=C6+zPzX+XEGw7r5UUoAar@mail.gmail.com>
@ 2010-11-04 21:00     ` Paul Turner
  0 siblings, 0 replies; 28+ messages in thread
From: Paul Turner @ 2010-11-04 21:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Srivatsa Vaddagiri, Chris Friesen,
	Vaidyanathan Srinivasan, Pierre Bourdon, Paul Turner,
	Bharata B Rao

[resend due to html formatting creeping into first email]

On Thu, Nov 4, 2010 at 4:58 PM, Paul Turner <pjt@google.com> wrote:
>
>
> On Sat, Oct 16, 2010 at 12:43 AM, <pjt@google.com> wrote:
>>
>> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>
>> By tracking a per-cpu load-avg for each cfs_rq and folding it into a
>> global task_group load on each tick we can rework tg_shares_up to be
>> strictly per-cpu.
>>
>> This should improve cpu-cgroup performance for smp systems
>> significantly.
>>
>> [ Paul: changed to use queueing cfs_rq ]
>>
>> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
>> Signed-off-by: Paul Turner <pjt@google.com>
>>
>> ---
>>  include/linux/sched.h   |    2
>>  kernel/sched.c          |  173 ++++++++++++------------------------------------
>>  kernel/sched_debug.c    |   15 +++-
>>  kernel/sched_fair.c     |  166 +++++++++++++++++++++++++++++-----------------
>>  kernel/sched_features.h |    2
>>  kernel/sysctl.c         |   17 ----
>>  6 files changed, 163 insertions(+), 212 deletions(-)
>>
>> Index: include/linux/sched.h
>> ===================================================================
>> --- include/linux/sched.h.orig
>> +++ include/linux/sched.h
>> @@ -1868,8 +1868,6 @@ static inline void wake_up_idle_cpu(int
>>  extern unsigned int sysctl_sched_latency;
>>  extern unsigned int sysctl_sched_min_granularity;
>>  extern unsigned int sysctl_sched_wakeup_granularity;
>> -extern unsigned int sysctl_sched_shares_ratelimit;
>> -extern unsigned int sysctl_sched_shares_thresh;
>>  extern unsigned int sysctl_sched_child_runs_first;
>>
>>  enum sched_tunable_scaling {
>> Index: kernel/sched.c
>> ===================================================================
>> --- kernel/sched.c.orig
>> +++ kernel/sched.c
>> @@ -253,6 +253,8 @@ struct task_group {
>>        /* runqueue "owned" by this group on each cpu */
>>        struct cfs_rq **cfs_rq;
>>        unsigned long shares;
>> +
>> +       atomic_t load_weight;
>>  #endif
>>
>>  #ifdef CONFIG_RT_GROUP_SCHED
>> @@ -359,15 +361,11 @@ struct cfs_rq {
>>         */
>>        unsigned long h_load;
>>
>> -       /*
>> -        * this cpu's part of tg->shares
>> -        */
>> -       unsigned long shares;
>> +       u64 load_avg;
>> +       u64 load_period;
>> +       u64 load_stamp;
>>
>> -       /*
>> -        * load.weight at the time we set shares
>> -        */
>> -       unsigned long rq_weight;
>> +       unsigned long load_contribution;
>>  #endif
>>  #endif
>>  };
>> @@ -790,20 +788,6 @@ late_initcall(sched_init_debug);
>>  const_debug unsigned int sysctl_sched_nr_migrate = 32;
>>
>>  /*
>> - * ratelimit for updating the group shares.
>> - * default: 0.25ms
>> - */
>> -unsigned int sysctl_sched_shares_ratelimit = 250000;
>> -unsigned int normalized_sysctl_sched_shares_ratelimit = 250000;
>> -
>> -/*
>> - * Inject some fuzzyness into changing the per-cpu group shares
>> - * this avoids remote rq-locks at the expense of fairness.
>> - * default: 4
>> - */
>> -unsigned int sysctl_sched_shares_thresh = 4;
>> -
>> -/*
>>  * period over which we average the RT time consumption, measured
>>  * in ms.
>>  *
>> @@ -1352,6 +1336,12 @@ static inline void update_load_sub(struc
>>        lw->inv_weight = 0;
>>  }
>>
>> +static inline void update_load_set(struct load_weight *lw, unsigned long w)
>> +{
>> +       lw->weight = w;
>> +       lw->inv_weight = 0;
>> +}
>> +
>>  /*
>>  * To aid in avoiding the subversion of "niceness" due to uneven distribution
>>  * of tasks with abnormal "nice" values across CPUs the contribution that
>> @@ -1540,97 +1530,44 @@ static unsigned long cpu_avg_load_per_ta
>>
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>
>> -static __read_mostly unsigned long __percpu *update_shares_data;
>> -
>> -static void __set_se_shares(struct sched_entity *se, unsigned long shares);
>> -
>> -/*
>> - * Calculate and set the cpu's group shares.
>> - */
>> -static void update_group_shares_cpu(struct task_group *tg, int cpu,
>> -                                   unsigned long sd_shares,
>> -                                   unsigned long sd_rq_weight,
>> -                                   unsigned long *usd_rq_weight)
>> -{
>> -       unsigned long shares, rq_weight;
>> -       int boost = 0;
>> -
>> -       rq_weight = usd_rq_weight[cpu];
>> -       if (!rq_weight) {
>> -               boost = 1;
>> -               rq_weight = NICE_0_LOAD;
>> -       }
>> -
>> -       /*
>> -        *             \Sum_j shares_j * rq_weight_i
>> -        * shares_i =  -----------------------------
>> -        *                  \Sum_j rq_weight_j
>> -        */
>> -       shares = (sd_shares * rq_weight) / sd_rq_weight;
>> -       shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);
>> -
>> -       if (abs(shares - tg->se[cpu]->load.weight) >
>> -                       sysctl_sched_shares_thresh) {
>> -               struct rq *rq = cpu_rq(cpu);
>> -               unsigned long flags;
>> -
>> -               raw_spin_lock_irqsave(&rq->lock, flags);
>> -               tg->cfs_rq[cpu]->rq_weight = boost ? 0 : rq_weight;
>> -               tg->cfs_rq[cpu]->shares = boost ? 0 : shares;
>> -               __set_se_shares(tg->se[cpu], shares);
>> -               raw_spin_unlock_irqrestore(&rq->lock, flags);
>> -       }
>> -}
>> +static void update_cfs_load(struct cfs_rq *cfs_rq);
>> +static void update_cfs_shares(struct cfs_rq *cfs_rq);
>>
>>  /*
>> - * Re-compute the task group their per cpu shares over the given domain.
>> - * This needs to be done in a bottom-up fashion because the rq weight of a
>> - * parent group depends on the shares of its child groups.
>> + * update tg->load_weight by folding this cpu's load_avg
>>  */
>>  static int tg_shares_up(struct task_group *tg, void *data)
>>  {
>> -       unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
>> -       unsigned long *usd_rq_weight;
>> -       struct sched_domain *sd = data;
>> +       long load_avg;
>> +       struct cfs_rq *cfs_rq;
>>        unsigned long flags;
>> -       int i;
>> +       int cpu = (long)data;
>> +       struct rq *rq;
>>
>> -       if (!tg->se[0])
>> +       if (!tg->se[cpu])
>>                return 0;
>>
>> -       local_irq_save(flags);
>> -       usd_rq_weight = per_cpu_ptr(update_shares_data, smp_processor_id());
>> -
>> -       for_each_cpu(i, sched_domain_span(sd)) {
>> -               weight = tg->cfs_rq[i]->load.weight;
>> -               usd_rq_weight[i] = weight;
>> -
>> -               rq_weight += weight;
>> -               /*
>> -                * If there are currently no tasks on the cpu pretend there
>> -                * is one of average load so that when a new task gets to
>> -                * run here it will not get delayed by group starvation.
>> -                */
>> -               if (!weight)
>> -                       weight = NICE_0_LOAD;
>> +       rq = cpu_rq(cpu);
>> +       cfs_rq = tg->cfs_rq[cpu];
>>
>> -               sum_weight += weight;
>> -               shares += tg->cfs_rq[i]->shares;
>> -       }
>> +       raw_spin_lock_irqsave(&rq->lock, flags);
>>
>> -       if (!rq_weight)
>> -               rq_weight = sum_weight;
>> +       update_rq_clock(rq);
>> +       update_cfs_load(cfs_rq);
>>
>> -       if ((!shares && rq_weight) || shares > tg->shares)
>> -               shares = tg->shares;
>> +       load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
>> +       load_avg -= cfs_rq->load_contribution;
>>
>> -       if (!sd->parent || !(sd->parent->flags & SD_LOAD_BALANCE))
>> -               shares = tg->shares;
>> +       atomic_add(load_avg, &tg->load_weight);
>> +       cfs_rq->load_contribution += load_avg;
>>
>> -       for_each_cpu(i, sched_domain_span(sd))
>> -               update_group_shares_cpu(tg, i, shares, rq_weight, usd_rq_weight);
>> +       /*
>> +        * We need to update shares after updating tg->load_weight in
>> +        * order to adjust the weight of groups with long running tasks.
>> +        */
>> +       update_cfs_shares(cfs_rq);
>>
>> -       local_irq_restore(flags);
>> +       raw_spin_unlock_irqrestore(&rq->lock, flags);
>>
>>        return 0;
>>  }
>> @@ -1649,7 +1586,7 @@ static int tg_load_down(struct task_grou
>>                load = cpu_rq(cpu)->load.weight;
>>        } else {
>>                load = tg->parent->cfs_rq[cpu]->h_load;
>> -               load *= tg->cfs_rq[cpu]->shares;
>> +               load *= tg->se[cpu]->load.weight;
>>                load /= tg->parent->cfs_rq[cpu]->load.weight + 1;
>>        }
>>
>> @@ -1658,21 +1595,16 @@ static int tg_load_down(struct task_grou
>>        return 0;
>>  }
>>
>> -static void update_shares(struct sched_domain *sd)
>> +static void update_shares(long cpu)
>>  {
>> -       s64 elapsed;
>> -       u64 now;
>> -
>>        if (root_task_group_empty())
>>                return;
>>
>> -       now = local_clock();
>> -       elapsed = now - sd->last_update;
>> +       /*
>> +        * XXX: replace with an on-demand list
>> +        */
>>
>> -       if (elapsed >= (s64)(u64)sysctl_sched_shares_ratelimit) {
>> -               sd->last_update = now;
>> -               walk_tg_tree(tg_nop, tg_shares_up, sd);
>> -       }
>> +       walk_tg_tree(tg_nop, tg_shares_up, (void *)cpu);
>>  }
>>
>>  static void update_h_load(long cpu)
>> @@ -1682,7 +1614,7 @@ static void update_h_load(long cpu)
>>
>>  #else
>>
>> -static inline void update_shares(struct sched_domain *sd)
>> +static inline void update_shares(int cpu)
>>  {
>>  }
>>
>> @@ -1807,15 +1739,6 @@ static void double_rq_unlock(struct rq *
>>
>>  #endif
>>
>> -#ifdef CONFIG_FAIR_GROUP_SCHED
>> -static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
>> -{
>> -#ifdef CONFIG_SMP
>> -       cfs_rq->shares = shares;
>> -#endif
>> -}
>> -#endif
>> -
>>  static void calc_load_account_idle(struct rq *this_rq);
>>  static void update_sysctl(void);
>>  static int get_update_sysctl_factor(void);
>> @@ -5404,7 +5327,6 @@ static void update_sysctl(void)
>>        SET_SYSCTL(sched_min_granularity);
>>        SET_SYSCTL(sched_latency);
>>        SET_SYSCTL(sched_wakeup_granularity);
>> -       SET_SYSCTL(sched_shares_ratelimit);
>>  #undef SET_SYSCTL
>>  }
>>
>> @@ -7721,8 +7643,7 @@ static void init_tg_cfs_entry(struct tas
>>                se->cfs_rq = parent->my_q;
>>
>>        se->my_q = cfs_rq;
>> -       se->load.weight = tg->shares;
>> -       se->load.inv_weight = 0;
>> +       update_load_set(&se->load, tg->shares);
>>        se->parent = parent;
>>  }
>>  #endif
>> @@ -7815,10 +7736,6 @@ void __init sched_init(void)
>>
>>  #endif /* CONFIG_CGROUP_SCHED */
>>
>> -#if defined CONFIG_FAIR_GROUP_SCHED && defined CONFIG_SMP
>> -       update_shares_data = __alloc_percpu(nr_cpu_ids * sizeof(unsigned long),
>> -                                           __alignof__(unsigned long));
>> -#endif
>>        for_each_possible_cpu(i) {
>>                struct rq *rq;
>>
>> @@ -8386,8 +8303,7 @@ static void __set_se_shares(struct sched
>>        if (on_rq)
>>                dequeue_entity(cfs_rq, se, 0);
>>
>> -       se->load.weight = shares;
>> -       se->load.inv_weight = 0;
>> +       update_load_set(&se->load, shares);
>>
>>        if (on_rq)
>>                enqueue_entity(cfs_rq, se, 0);
>> @@ -8444,7 +8360,6 @@ int sched_group_set_shares(struct task_g
>>                /*
>>                 * force a rebalance
>>                 */
>> -               cfs_rq_set_shares(tg->cfs_rq[i], 0);
>>                set_se_shares(tg->se[i], shares);
>>        }
>>
>> Index: kernel/sched_debug.c
>> ===================================================================
>> --- kernel/sched_debug.c.orig
>> +++ kernel/sched_debug.c
>> @@ -202,15 +202,22 @@ void print_cfs_rq(struct seq_file *m, in
>>        spread0 = min_vruntime - rq0_min_vruntime;
>>        SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "spread0",
>>                        SPLIT_NS(spread0));
>> -       SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
>> -       SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
>> -
>>        SEQ_printf(m, "  .%-30s: %d\n", "nr_spread_over",
>>                        cfs_rq->nr_spread_over);
>> +       SEQ_printf(m, "  .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
>> +       SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  #ifdef CONFIG_SMP
>> -       SEQ_printf(m, "  .%-30s: %lu\n", "shares", cfs_rq->shares);
>> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_avg",
>> +                       SPLIT_NS(cfs_rq->load_avg));
>> +       SEQ_printf(m, "  .%-30s: %Ld.%06ld\n", "load_period",
>> +                       SPLIT_NS(cfs_rq->load_period));
>> +       SEQ_printf(m, "  .%-30s: %ld\n", "load_contrib",
>> +                       cfs_rq->load_contribution);
>> +       SEQ_printf(m, "  .%-30s: %d\n", "load_tg",
>> +                       atomic_read(&tg->load_weight));
>>  #endif
>> +
>>        print_cfs_group_stats(m, cpu, cfs_rq->tg);
>>  #endif
>>  }
>> Index: kernel/sched_fair.c
>> ===================================================================
>> --- kernel/sched_fair.c.orig
>> +++ kernel/sched_fair.c
>> @@ -417,7 +417,6 @@ int sched_proc_update_handler(struct ctl
>>        WRT_SYSCTL(sched_min_granularity);
>>        WRT_SYSCTL(sched_latency);
>>        WRT_SYSCTL(sched_wakeup_granularity);
>> -       WRT_SYSCTL(sched_shares_ratelimit);
>>  #undef WRT_SYSCTL
>>
>>        return 0;
>> @@ -633,7 +632,6 @@ account_entity_enqueue(struct cfs_rq *cf
>>                list_add(&se->group_node, &cfs_rq->tasks);
>>        }
>>        cfs_rq->nr_running++;
>> -       se->on_rq = 1;
>>  }
>>
>>  static void
>> @@ -647,9 +645,89 @@ account_entity_dequeue(struct cfs_rq *cf
>>                list_del_init(&se->group_node);
>>        }
>>        cfs_rq->nr_running--;
>> -       se->on_rq = 0;
>>  }
>>
>> +#if defined CONFIG_SMP && defined CONFIG_FAIR_GROUP_SCHED
>> +static void update_cfs_load(struct cfs_rq *cfs_rq)
>> +{
>> +       u64 period = sched_avg_period();
>> +       u64 now, delta;
>> +
>> +       if (!cfs_rq)
>> +               return;
>> +
>> +       now = rq_of(cfs_rq)->clock;
>> +       delta = now - cfs_rq->load_stamp;
>> +
>> +       cfs_rq->load_stamp = now;
>> +       cfs_rq->load_period += delta;
>> +       cfs_rq->load_avg += delta * cfs_rq->load.weight;
>> +
>> +       while (cfs_rq->load_period > period) {
>> +               /*
>> +                * Inline assembly required to prevent the compiler
>> +                * optimising this loop into a divmod call.
>> +                * See __iter_div_u64_rem() for another example of this.
>> +                */
>> +               asm("" : "+rm" (cfs_rq->load_period));
>> +               cfs_rq->load_period /= 2;
>> +               cfs_rq->load_avg /= 2;
>> +       }
>> +}
>> +
>> +static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> +                           unsigned long weight)
>> +{
>> +       if (se->on_rq)
>> +               account_entity_dequeue(cfs_rq, se);
>> +
>> +       update_load_set(&se->load, weight);
>> +
>> +       if (se->on_rq)
>> +               account_entity_enqueue(cfs_rq, se);
>> +}
>> +
>> +static void update_cfs_shares(struct cfs_rq *cfs_rq)
>> +{
>> +       struct task_group *tg;
>> +       struct sched_entity *se;
>> +       long load_weight, load, shares;
>> +
>> +       if (!cfs_rq)
>> +               return;
>> +
>> +       tg = cfs_rq->tg;
>> +       se = tg->se[cpu_of(rq_of(cfs_rq))];
>> +       if (!se)
>> +               return;
>> +
>> +       load = cfs_rq->load.weight;
>> +
>> +       load_weight = atomic_read(&tg->load_weight);
>> +       load_weight -= cfs_rq->load_contribution;
>> +       load_weight += load;
>> +
>> +       shares = (tg->shares * load);
>> +       if (load_weight)
>> +               shares /= load_weight;
>> +
>> +       if (shares < MIN_SHARES)
>> +               shares = MIN_SHARES;
>> +       if (shares > tg->shares)
>> +               shares = tg->shares;
>> +
>> +       reweight_entity(cfs_rq_of(se), se, shares);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline void update_cfs_load(struct cfs_rq *cfs_rq)
>> +{
>> +}
>> +
>> +static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
>> +{
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  {
>>  #ifdef CONFIG_SCHEDSTATS
>> @@ -771,7 +849,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>         * Update run-time statistics of the 'current'.
>>         */
>>        update_curr(cfs_rq);
>> +       update_cfs_load(cfs_rq);
>>        account_entity_enqueue(cfs_rq, se);
>> +       update_cfs_shares(cfs_rq_of(se));
>>
>>        if (flags & ENQUEUE_WAKEUP) {
>>                place_entity(cfs_rq, se, 0);
>> @@ -782,6 +862,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>        check_spread(cfs_rq, se);
>>        if (se != cfs_rq->curr)
>>                __enqueue_entity(cfs_rq, se);
>> +       se->on_rq = 1;
>>  }
>>
>>  static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> @@ -825,8 +906,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>>
>>        if (se != cfs_rq->curr)
>>                __dequeue_entity(cfs_rq, se);
>> +       se->on_rq = 0;
>> +       update_cfs_load(cfs_rq);
>>        account_entity_dequeue(cfs_rq, se);
>>        update_min_vruntime(cfs_rq);
>> +       update_cfs_shares(cfs_rq_of(se));
>>
>>        /*
>>         * Normalize the entity after updating the min_vruntime because the
>> @@ -1055,6 +1139,13 @@ enqueue_task_fair(struct rq *rq, struct
>>                flags = ENQUEUE_WAKEUP;
>>        }
>>
>> +       for_each_sched_entity(se) {
>> +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> +               update_cfs_load(cfs_rq);
>> +               update_cfs_shares(cfs_rq);
>> +       }
>> +
>>        hrtick_update(rq);
>>  }
>>
>> @@ -1071,12 +1162,20 @@ static void dequeue_task_fair(struct rq
>>        for_each_sched_entity(se) {
>>                cfs_rq = cfs_rq_of(se);
>>                dequeue_entity(cfs_rq, se, flags);
>> +
>>                /* Don't dequeue parent if it has other entities besides us */
>>                if (cfs_rq->load.weight)
>>                        break;
>>                flags |= DEQUEUE_SLEEP;
>>        }
>>
>> +       for_each_sched_entity(se) {
>> +               struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> +               update_cfs_load(cfs_rq);
>> +               update_cfs_shares(cfs_rq);
>> +       }
>> +
>>        hrtick_update(rq);
>>  }
>>
>> @@ -1143,51 +1242,20 @@ static void task_waking_fair(struct rq *
>>  * Adding load to a group doesn't make a group heavier, but can cause movement
>>  * of group shares between cpus. Assuming the shares were perfectly aligned one
>>  * can calculate the shift in shares.
>> - *
>> - * The problem is that perfectly aligning the shares is rather expensive, hence
>> - * we try to avoid doing that too often - see update_shares(), which ratelimits
>> - * this change.
>> - *
>> - * We compensate this by not only taking the current delta into account, but
>> - * also considering the delta between when the shares were last adjusted and
>> - * now.
>> - *
>> - * We still saw a performance dip, some tracing learned us that between
>> - * cgroup:/ and cgroup:/foo balancing the number of affine wakeups increased
>> - * significantly. Therefore try to bias the error in direction of failing
>> - * the affine wakeup.
>> - *
>>  */
>> -static long effective_load(struct task_group *tg, int cpu,
>> -               long wl, long wg)
>> +static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
>>  {
>>        struct sched_entity *se = tg->se[cpu];
>>
>>        if (!tg->parent)
>>                return wl;
>>
>> -       /*
>> -        * By not taking the decrease of shares on the other cpu into
>> -        * account our error leans towards reducing the affine wakeups.
>> -        */
>> -       if (!wl && sched_feat(ASYM_EFF_LOAD))
>> -               return wl;
>> -
>>        for_each_sched_entity(se) {
>>                long S, rw, s, a, b;
>> -               long more_w;
>> -
>> -               /*
>> -                * Instead of using this increment, also add the difference
>> -                * between when the shares were last updated and now.
>> -                */
>> -               more_w = se->my_q->load.weight - se->my_q->rq_weight;
>> -               wl += more_w;
>> -               wg += more_w;
>>
>> -               S = se->my_q->tg->shares;
>> -               s = se->my_q->shares;
>> -               rw = se->my_q->rq_weight;
>> +               S = tg->shares;
>
At the least, this needs to be se->my_q->tg->shares (versus tg->shares),
since tg stays fixed as we walk up the tree.
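
A minimal sketch of the loop with that one change applied (illustrative
only, not the patch as posted):

	for_each_sched_entity(se) {
		long S, rw, s, a, b;

		/* take the shares of the group owning this level's
		 * queue; se->my_q->tg changes as we walk up, while the
		 * tg argument stays the original (bottom) group */
		S = se->my_q->tg->shares;
		s = se->load.weight;
		rw = se->my_q->load.weight;

		a = S*(rw + wl);
		b = S*rw + s*wg;
		/* ... remainder of the loop unchanged ... */
	}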
>>
>> +               s = se->load.weight;
>> +               rw = se->my_q->load.weight;
>
I suspect we could get better accuracy if we used the new knowledge of
global weight more directly (which we can get from the
load_contribution without having to bounce tg->load_weight around).

This path needs some attention in general, though, so I'll leave such
musings to a second patchset, which will focus on improving group
wake_affine behavior.
>
>
>>
>>                a = S*(rw + wl);
>>                b = S*rw + s*wg;
>> @@ -1508,23 +1576,6 @@ select_task_rq_fair(struct rq *rq, struc
>>                        sd = tmp;
>>        }
>>
>> -#ifdef CONFIG_FAIR_GROUP_SCHED
>> -       if (sched_feat(LB_SHARES_UPDATE)) {
>> -               /*
>> -                * Pick the largest domain to update shares over
>> -                */
>> -               tmp = sd;
>> -               if (affine_sd && (!tmp || affine_sd->span_weight > sd->span_weight))
>> -                       tmp = affine_sd;
>> -
>> -               if (tmp) {
>> -                       raw_spin_unlock(&rq->lock);
>> -                       update_shares(tmp);
>> -                       raw_spin_lock(&rq->lock);
>> -               }
>> -       }
>> -#endif
>> -
>>        if (affine_sd) {
>>                if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
>>                        return select_idle_sibling(p, cpu);
>> @@ -2977,7 +3028,6 @@ static int load_balance(int this_cpu, st
>>        schedstat_inc(sd, lb_count[idle]);
>>
>>  redo:
>> -       update_shares(sd);
>>        group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
>>                                   cpus, balance);
>>
>> @@ -3119,8 +3169,6 @@ out_one_pinned:
>>        else
>>                ld_moved = 0;
>>  out:
>> -       if (ld_moved)
>> -               update_shares(sd);
>>        return ld_moved;
>>  }
>>
>> @@ -3514,6 +3562,8 @@ static void rebalance_domains(int cpu, e
>>        int update_next_balance = 0;
>>        int need_serialize;
>>
>> +       update_shares(cpu);
>> +
>>        for_each_domain(cpu, sd) {
>>                if (!(sd->flags & SD_LOAD_BALANCE))
>>                        continue;
>> Index: kernel/sched_features.h
>> ===================================================================
>> --- kernel/sched_features.h.orig
>> +++ kernel/sched_features.h
>> @@ -52,8 +52,6 @@ SCHED_FEAT(ARCH_POWER, 0)
>>  SCHED_FEAT(HRTICK, 0)
>>  SCHED_FEAT(DOUBLE_TICK, 0)
>>  SCHED_FEAT(LB_BIAS, 1)
>> -SCHED_FEAT(LB_SHARES_UPDATE, 1)
>> -SCHED_FEAT(ASYM_EFF_LOAD, 1)
>>
>>  /*
>>  * Spin-wait on mutex acquisition when the mutex owner is running on
>> Index: kernel/sysctl.c
>> ===================================================================
>> --- kernel/sysctl.c.orig
>> +++ kernel/sysctl.c
>> @@ -307,15 +307,6 @@ static struct ctl_table kern_table[] = {
>>                .extra2         = &max_wakeup_granularity_ns,
>>        },
>>        {
>> -               .procname       = "sched_shares_ratelimit",
>> -               .data           = &sysctl_sched_shares_ratelimit,
>> -               .maxlen         = sizeof(unsigned int),
>> -               .mode           = 0644,
>> -               .proc_handler   = sched_proc_update_handler,
>> -               .extra1         = &min_sched_shares_ratelimit,
>> -               .extra2         = &max_sched_shares_ratelimit,
>> -       },
>> -       {
>>                .procname       = "sched_tunable_scaling",
>>                .data           = &sysctl_sched_tunable_scaling,
>>                .maxlen         = sizeof(enum sched_tunable_scaling),
>> @@ -325,14 +316,6 @@ static struct ctl_table kern_table[] = {
>>                .extra2         = &max_sched_tunable_scaling,
>>        },
>>        {
>> -               .procname       = "sched_shares_thresh",
>> -               .data           = &sysctl_sched_shares_thresh,
>> -               .maxlen         = sizeof(unsigned int),
>> -               .mode           = 0644,
>> -               .proc_handler   = proc_dointvec_minmax,
>> -               .extra1         = &zero,
>> -       },
>> -       {
>>                .procname       = "sched_migration_cost",
>>                .data           = &sysctl_sched_migration_cost,
>>                .maxlen         = sizeof(unsigned int),
>>
>> --
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-11-04 21:01 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-16  4:43 [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 01/12] sched: rewrite tg_shares_up pjt
2010-10-21  6:04   ` Bharata B Rao
2010-10-21  6:28     ` Paul Turner
2010-10-21  8:08   ` Bharata B Rao
2010-10-21  8:38     ` Paul Turner
2010-10-21  9:08     ` Peter Zijlstra
     [not found]   ` <AANLkTi=zYAfb_izD15ROxH=C6+zPzX+XEGw7r5UUoAar@mail.gmail.com>
2010-11-04 21:00     ` Paul Turner
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 02/12] sched: on-demand (active) cfs_rq list pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 03/12] sched: make tg_shares_up() walk on-demand pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 04/12] sched: fix load corruption from update_cfs_shares pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 05/12] sched: fix update_cfs_load synchronization pjt
2010-10-21  9:52   ` Bharata B Rao
2010-10-21 18:25     ` Paul Turner
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 06/12] sched: hierarchal order on shares update list pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 07/12] sched: add sysctl_sched_shares_window pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 08/12] sched: update shares on idle_balance pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 09/12] sched: demand based update_cfs_load() pjt
2010-10-16  4:43 ` [RFC tg_shares_up improvements - v1 10/12] sched: allow update_cfs_load to update global load pjt
2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 11/12] sched: update tg->shares after cpu.shares write pjt
2010-10-16  4:44 ` [RFC tg_shares_up improvements - v1 12/12] debug: export effective shares for analysis versus specified pjt
2010-10-16 19:46 ` [RFC tg_shares_up improvements - v1 00/12] [RFC tg_shares_up - v1 00/12] Reducing cost of tg->shares distribution Peter Zijlstra
2010-10-21  6:36   ` Paul Turner
2010-10-22  0:14     ` Paul Turner
2010-10-17  5:24 ` Balbir Singh
2010-10-17  9:38   ` Peter Zijlstra
2010-10-17 12:09     ` Balbir Singh
2010-11-03 18:27 ` Karl Rister

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox