* [patch -rt 00/17] [patch -rt] Sched load balance backport
@ 2009-10-22 12:37 dino
2009-10-22 12:37 ` [patch -rt 01/17] sched: restore __cpu_power to a straight sum of power dino
` (16 more replies)
0 siblings, 17 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
Problem: The current -rt kernel, 2.6.31.4-rt14, has several load-balancing issues.
They can easily be seen by running dbench with the same number of processes
as the number of cpus on an SMP system. The dbench threads run as
SCHED_OTHER processes. Several threads end up running on the same cpu even
though other cpus are idling, which results in a severe throughput regression.
Peter Zijlstra posted several load-balance patches meant for mainline a
while ago. The following patch series is a backport of those patches to
2.6.31.4-rt14. Peter's original patches can be found at
http://marc.info/?l=linux-kernel&m=125198436208787&w=2
Patches 1 through 14 are backports of the original patches, except
patch 10, which contains the original code plus bug fixes.
Patches 15 - 17 are various relevant fixes that have gone in on top
of the original patches.
With these patches, load balancing improves considerably on -rt.
However, it does not completely resolve the problem; this is still
under investigation.
The series has been stress tested on x86_64 and i686.
-Dinakar
--
* [patch -rt 01/17] sched: restore __cpu_power to a straight sum of power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-1.patch --]
[-- Type: text/plain, Size: 2390 bytes --]
cpu_power is supposed to be a representation of the processing capacity
of the cpu, not a value to tweak randomly in order to affect
placement.
Remove the placement hacks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1 +
include/linux/topology.h | 1 +
kernel/sched.c | 34 ++++++++++++++++++----------------
3 files changed, 20 insertions(+), 16 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 08:56:17.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:18.000000000 -0400
@@ -8670,15 +8670,13 @@
* there are asymmetries in the topology. If there are asymmetries, group
* having more cpu_power will pickup more load compared to the group having
* less cpu_power.
- *
- * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
- * the maximum number of tasks a group can handle in the presence of other idle
- * or lightly loaded groups in the same sched domain.
*/
static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
struct sched_domain *child;
struct sched_group *group;
+ long power;
+ int weight;
WARN_ON(!sd || !sd->groups);
@@ -8689,22 +8687,20 @@
sd->groups->__cpu_power = 0;
- /*
- * For perf policy, if the groups in child domain share resources
- * (for example cores sharing some portions of the cache hierarchy
- * or SMT), then set this domain groups cpu_power such that each group
- * can handle only one task, when there are other idle groups in the
- * same sched domain.
- */
- if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
- (child->flags &
- (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
- sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+ if (!child) {
+ power = SCHED_LOAD_SCALE;
+ weight = cpumask_weight(sched_domain_span(sd));
+ /*
+ * SMT siblings share the power of a single core.
+ */
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+ power /= weight;
+ sg_inc_cpu_power(sd->groups, power);
return;
}
/*
- * add cpu_power of each child group to this groups cpu_power
+ * Add cpu_power of each child group to this groups cpu_power.
*/
group = child->groups;
do {
--
* [patch -rt 02/17] sched: SD_PREFER_SIBLING
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-2.patch --]
[-- Type: text/plain, Size: 4002 bytes --]
Express the placement preference through SD flags instead of cpu_power tweaks
XXX: consider degenerate bits
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 29 +++++++++++++++--------------
kernel/sched.c | 14 +++++++++++++-
2 files changed, 28 insertions(+), 15 deletions(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:18.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:30.000000000 -0400
@@ -843,18 +843,19 @@
#define SCHED_LOAD_SCALE_FUZZ SCHED_LOAD_SCALE
#ifdef CONFIG_SMP
-#define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE 2 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 4 /* Balance on exec */
-#define SD_BALANCE_FORK 8 /* Balance on fork, clone */
-#define SD_WAKE_IDLE 16 /* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE 32 /* Wake task to waking CPU */
-#define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
-#define SD_POWERSAVINGS_BALANCE 256 /* Balance for power savings */
-#define SD_SHARE_PKG_RESOURCES 512 /* Domain members share cpu pkg resources */
-#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
-#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
+#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
+#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
+#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
+#define SD_WAKE_IDLE 0x0010 /* Wake to idle CPU on task wakeup */
+#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_WAKE_BALANCE 0x0040 /* Perform balancing at task wakeup */
+#define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
+#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */
+#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
+#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
+#define SD_WAKE_IDLE_FAR 0x0800 /* Gain latency sacrificing cache hit */
+#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
enum powersavings_balance_level {
POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
@@ -874,7 +875,7 @@
if (sched_smt_power_savings)
return SD_POWERSAVINGS_BALANCE;
- return 0;
+ return SD_PREFER_SIBLING;
}
static inline int sd_balance_for_package_power(void)
@@ -882,7 +883,7 @@
if (sched_mc_power_savings | sched_smt_power_savings)
return SD_POWERSAVINGS_BALANCE;
- return 0;
+ return SD_PREFER_SIBLING;
}
/*
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:18.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:30.000000000 -0400
@@ -3892,9 +3892,13 @@
const struct cpumask *cpus, int *balance,
struct sd_lb_stats *sds)
{
+ struct sched_domain *child = sd->child;
struct sched_group *group = sd->groups;
struct sg_lb_stats sgs;
- int load_idx;
+ int load_idx, prefer_sibling = 0;
+
+ if (child && child->flags & SD_PREFER_SIBLING)
+ prefer_sibling = 1;
init_sd_power_savings_stats(sd, sds, idle);
load_idx = get_sd_load_idx(sd, idle);
@@ -3914,6 +3918,14 @@
sds->total_load += sgs.group_load;
sds->total_pwr += group->__cpu_power;
+ /*
+ * In case the child domain prefers tasks go to siblings
+ * first, lower the group capacity to one so that we'll try
+ * and move all the excess tasks away.
+ */
+ if (prefer_sibling)
+ sgs.group_capacity = 1;
+
if (local_group) {
sds->this_load = sgs.avg_load;
sds->this = group;
--
* [patch -rt 03/17] sched: update the cpu_power sum during load-balance
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-3.patch --]
[-- Type: text/plain, Size: 2625 bytes --]
In order to prepare for a more dynamic cpu_power, update the group sum
while walking the sched domains during load-balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:30.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:32.000000000 -0400
@@ -3780,6 +3780,28 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
+static void update_sched_power(struct sched_domain *sd)
+{
+ struct sched_domain *child = sd->child;
+ struct sched_group *group, *sdg = sd->groups;
+ unsigned long power = sdg->__cpu_power;
+
+ if (!child) {
+ /* compute cpu power for this cpu */
+ return;
+ }
+
+ sdg->__cpu_power = 0;
+
+ group = child->groups;
+ do {
+ sdg->__cpu_power += group->__cpu_power;
+ group = group->next;
+ } while (group != child->groups);
+
+ if (power != sdg->__cpu_power)
+ sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
+}
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
@@ -3793,7 +3815,8 @@
* @balance: Should we balance.
* @sgs: variable to hold the statistics for this group.
*/
-static inline void update_sg_lb_stats(struct sched_group *group, int this_cpu,
+static inline void update_sg_lb_stats(struct sched_domain *sd,
+ struct sched_group *group, int this_cpu,
enum cpu_idle_type idle, int load_idx, int *sd_idle,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
@@ -3804,8 +3827,11 @@
unsigned long sum_avg_load_per_task;
unsigned long avg_load_per_task;
- if (local_group)
+ if (local_group) {
balance_cpu = group_first_cpu(group);
+ if (balance_cpu == this_cpu)
+ update_sched_power(sd);
+ }
/* Tally up the load of all CPUs in the group */
sum_avg_load_per_task = avg_load_per_task = 0;
@@ -3909,7 +3935,7 @@
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));
memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(group, this_cpu, idle, load_idx, sd_idle,
+ update_sg_lb_stats(sd, group, this_cpu, idle, load_idx, sd_idle,
local_group, cpus, balance, &sgs);
if (local_group && balance && !(*balance))
@@ -3944,7 +3970,6 @@
update_sd_power_savings_stats(group, sds, local_group, &sgs);
group = group->next;
} while (group != sd->groups);
-
}
/**
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch -rt 04/17] sched: add smt_gain
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-1b.patch --]
[-- Type: text/plain, Size: 2175 bytes --]
The idea is that multi-threading a core yields more work capacity than
a single thread; provide a way to express a static gain for threads.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1 +
include/linux/topology.h | 1 +
kernel/sched.c | 8 +++++++-
3 files changed, 9 insertions(+), 1 deletion(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:30.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:34.000000000 -0400
@@ -966,6 +966,7 @@
unsigned int newidle_idx;
unsigned int wake_idx;
unsigned int forkexec_idx;
+ unsigned int smt_gain;
int flags; /* See SD_* */
enum sched_domain_level level;
Index: linux-2.6.31.4-rt14/include/linux/topology.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/topology.h 2009-10-16 09:15:16.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/topology.h 2009-10-16 09:15:34.000000000 -0400
@@ -99,6 +99,7 @@
| SD_SHARE_CPUPOWER, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .smt_gain = 1178, /* 15% */ \
}
#endif
#endif /* CONFIG_SCHED_SMT */
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:32.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:34.000000000 -0400
@@ -8729,9 +8729,15 @@
weight = cpumask_weight(sched_domain_span(sd));
/*
* SMT siblings share the power of a single core.
+ * Usually multiple threads get a better yield out of
+ * that one core than a single thread would have,
+ * reflect that in sd->smt_gain.
*/
- if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+ power *= sd->smt_gain;
power /= weight;
+ power >>= SCHED_LOAD_SHIFT;
+ }
sg_inc_cpu_power(sd->groups, power);
return;
}
--
* [patch -rt 05/17] sched: dynamic cpu_power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-4.patch --]
[-- Type: text/plain, Size: 2053 bytes --]
Recompute the cpu_power for each cpu during load-balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:34.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:35.000000000 -0400
@@ -3780,14 +3780,46 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-static void update_sched_power(struct sched_domain *sd)
+unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+{
+ unsigned long weight = cpumask_weight(sched_domain_span(sd));
+ unsigned long smt_gain = sd->smt_gain;
+
+ smt_gain /= weight;
+
+ return smt_gain;
+}
+
+static void update_cpu_power(struct sched_domain *sd, int cpu)
+{
+ unsigned long weight = cpumask_weight(sched_domain_span(sd));
+ unsigned long power = SCHED_LOAD_SCALE;
+ struct sched_group *sdg = sd->groups;
+ unsigned long old = sdg->__cpu_power;
+
+ /* here we could scale based on cpufreq */
+
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+ power *= arch_smt_gain(sd, cpu);
+ power >>= SCHED_LOAD_SHIFT;
+ }
+
+ /* here we could scale based on RT time */
+
+ if (power != old) {
+ sdg->__cpu_power = power;
+ sdg->reciprocal_cpu_power = reciprocal_value(power);
+ }
+}
+
+static void update_group_power(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
unsigned long power = sdg->__cpu_power;
if (!child) {
- /* compute cpu power for this cpu */
+ update_cpu_power(sd, cpu);
return;
}
@@ -3830,7 +3862,7 @@
if (local_group) {
balance_cpu = group_first_cpu(group);
if (balance_cpu == this_cpu)
- update_sched_power(sd);
+ update_group_power(sd, this_cpu);
}
/* Tally up the load of all CPUs in the group */
--
* [patch -rt 06/17] sched: scale down cpu_power due to RT tasks
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-5.patch --]
[-- Type: text/plain, Size: 5365 bytes --]
Keep an average of the amount of time spent on RT tasks and use that
fraction to scale down the cpu_power for regular tasks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1
kernel/sched.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_rt.c | 6 +---
kernel/sysctl.c | 8 ++++++
4 files changed, 72 insertions(+), 7 deletions(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:34.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:36.000000000 -0400
@@ -1915,6 +1915,7 @@
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
int sched_nr_latency_handler(struct ctl_table *table, int write,
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:35.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:36.000000000 -0400
@@ -673,6 +673,9 @@
struct task_struct *migration_thread;
struct list_head migration_queue;
+
+ u64 rt_avg;
+ u64 age_stamp;
#endif
/* calc_load related fields */
@@ -927,6 +930,14 @@
unsigned int sysctl_sched_shares_thresh = 4;
/*
+ * period over which we average the RT time consumption, measured
+ * in ms.
+ *
+ * default: 1s
+ */
+const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
+
+/*
* period over which we measure -rt task cpu usage in us.
* default: 1s
*/
@@ -1370,12 +1381,37 @@
}
#endif /* CONFIG_NO_HZ */
+static u64 sched_avg_period(void)
+{
+ return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
+}
+
+static void sched_avg_update(struct rq *rq)
+{
+ s64 period = sched_avg_period();
+
+ while ((s64)(rq->clock - rq->age_stamp) > period) {
+ rq->age_stamp += period;
+ rq->rt_avg /= 2;
+ }
+}
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+ rq->rt_avg += rt_delta;
+ sched_avg_update(rq);
+}
+
#else /* !CONFIG_SMP */
static void resched_task(struct task_struct *p)
{
assert_atomic_spin_locked(&task_rq(p)->lock);
set_tsk_need_resched(p);
}
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+}
#endif /* CONFIG_SMP */
#if BITS_PER_LONG == 32
@@ -3780,7 +3816,7 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long smt_gain = sd->smt_gain;
@@ -3790,6 +3826,24 @@
return smt_gain;
}
+unsigned long scale_rt_power(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u64 total, available;
+
+ sched_avg_update(rq);
+
+ total = sched_avg_period() + (rq->clock - rq->age_stamp);
+ available = total - rq->rt_avg;
+
+ if (unlikely((s64)total < SCHED_LOAD_SCALE))
+ total = SCHED_LOAD_SCALE;
+
+ total >>= SCHED_LOAD_SHIFT;
+
+ return div_u64(available, total);
+}
+
static void update_cpu_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
@@ -3800,11 +3854,15 @@
/* here we could scale based on cpufreq */
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
- power *= arch_smt_gain(sd, cpu);
+ power *= arch_scale_smt_power(sd, cpu);
power >>= SCHED_LOAD_SHIFT;
}
- /* here we could scale based on RT time */
+ power *= scale_rt_power(cpu);
+ power >>= SCHED_LOAD_SHIFT;
+
+ if (!power)
+ power = 1;
if (power != old) {
sdg->__cpu_power = power;
Index: linux-2.6.31.4-rt14/kernel/sched_rt.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched_rt.c 2009-10-16 09:15:15.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched_rt.c 2009-10-16 09:15:36.000000000 -0400
@@ -602,6 +602,8 @@
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);
+ sched_rt_avg_update(rq, delta_exec);
+
if (!rt_bandwidth_enabled())
return;
@@ -926,8 +928,6 @@
if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
-
- inc_cpu_load(rq, p->se.load.weight);
}
static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -942,8 +942,6 @@
dequeue_rt_entity(rt_se);
dequeue_pushable_task(rq, p);
-
- dec_cpu_load(rq, p->se.load.weight);
}
/*
Index: linux-2.6.31.4-rt14/kernel/sysctl.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sysctl.c 2009-10-16 09:15:15.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sysctl.c 2009-10-16 09:15:36.000000000 -0400
@@ -332,6 +332,14 @@
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_time_avg",
+ .data = &sysctl_sched_time_avg,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "timer_migration",
.data = &sysctl_timer_migration,
.maxlen = sizeof(unsigned int),
--
* [patch -rt 07/17] sched: try to deal with low capacity
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-6.patch --]
[-- Type: text/plain, Size: 2465 bytes --]
When the capacity drops low, we want to migrate load away. Allow the
load-balancer to remove all tasks when we hit rock bottom.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ego@in.ibm.com: fix to update_sd_power_savings_stats]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:36.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:37.000000000 -0400
@@ -3749,7 +3749,7 @@
* capacity but still has some space to pick up some load
* from other group and save more power
*/
- if (sgs->sum_nr_running > sgs->group_capacity - 1)
+ if (sgs->sum_nr_running + 1 > sgs->group_capacity)
return;
if (sgs->sum_nr_running > sds->leader_nr_running ||
@@ -3989,8 +3989,8 @@
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
- sgs->group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
-
+ sgs->group_capacity =
+ DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
}
/**
@@ -4040,7 +4040,7 @@
* and move all the excess tasks away.
*/
if (prefer_sibling)
- sgs.group_capacity = 1;
+ sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4272,6 +4272,26 @@
return NULL;
}
+static struct sched_group *group_of(int cpu)
+{
+ struct sched_domain *sd = rcu_dereference(cpu_rq(cpu)->sd);
+
+ if (!sd)
+ return NULL;
+
+ return sd->groups;
+}
+
+static unsigned long power_of(int cpu)
+{
+ struct sched_group *group = group_of(cpu);
+
+ if (!group)
+ return SCHED_LOAD_SCALE;
+
+ return group->__cpu_power;
+}
+
/*
* find_busiest_queue - find the busiest runqueue among the cpus in group.
*/
@@ -4284,15 +4304,18 @@
int i;
for_each_cpu(i, sched_group_cpus(group)) {
+ unsigned long power = power_of(i);
+ unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
unsigned long wl;
if (!cpumask_test_cpu(i, cpus))
continue;
rq = cpu_rq(i);
- wl = weighted_cpuload(i);
+ wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+ wl /= power;
- if (rq->nr_running == 1 && wl > imbalance)
+ if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;
if (wl > max_load) {
--
* [patch -rt 08/17] sched: remove reciprocal for cpu_power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-7.patch --]
[-- Type: text/plain, Size: 8756 bytes --]
It's a source of failures; also, now that cpu_power is dynamic, it's a
waste of time.
before:
<idle>-0 [000] 132.877936: find_busiest_group: avg_load: 0 group_load: 8241 power: 1
after:
bash-1689 [001] 137.862151: find_busiest_group: avg_load: 10636288 group_load: 10387 power: 1
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[andreas.herrmann3@amd.com: remove include]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 10 +----
kernel/sched.c | 100 +++++++++++++++++---------------------------------
2 files changed, 36 insertions(+), 74 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:37.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:38.000000000 -0400
@@ -137,30 +137,8 @@
*/
#define RUNTIME_INF ((u64)~0ULL)
-#ifdef CONFIG_SMP
-
static void double_rq_lock(struct rq *rq1, struct rq *rq2);
-/*
- * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
- * Since cpu_power is a 'constant', we can use a reciprocal divide.
- */
-static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
-{
- return reciprocal_divide(load, sg->reciprocal_cpu_power);
-}
-
-/*
- * Each time a sched group cpu_power is changed,
- * we must compute its reciprocal value
- */
-static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
-{
- sg->__cpu_power += val;
- sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
-}
-#endif
-
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -2401,8 +2379,7 @@
}
/* Adjust by relative CPU power of the group */
- avg_load = sg_div_cpu_power(group,
- avg_load * SCHED_LOAD_SCALE);
+ avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
if (local_group) {
this_load = avg_load;
@@ -3849,7 +3826,6 @@
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
- unsigned long old = sdg->__cpu_power;
/* here we could scale based on cpufreq */
@@ -3864,33 +3840,26 @@
if (!power)
power = 1;
- if (power != old) {
- sdg->__cpu_power = power;
- sdg->reciprocal_cpu_power = reciprocal_value(power);
- }
+ sdg->cpu_power = power;
}
static void update_group_power(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long power = sdg->__cpu_power;
if (!child) {
update_cpu_power(sd, cpu);
return;
}
- sdg->__cpu_power = 0;
+ sdg->cpu_power = 0;
group = child->groups;
do {
- sdg->__cpu_power += group->__cpu_power;
+ sdg->cpu_power += group->cpu_power;
group = group->next;
} while (group != child->groups);
-
- if (power != sdg->__cpu_power)
- sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
}
/**
@@ -3970,8 +3939,7 @@
}
/* Adjust by relative CPU power of the group */
- sgs->avg_load = sg_div_cpu_power(group,
- sgs->group_load * SCHED_LOAD_SCALE);
+ sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
/*
@@ -3983,14 +3951,14 @@
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = sg_div_cpu_power(group,
- sum_avg_load_per_task * SCHED_LOAD_SCALE);
+ avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
+ group->cpu_power;
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
sgs->group_capacity =
- DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
+ DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
}
/**
@@ -4032,7 +4000,7 @@
return;
sds->total_load += sgs.group_load;
- sds->total_pwr += group->__cpu_power;
+ sds->total_pwr += group->cpu_power;
/*
* In case the child domain prefers tasks go to siblings
@@ -4097,28 +4065,28 @@
* moving them.
*/
- pwr_now += sds->busiest->__cpu_power *
+ pwr_now += sds->busiest->cpu_power *
min(sds->busiest_load_per_task, sds->max_load);
- pwr_now += sds->this->__cpu_power *
+ pwr_now += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load);
pwr_now /= SCHED_LOAD_SCALE;
/* Amount of load we'd subtract */
- tmp = sg_div_cpu_power(sds->busiest,
- sds->busiest_load_per_task * SCHED_LOAD_SCALE);
+ tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ sds->busiest->cpu_power;
if (sds->max_load > tmp)
- pwr_move += sds->busiest->__cpu_power *
+ pwr_move += sds->busiest->cpu_power *
min(sds->busiest_load_per_task, sds->max_load - tmp);
/* Amount of load we'd add */
- if (sds->max_load * sds->busiest->__cpu_power <
+ if (sds->max_load * sds->busiest->cpu_power <
sds->busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = sg_div_cpu_power(sds->this,
- sds->max_load * sds->busiest->__cpu_power);
+ tmp = (sds->max_load * sds->busiest->cpu_power) /
+ sds->this->cpu_power;
else
- tmp = sg_div_cpu_power(sds->this,
- sds->busiest_load_per_task * SCHED_LOAD_SCALE);
- pwr_move += sds->this->__cpu_power *
+ tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ sds->this->cpu_power;
+ pwr_move += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load + tmp);
pwr_move /= SCHED_LOAD_SCALE;
@@ -4153,8 +4121,8 @@
sds->max_load - sds->busiest_load_per_task);
/* How much load to actually move to equalise the imbalance */
- *imbalance = min(max_pull * sds->busiest->__cpu_power,
- (sds->avg_load - sds->this_load) * sds->this->__cpu_power)
+ *imbalance = min(max_pull * sds->busiest->cpu_power,
+ (sds->avg_load - sds->this_load) * sds->this->cpu_power)
/ SCHED_LOAD_SCALE;
/*
@@ -4289,7 +4257,7 @@
if (!group)
return SCHED_LOAD_SCALE;
- return group->__cpu_power;
+ return group->cpu_power;
}
/*
@@ -8226,7 +8194,7 @@
break;
}
- if (!group->__cpu_power) {
+ if (!group->cpu_power) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: domain->cpu_power not "
"set\n");
@@ -8250,9 +8218,9 @@
cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
printk(KERN_CONT " %s", str);
- if (group->__cpu_power != SCHED_LOAD_SCALE) {
- printk(KERN_CONT " (__cpu_power = %d)",
- group->__cpu_power);
+ if (group->cpu_power != SCHED_LOAD_SCALE) {
+ printk(KERN_CONT " (cpu_power = %d)",
+ group->cpu_power);
}
group = group->next;
@@ -8537,7 +8505,7 @@
continue;
cpumask_clear(sched_group_cpus(sg));
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
for_each_cpu(j, span) {
if (group_fn(j, cpu_map, NULL, tmpmask) != group)
@@ -8762,7 +8730,7 @@
continue;
}
- sg_inc_cpu_power(sg, sd->groups->__cpu_power);
+ sg->cpu_power += sd->groups->cpu_power;
}
sg = sg->next;
} while (sg != group_head);
@@ -8835,7 +8803,7 @@
child = sd->child;
- sd->groups->__cpu_power = 0;
+ sd->groups->cpu_power = 0;
if (!child) {
power = SCHED_LOAD_SCALE;
@@ -8851,7 +8819,7 @@
power /= weight;
power >>= SCHED_LOAD_SHIFT;
}
- sg_inc_cpu_power(sd->groups, power);
+ sd->groups->cpu_power += power;
return;
}
@@ -8860,7 +8828,7 @@
*/
group = child->groups;
do {
- sg_inc_cpu_power(sd->groups, group->__cpu_power);
+ sd->groups->cpu_power += group->cpu_power;
group = group->next;
} while (group != child->groups);
}
@@ -9133,7 +9101,7 @@
sd = &per_cpu(node_domains, j).sd;
sd->groups = sg;
}
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
cpumask_copy(sched_group_cpus(sg), nodemask);
sg->next = sg;
cpumask_or(covered, covered, nodemask);
@@ -9160,7 +9128,7 @@
"Can not alloc domain group for node %d\n", j);
goto error;
}
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
cpumask_copy(sched_group_cpus(sg), tmpmask);
sg->next = prev->next;
cpumask_or(covered, covered, tmpmask);
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:36.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:38.000000000 -0400
@@ -905,15 +905,9 @@
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
- * single CPU. This is read only (except for setup, hotplug CPU).
- * Note : Never change cpu_power without recompute its reciprocal
+ * single CPU.
*/
- unsigned int __cpu_power;
- /*
- * reciprocal value of cpu_power to avoid expensive divides
- * (see include/linux/reciprocal_div.h)
- */
- u32 reciprocal_cpu_power;
+ unsigned int cpu_power;
/*
* The CPUs this group covers.
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch -rt 09/17] x86: move APERF/MPERF into a X86_FEATURE
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (7 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 08/17] sched: remove reciprocal for cpu_power dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 10/17] x86: Add generic aperf/mperf code dino
` (7 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-8.patch --]
[-- Type: text/plain, Size: 2880 bytes --]
Move the APERFMPERF capability into an X86_FEATURE flag so that it can
be used outside of the acpi cpufreq driver.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 9 ++-------
arch/x86/kernel/cpu/intel.c | 6 ++++++
3 files changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6.31.4-rt14/arch/x86/include/asm/cpufeature.h
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/include/asm/cpufeature.h 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/include/asm/cpufeature.h 2009-10-16 09:15:39.000000000 -0400
@@ -95,6 +95,7 @@
#define X86_FEATURE_NONSTOP_TSC (3*32+24) /* TSC does not stop in C states */
#define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */
#define X86_FEATURE_EXTD_APICID (3*32+26) /* has extended APICID (8 bits) */
+#define X86_FEATURE_APERFMPERF (3*32+27) /* APERFMPERF */
/* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
#define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */
Index: linux-2.6.31.4-rt14/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-16 09:15:39.000000000 -0400
@@ -60,7 +60,6 @@
};
#define INTEL_MSR_RANGE (0xffff)
-#define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1)
struct acpi_cpufreq_data {
struct acpi_processor_performance *acpi_data;
@@ -731,12 +730,8 @@
acpi_processor_notify_smm(THIS_MODULE);
/* Check for APERF/MPERF support in hardware */
- if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
- unsigned int ecx;
- ecx = cpuid_ecx(6);
- if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
- acpi_cpufreq_driver.getavg = get_measured_perf;
- }
+ if (cpu_has(c, X86_FEATURE_APERFMPERF))
+ acpi_cpufreq_driver.getavg = get_measured_perf;
dprintk("CPU%u - ACPI performance management activated.\n", cpu);
for (i = 0; i < perf->state_count; i++)
Index: linux-2.6.31.4-rt14/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/kernel/cpu/intel.c 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/kernel/cpu/intel.c 2009-10-16 09:15:39.000000000 -0400
@@ -349,6 +349,12 @@
set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
}
+ if (c->cpuid_level > 6) {
+ unsigned ecx = cpuid_ecx(6);
+ if (ecx & 0x01)
+ set_cpu_cap(c, X86_FEATURE_APERFMPERF);
+ }
+
if (cpu_has_xmm2)
set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
if (cpu_has_ds) {
--
* [patch -rt 10/17] x86: Add generic aperf/mperf code
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (8 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 09/17] x86: move APERF/MPERF into a X86_FEATURE dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 11/17] Provide an arch specific hook for cpufreq based scaling of cpu_power dino
` (6 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-9-new.patch --]
[-- Type: text/plain, Size: 4532 bytes --]
Move some of the aperf/mperf code out of the cpufreq driver
so that other people can enjoy it too.
Index: linux-2.6.31.4-rt14-lb1/arch/x86/include/asm/processor.h
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/include/asm/processor.h 2009-10-21 10:47:17.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/include/asm/processor.h 2009-10-21 10:48:41.000000000 -0400
@@ -27,6 +27,7 @@
#include <linux/cpumask.h>
#include <linux/cache.h>
#include <linux/threads.h>
+#include <linux/math64.h>
#include <linux/init.h>
/*
@@ -1010,4 +1011,33 @@
extern int get_tsc_mode(unsigned long adr);
extern int set_tsc_mode(unsigned int val);
+struct aperfmperf {
+ u64 aperf, mperf;
+};
+
+static inline void get_aperfmperf(struct aperfmperf *am)
+{
+ WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_APERFMPERF));
+
+ rdmsrl(MSR_IA32_APERF, am->aperf);
+ rdmsrl(MSR_IA32_MPERF, am->mperf);
+}
+
+#define APERFMPERF_SHIFT 10
+
+static inline
+unsigned long calc_aperfmperf_ratio(struct aperfmperf *old,
+ struct aperfmperf *new)
+{
+ u64 aperf = new->aperf - old->aperf;
+ u64 mperf = new->mperf - old->mperf;
+ unsigned long ratio = aperf;
+
+ mperf >>= APERFMPERF_SHIFT;
+ if (mperf)
+ ratio = div64_u64(aperf, mperf);
+
+ return ratio;
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-21 10:47:17.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-21 10:48:41.000000000 -0400
@@ -70,11 +70,7 @@
static DEFINE_PER_CPU(struct acpi_cpufreq_data *, drv_data);
-struct acpi_msr_data {
- u64 saved_aperf, saved_mperf;
-};
-
-static DEFINE_PER_CPU(struct acpi_msr_data, msr_data);
+static DEFINE_PER_CPU(struct aperfmperf, old_perf);
DEFINE_TRACE(power_mark);
@@ -243,23 +239,12 @@
return cmd.val;
}
-struct perf_pair {
- union {
- struct {
- u32 lo;
- u32 hi;
- } split;
- u64 whole;
- } aperf, mperf;
-};
-
/* Called via smp_call_function_single(), on the target CPU */
static void read_measured_perf_ctrs(void *_cur)
{
- struct perf_pair *cur = _cur;
+ struct aperfmperf *am = _cur;
- rdmsr(MSR_IA32_APERF, cur->aperf.split.lo, cur->aperf.split.hi);
- rdmsr(MSR_IA32_MPERF, cur->mperf.split.lo, cur->mperf.split.hi);
+ get_aperfmperf(am);
}
/*
@@ -278,63 +263,17 @@
static unsigned int get_measured_perf(struct cpufreq_policy *policy,
unsigned int cpu)
{
- struct perf_pair readin, cur;
- unsigned int perf_percent;
+ struct aperfmperf perf;
+ unsigned long ratio;
unsigned int retval;
- if (smp_call_function_single(cpu, read_measured_perf_ctrs, &readin, 1))
+ if (smp_call_function_single(cpu, read_measured_perf_ctrs, &perf, 1))
return 0;
- cur.aperf.whole = readin.aperf.whole -
- per_cpu(msr_data, cpu).saved_aperf;
- cur.mperf.whole = readin.mperf.whole -
- per_cpu(msr_data, cpu).saved_mperf;
- per_cpu(msr_data, cpu).saved_aperf = readin.aperf.whole;
- per_cpu(msr_data, cpu).saved_mperf = readin.mperf.whole;
-
-#ifdef __i386__
- /*
- * We dont want to do 64 bit divide with 32 bit kernel
- * Get an approximate value. Return failure in case we cannot get
- * an approximate value.
- */
- if (unlikely(cur.aperf.split.hi || cur.mperf.split.hi)) {
- int shift_count;
- u32 h;
-
- h = max_t(u32, cur.aperf.split.hi, cur.mperf.split.hi);
- shift_count = fls(h);
-
- cur.aperf.whole >>= shift_count;
- cur.mperf.whole >>= shift_count;
- }
-
- if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
- int shift_count = 7;
- cur.aperf.split.lo >>= shift_count;
- cur.mperf.split.lo >>= shift_count;
- }
-
- if (cur.aperf.split.lo && cur.mperf.split.lo)
- perf_percent = (cur.aperf.split.lo * 100) / cur.mperf.split.lo;
- else
- perf_percent = 0;
-
-#else
- if (unlikely(((unsigned long)(-1) / 100) < cur.aperf.whole)) {
- int shift_count = 7;
- cur.aperf.whole >>= shift_count;
- cur.mperf.whole >>= shift_count;
- }
-
- if (cur.aperf.whole && cur.mperf.whole)
- perf_percent = (cur.aperf.whole * 100) / cur.mperf.whole;
- else
- perf_percent = 0;
-
-#endif
+ ratio = calc_aperfmperf_ratio(&per_cpu(old_perf, cpu), &perf);
+ per_cpu(old_perf, cpu) = perf;
- retval = (policy->cpuinfo.max_freq * perf_percent) / 100;
+ retval = (policy->cpuinfo.max_freq * ratio) >> APERFMPERF_SHIFT;
return retval;
}
--
* [patch -rt 11/17] Provide an arch specific hook for cpufreq based scaling of cpu_power.
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (9 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 10/17] x86: Add generic aperf/mperf code dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 12/17] x86: sched: provide arch implementations using aperf/mperf dino
` (5 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-10.patch --]
[-- Type: text/plain, Size: 1637 bytes --]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-21 10:48:58.000000000 -0400
@@ -3793,7 +3793,18 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ return SCHED_LOAD_SCALE;
+}
+
+unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long smt_gain = sd->smt_gain;
@@ -3803,6 +3814,11 @@
return smt_gain;
}
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+ return default_scale_smt_power(sd, cpu);
+}
+
unsigned long scale_rt_power(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -3827,7 +3843,8 @@
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
- /* here we could scale based on cpufreq */
+ power *= arch_scale_freq_power(sd, cpu);
+ power >> SCHED_LOAD_SHIFT;
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
power *= arch_scale_smt_power(sd, cpu);
--
* [patch -rt 12/17] x86: sched: provide arch implementations using aperf/mperf
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (10 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 11/17] Provide an arch specific hook for cpufreq based scaling of cpu_power dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 13/17] sched: cleanup wake_idle power saving dino
` (4 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-11.patch --]
[-- Type: text/plain, Size: 3637 bytes --]
APERF/MPERF support for cpu_power.
APERF/MPERF is architecturally defined to be a relative scale of work
capacity per logical cpu; this is assumed to include SMT and Turbo mode.
APERF/MPERF are specified to both reset to 0 when either counter
wraps, which is highly inconvenient, since that gives a blip in the
computed ratio when it happens. The manual specifies writing 0 to the
counters after each read, but that's 1) too expensive, and 2) destroys
the possibility of sharing these counters with other users, so we live
with the blip - the other existing user does too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
arch/x86/kernel/cpu/Makefile | 2 -
arch/x86/kernel/cpu/sched.c | 58 +++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 4 ++
3 files changed, 63 insertions(+), 1 deletion(-)
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/Makefile
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/kernel/cpu/Makefile 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/Makefile 2009-10-21 10:49:00.000000000 -0400
@@ -13,7 +13,7 @@
obj-y := intel_cacheinfo.o addon_cpuid_features.o
obj-y += proc.o capflags.o powerflags.o common.o
-obj-y += vmware.o hypervisor.o
+obj-y += vmware.o hypervisor.o sched.o
obj-$(CONFIG_X86_32) += bugs.o cmpxchg.o
obj-$(CONFIG_X86_64) += bugs_64.o
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/sched.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/sched.c 2009-10-21 10:49:00.000000000 -0400
@@ -0,0 +1,58 @@
+#include <linux/sched.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+static DEFINE_PER_CPU(struct aperfmperf, old_aperfmperf);
+
+static unsigned long scale_aperfmperf(void)
+{
+ struct aperfmperf cur, val, *old = &__get_cpu_var(old_aperfmperf);
+ unsigned long ratio = SCHED_LOAD_SCALE;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ get_aperfmperf(&val);
+ local_irq_restore(flags);
+
+ cur = val;
+ cur.aperf -= old->aperf;
+ cur.mperf -= old->mperf;
+ *old = val;
+
+ cur.mperf >>= SCHED_LOAD_SHIFT;
+ if (cur.mperf)
+ ratio = div_u64(cur.aperf, cur.mperf);
+
+ return ratio;
+}
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ /*
+ * do aperf/mperf on the cpu level because it includes things
+ * like turbo mode, which are relevant to full cores.
+ */
+ if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return scale_aperfmperf();
+
+ /*
+ * maybe have something cpufreq here
+ */
+
+ return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+ /*
+ * aperf/mperf already includes the smt gain
+ */
+ if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return SCHED_LOAD_SCALE;
+
+ return default_scale_smt_power(sd, cpu);
+}
Index: linux-2.6.31.4-rt14-lb1/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/include/linux/sched.h 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/include/linux/sched.h 2009-10-21 10:49:00.000000000 -0400
@@ -1047,6 +1047,10 @@
}
#endif /* !CONFIG_SMP */
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu);
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
+
struct io_context; /* See blkdev.h */
--
* [patch -rt 13/17] sched: cleanup wake_idle power saving
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (11 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 12/17] x86: sched: provide arch implementations using aperf/mperf dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 14/17] sched: cleanup wake_idle dino
` (3 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-12.patch --]
[-- Type: text/plain, Size: 3094 bytes --]
Hopefully a more readable version of the same.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched_fair.c | 58 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 39 insertions(+), 19 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:47:14.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:01.000000000 -0400
@@ -1040,6 +1040,41 @@
se->vruntime = rightmost->vruntime + 1;
}
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+/*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ *
+ * XXX - can generate significant overload on preferred_wakeup_cpu
+ * with plenty of idle cpus, leading to a significant loss in
+ * throughput.
+ *
+ * Returns: < 0 - no placement decision made
+ * >= 0 - place on cpu
+ */
+static int wake_idle_power_save(int cpu, struct task_struct *p)
+{
+ int this_cpu = smp_processor_id();
+ int wakeup_cpu;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (!idle_cpu(cpu) || !idle_cpu(this_cpu))
+ return -1;
+
+ if (!p->mm || (p->flags & PF_KTHREAD))
+ return -1;
+
+ wakeup_cpu = cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+ if (!cpu_isset(wakeup_cpu, p->cpus_allowed))
+ return -1;
+
+ return wakeup_cpu;
+}
+
/*
* wake_idle() will wake a task on an idle cpu if task->cpu is
* not idle and an idle cpu is available. The span of cpus to
@@ -1050,29 +1085,14 @@
*
* Returns the CPU we should wake onto.
*/
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
static int wake_idle(int cpu, struct task_struct *p)
{
struct sched_domain *sd;
int i;
- unsigned int chosen_wakeup_cpu;
- int this_cpu;
-
- /*
- * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
- * are idle and this is not a kernel thread and this task's affinity
- * allows it to be moved to preferred cpu, then just move!
- */
-
- this_cpu = smp_processor_id();
- chosen_wakeup_cpu =
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
- idle_cpu(cpu) && idle_cpu(this_cpu) &&
- p->mm && !(p->flags & PF_KTHREAD) &&
- cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
- return chosen_wakeup_cpu;
+ i = wake_idle_power_save(cpu, p);
+ if (i >= 0)
+ return i;
/*
* If it is idle, then it is the best cpu to run this task.
@@ -1081,7 +1101,7 @@
* Siblings must be also busy(in most cases) as they didn't already
* pickup the extra load from this cpu and hence we need not check
* sibling runqueue info. This will avoid the checks and cache miss
- * penalities associated with that.
+ * penalties associated with that.
*/
if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
return cpu;
--
* [patch -rt 14/17] sched: cleanup wake_idle
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (12 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 13/17] sched: cleanup wake_idle power saving dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 15/17] sched: Add a missing = dino
` (2 subsequent siblings)
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-13.patch --]
[-- Type: text/plain, Size: 2490 bytes --]
A more readable version, with a few differences:
- don't check against the root domain, but instead check
SD_LOAD_BALANCE
- don't re-iterate the cpus already iterated on the previous SD
- use rcu_read_lock() around the sd iteration
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched_fair.c | 45 +++++++++++++++++++++++++--------------------
1 file changed, 25 insertions(+), 20 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:49:01.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:02.000000000 -0400
@@ -1080,14 +1080,13 @@
* not idle and an idle cpu is available. The span of cpus to
* search starts with cpus closest then further out as needed,
* so we always favor a closer, idle cpu.
- * Domains may include CPUs that are not usable for migration,
- * hence we need to mask them out (cpu_active_mask)
*
* Returns the CPU we should wake onto.
*/
static int wake_idle(int cpu, struct task_struct *p)
{
- struct sched_domain *sd;
+ struct rq *task_rq = task_rq(p);
+ struct sched_domain *sd, *child = NULL;
int i;
i = wake_idle_power_save(cpu, p);
@@ -1106,24 +1105,34 @@
if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
return cpu;
- for_each_domain(cpu, sd) {
- if ((sd->flags & SD_WAKE_IDLE)
- || ((sd->flags & SD_WAKE_IDLE_FAR)
- && !task_hot(p, task_rq(p)->clock, sd))) {
- for_each_cpu_and(i, sched_domain_span(sd),
- &p->cpus_allowed) {
- if (cpu_active(i) && idle_cpu(i)) {
- if (i != task_cpu(p)) {
- schedstat_inc(p,
- se.nr_wakeups_idle);
- }
- return i;
- }
- }
- } else {
+ rcu_read_lock();
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ break;
+
+ if (!(sd->flags & SD_WAKE_IDLE) &&
+ (task_hot(p, task_rq->clock, sd) || !(sd->flags & SD_WAKE_IDLE_FAR)))
break;
- }
- }
+
+ for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+ if (child && cpumask_test_cpu(i, sched_domain_span(child)))
+ continue;
+
+ if (!idle_cpu(i))
+ continue;
+
+ if (task_cpu(p) != i)
+ schedstat_inc(p, se.nr_wakeups_idle);
+
+ cpu = i;
+ goto unlock;
+ }
+
+ child = sd;
+ }
+unlock:
+ rcu_read_unlock();
+
return cpu;
}
#else /* !ARCH_HAS_SCHED_WAKE_IDLE*/
--
^ permalink raw reply [flat|nested] 18+ messages in thread
* [patch -rt 15/17] sched: Add a missing =
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (13 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 14/17] sched: cleanup wake_idle dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:37 ` [patch -rt 16/17] sched: Deal with low-load in wake_affine() dino
2009-10-22 12:38 ` [patch -rt 17/17] sched: Fix dynamic power-balancing crash dino
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched_fixes.patch --]
[-- Type: text/plain, Size: 601 bytes --]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:48:58.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-21 10:49:03.000000000 -0400
@@ -3844,7 +3844,7 @@
struct sched_group *sdg = sd->groups;
power *= arch_scale_freq_power(sd, cpu);
- power >> SCHED_LOAD_SHIFT;
+ power >>= SCHED_LOAD_SHIFT;
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
power *= arch_scale_smt_power(sd, cpu);
--
* [patch -rt 16/17] sched: Deal with low-load in wake_affine()
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (14 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 15/17] sched: Add a missing = dino
@ 2009-10-22 12:37 ` dino
2009-10-22 12:38 ` [patch -rt 17/17] sched: Fix dynamic power-balancing crash dino
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: wake_affine_low_load.patch --]
[-- Type: text/plain, Size: 1403 bytes --]
wake_affine() would always fail under low-load situations where
both prev and this were idle, because adding a single task will
always be a significant imbalance, even if there's nothing
around that could balance it.
Deal with this by allowing imbalance when there's nothing you
can do about it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:49:02.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:04.000000000 -0400
@@ -1264,7 +1264,17 @@
tg = task_group(p);
weight = p->se.load.weight;
- balanced = 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
+ /*
+ * In low-load situations, where prev_cpu is idle and this_cpu is idle
+ * due to the sync cause above having dropped tl to 0, we'll always have
+ * an imbalance, but there's really nothing you can do about that, so
+ * that's good too.
+ *
+ * Otherwise check if either cpus are near enough in load to allow this
+ * task to be woken on this_cpu.
+ */
+ balanced = !tl ||
+ 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
/*
--
* [patch -rt 17/17] sched: Fix dynamic power-balancing crash
2009-10-22 12:37 [patch -rt 00/17] [patch -rt] Sched load balance backport dino
` (15 preceding siblings ...)
2009-10-22 12:37 ` [patch -rt 16/17] sched: Deal with low-load in wake_affine() dino
@ 2009-10-22 12:38 ` dino
16 siblings, 0 replies; 18+ messages in thread
From: dino @ 2009-10-22 12:38 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: fix_power_bal_crash.patch --]
[-- Type: text/plain, Size: 1520 bytes --]
This crash:
[ 1774.088275] divide error: 0000 [#1] SMP
[ 1774.100355] CPU 13
[ 1774.102498] Modules linked in:
[ 1774.105631] Pid: 30881, comm: hackbench Not tainted 2.6.31-rc8-tip-01308-g484d664-dirty #1629 X8DTN
[ 1774.114807] RIP: 0010:[<ffffffff81041c38>] [<ffffffff81041c38>]
sched_balance_self+0x19b/0x2d4
Triggers because update_group_power() modifies the sd tree and does
temporary calculations there - not considering that other CPUs
could observe intermediate values, such as the zero initial value.
Calculate it in a temporary variable instead. (we need no memory
barrier as these are all statistical values anyway)
Got the same oops with the backport to -rt
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:49:03.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-22 01:48:41.000000000 -0400
@@ -3864,19 +3864,22 @@
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
+ unsigned long power;
if (!child) {
update_cpu_power(sd, cpu);
return;
}
- sdg->cpu_power = 0;
+ power = 0;
group = child->groups;
do {
- sdg->cpu_power += group->cpu_power;
+ power += group->cpu_power;
group = group->next;
} while (group != child->groups);
+
+ sdg->cpu_power = power;
}
/**
--