* [patch -rt 01/17] sched: restore __cpu_power to a straight sum of power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-1.patch --]
[-- Type: text/plain, Size: 2390 bytes --]
cpu_power is supposed to be a representation of the processing capacity
of the cpu, not a value to randomly tweak in order to affect
placement.
Remove the placement hacks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1 +
include/linux/topology.h | 1 +
kernel/sched.c | 34 ++++++++++++++++++----------------
3 files changed, 20 insertions(+), 16 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 08:56:17.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:18.000000000 -0400
@@ -8670,15 +8670,13 @@
* there are asymmetries in the topology. If there are asymmetries, group
* having more cpu_power will pickup more load compared to the group having
* less cpu_power.
- *
- * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
- * the maximum number of tasks a group can handle in the presence of other idle
- * or lightly loaded groups in the same sched domain.
*/
static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
struct sched_domain *child;
struct sched_group *group;
+ long power;
+ int weight;
WARN_ON(!sd || !sd->groups);
@@ -8689,22 +8687,20 @@
sd->groups->__cpu_power = 0;
- /*
- * For perf policy, if the groups in child domain share resources
- * (for example cores sharing some portions of the cache hierarchy
- * or SMT), then set this domain groups cpu_power such that each group
- * can handle only one task, when there are other idle groups in the
- * same sched domain.
- */
- if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
- (child->flags &
- (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
- sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+ if (!child) {
+ power = SCHED_LOAD_SCALE;
+ weight = cpumask_weight(sched_domain_span(sd));
+ /*
+ * SMT siblings share the power of a single core.
+ */
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+ power /= weight;
+ sg_inc_cpu_power(sd->groups, power);
return;
}
/*
- * add cpu_power of each child group to this groups cpu_power
+ * Add cpu_power of each child group to this groups cpu_power.
*/
group = child->groups;
do {
--
* [patch -rt 02/17] sched: SD_PREFER_SIBLING
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-2.patch --]
[-- Type: text/plain, Size: 4002 bytes --]
Do the placement preference using SD flags instead of cpu_power tweaks:
when a child domain sets SD_PREFER_SIBLING, the parent domain clamps
each of its groups to a capacity of one task, so excess tasks get
spread to sibling groups.
XXX: consider degenerate bits
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 29 +++++++++++++++--------------
kernel/sched.c | 14 +++++++++++++-
2 files changed, 28 insertions(+), 15 deletions(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:18.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:30.000000000 -0400
@@ -843,18 +843,19 @@
#define SCHED_LOAD_SCALE_FUZZ SCHED_LOAD_SCALE
#ifdef CONFIG_SMP
-#define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */
-#define SD_BALANCE_NEWIDLE 2 /* Balance when about to become idle */
-#define SD_BALANCE_EXEC 4 /* Balance on exec */
-#define SD_BALANCE_FORK 8 /* Balance on fork, clone */
-#define SD_WAKE_IDLE 16 /* Wake to idle CPU on task wakeup */
-#define SD_WAKE_AFFINE 32 /* Wake task to waking CPU */
-#define SD_WAKE_BALANCE 64 /* Perform balancing at task wakeup */
-#define SD_SHARE_CPUPOWER 128 /* Domain members share cpu power */
-#define SD_POWERSAVINGS_BALANCE 256 /* Balance for power savings */
-#define SD_SHARE_PKG_RESOURCES 512 /* Domain members share cpu pkg resources */
-#define SD_SERIALIZE 1024 /* Only a single load balancing instance */
-#define SD_WAKE_IDLE_FAR 2048 /* Gain latency sacrificing cache hit */
+#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
+#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
+#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
+#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
+#define SD_WAKE_IDLE 0x0010 /* Wake to idle CPU on task wakeup */
+#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
+#define SD_WAKE_BALANCE 0x0040 /* Perform balancing at task wakeup */
+#define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
+#define SD_POWERSAVINGS_BALANCE 0x0100 /* Balance for power savings */
+#define SD_SHARE_PKG_RESOURCES 0x0200 /* Domain members share cpu pkg resources */
+#define SD_SERIALIZE 0x0400 /* Only a single load balancing instance */
+#define SD_WAKE_IDLE_FAR 0x0800 /* Gain latency sacrificing cache hit */
+#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
enum powersavings_balance_level {
POWERSAVINGS_BALANCE_NONE = 0, /* No power saving load balance */
@@ -874,7 +875,7 @@
if (sched_smt_power_savings)
return SD_POWERSAVINGS_BALANCE;
- return 0;
+ return SD_PREFER_SIBLING;
}
static inline int sd_balance_for_package_power(void)
@@ -882,7 +883,7 @@
if (sched_mc_power_savings | sched_smt_power_savings)
return SD_POWERSAVINGS_BALANCE;
- return 0;
+ return SD_PREFER_SIBLING;
}
/*
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:18.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:30.000000000 -0400
@@ -3892,9 +3892,13 @@
const struct cpumask *cpus, int *balance,
struct sd_lb_stats *sds)
{
+ struct sched_domain *child = sd->child;
struct sched_group *group = sd->groups;
struct sg_lb_stats sgs;
- int load_idx;
+ int load_idx, prefer_sibling = 0;
+
+ if (child && child->flags & SD_PREFER_SIBLING)
+ prefer_sibling = 1;
init_sd_power_savings_stats(sd, sds, idle);
load_idx = get_sd_load_idx(sd, idle);
@@ -3914,6 +3918,14 @@
sds->total_load += sgs.group_load;
sds->total_pwr += group->__cpu_power;
+ /*
+ * In case the child domain prefers tasks go to siblings
+ * first, lower the group capacity to one so that we'll try
+ * and move all the excess tasks away.
+ */
+ if (prefer_sibling)
+ sgs.group_capacity = 1;
+
if (local_group) {
sds->this_load = sgs.avg_load;
sds->this = group;
--
* [patch -rt 03/17] sched: update the cpu_power sum during load-balance
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-3.patch --]
[-- Type: text/plain, Size: 2625 bytes --]
In order to prepare for a more dynamic cpu_power, update the group sum
while walking the sched domains during load-balance.
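As an illustration of the sum being maintained (numbers invented: a
2-socket box, 2 cores per socket, no SMT, every cpu-level group left at
the default SCHED_LOAD_SCALE of 1024), the bottom-up propagation done
by update_sched_power() looks like:

  cpu-level groups:      __cpu_power = 1024 each
  package-level group:   __cpu_power = 1024 + 1024 = 2048
  node-level group:      __cpu_power = 2048 + 2048 = 4096

The sums are refreshed by the first cpu of the local group each time
the balance path walks up the domain tree; which levels actually exist
depends on the config.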
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 33 +++++++++++++++++++++++++++++----
1 file changed, 29 insertions(+), 4 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:30.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:32.000000000 -0400
@@ -3780,6 +3780,28 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
+static void update_sched_power(struct sched_domain *sd)
+{
+ struct sched_domain *child = sd->child;
+ struct sched_group *group, *sdg = sd->groups;
+ unsigned long power = sdg->__cpu_power;
+
+ if (!child) {
+ /* compute cpu power for this cpu */
+ return;
+ }
+
+ sdg->__cpu_power = 0;
+
+ group = child->groups;
+ do {
+ sdg->__cpu_power += group->__cpu_power;
+ group = group->next;
+ } while (group != child->groups);
+
+ if (power != sdg->__cpu_power)
+ sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
+}
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
@@ -3793,7 +3815,8 @@
* @balance: Should we balance.
* @sgs: variable to hold the statistics for this group.
*/
-static inline void update_sg_lb_stats(struct sched_group *group, int this_cpu,
+static inline void update_sg_lb_stats(struct sched_domain *sd,
+ struct sched_group *group, int this_cpu,
enum cpu_idle_type idle, int load_idx, int *sd_idle,
int local_group, const struct cpumask *cpus,
int *balance, struct sg_lb_stats *sgs)
@@ -3804,8 +3827,11 @@
unsigned long sum_avg_load_per_task;
unsigned long avg_load_per_task;
- if (local_group)
+ if (local_group) {
balance_cpu = group_first_cpu(group);
+ if (balance_cpu == this_cpu)
+ update_sched_power(sd);
+ }
/* Tally up the load of all CPUs in the group */
sum_avg_load_per_task = avg_load_per_task = 0;
@@ -3909,7 +3935,7 @@
local_group = cpumask_test_cpu(this_cpu,
sched_group_cpus(group));
memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(group, this_cpu, idle, load_idx, sd_idle,
+ update_sg_lb_stats(sd, group, this_cpu, idle, load_idx, sd_idle,
local_group, cpus, balance, &sgs);
if (local_group && balance && !(*balance))
@@ -3944,7 +3970,6 @@
update_sd_power_savings_stats(group, sds, local_group, &sgs);
group = group->next;
} while (group != sd->groups);
-
}
/**
--
* [patch -rt 04/17] sched: add smt_gain
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-1b.patch --]
[-- Type: text/plain, Size: 2175 bytes --]
The idea is that multi-threading a core yields more work capacity than
a single thread; provide a way to express a static gain for threads.
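A quick worked example of the SMT branch below (assuming the defaults
in this patch: SCHED_LOAD_SCALE = 1024, SCHED_LOAD_SHIFT = 10,
smt_gain = 1178, a 2-thread core):

  power = 1024 * 1178 / 2 >> 10 = 589   per sibling

so the two siblings together contribute 1178, i.e. the core is rated
roughly 15% above a single thread.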
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1 +
include/linux/topology.h | 1 +
kernel/sched.c | 8 +++++++-
3 files changed, 9 insertions(+), 1 deletion(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:30.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:34.000000000 -0400
@@ -966,6 +966,7 @@
unsigned int newidle_idx;
unsigned int wake_idx;
unsigned int forkexec_idx;
+ unsigned int smt_gain;
int flags; /* See SD_* */
enum sched_domain_level level;
Index: linux-2.6.31.4-rt14/include/linux/topology.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/topology.h 2009-10-16 09:15:16.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/topology.h 2009-10-16 09:15:34.000000000 -0400
@@ -99,6 +99,7 @@
| SD_SHARE_CPUPOWER, \
.last_balance = jiffies, \
.balance_interval = 1, \
+ .smt_gain = 1178, /* 15% */ \
}
#endif
#endif /* CONFIG_SCHED_SMT */
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:32.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:34.000000000 -0400
@@ -8729,9 +8729,15 @@
weight = cpumask_weight(sched_domain_span(sd));
/*
* SMT siblings share the power of a single core.
+ * Usually multiple threads get a better yield out of
+ * that one core than a single thread would have,
+ * reflect that in sd->smt_gain.
*/
- if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+ power *= sd->smt_gain;
power /= weight;
+ power >>= SCHED_LOAD_SHIFT;
+ }
sg_inc_cpu_power(sd->groups, power);
return;
}
--
* [patch -rt 05/17] sched: dynamic cpu_power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-4.patch --]
[-- Type: text/plain, Size: 2053 bytes --]
Recompute the cpu_power for each cpu during load-balance.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:34.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:35.000000000 -0400
@@ -3780,14 +3780,46 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-static void update_sched_power(struct sched_domain *sd)
+unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+{
+ unsigned long weight = cpumask_weight(sched_domain_span(sd));
+ unsigned long smt_gain = sd->smt_gain;
+
+ smt_gain /= weight;
+
+ return smt_gain;
+}
+
+static void update_cpu_power(struct sched_domain *sd, int cpu)
+{
+ unsigned long weight = cpumask_weight(sched_domain_span(sd));
+ unsigned long power = SCHED_LOAD_SCALE;
+ struct sched_group *sdg = sd->groups;
+ unsigned long old = sdg->__cpu_power;
+
+ /* here we could scale based on cpufreq */
+
+ if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+ power *= arch_smt_gain(sd, cpu);
+ power >>= SCHED_LOAD_SHIFT;
+ }
+
+ /* here we could scale based on RT time */
+
+ if (power != old) {
+ sdg->__cpu_power = power;
+ sdg->reciprocal_cpu_power = reciprocal_value(power);
+ }
+}
+
+static void update_group_power(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
unsigned long power = sdg->__cpu_power;
if (!child) {
- /* compute cpu power for this cpu */
+ update_cpu_power(sd, cpu);
return;
}
@@ -3830,7 +3862,7 @@
if (local_group) {
balance_cpu = group_first_cpu(group);
if (balance_cpu == this_cpu)
- update_sched_power(sd);
+ update_group_power(sd, this_cpu);
}
/* Tally up the load of all CPUs in the group */
--
* [patch -rt 06/17] sched: scale down cpu_power due to RT tasks
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-5.patch --]
[-- Type: text/plain, Size: 5365 bytes --]
Keep an average of the amount of time spent on RT tasks and use that
fraction to scale down the cpu_power available to regular tasks.
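A rough worked example of the scaling this introduces (numbers are
illustrative; sysctl_sched_time_avg at its 1s default gives a
half-period of 500ms):

  total     = 500ms + 250ms into the current period = 750,000,000 ns
  rt_avg    = 250,000,000 ns of RT execution in that window
  available = 500,000,000 ns
  scale_rt_power() = 500,000,000 / (750,000,000 >> 10) ~= 682

  cpu_power = 1024 * 682 >> SCHED_LOAD_SHIFT = 682

i.e. roughly two thirds of the cpu is left for SCHED_OTHER load.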
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 1
kernel/sched.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_rt.c | 6 +---
kernel/sysctl.c | 8 ++++++
4 files changed, 72 insertions(+), 7 deletions(-)
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:34.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:36.000000000 -0400
@@ -1915,6 +1915,7 @@
extern unsigned int sysctl_sched_features;
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
+extern unsigned int sysctl_sched_time_avg;
extern unsigned int sysctl_timer_migration;
int sched_nr_latency_handler(struct ctl_table *table, int write,
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:35.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:36.000000000 -0400
@@ -673,6 +673,9 @@
struct task_struct *migration_thread;
struct list_head migration_queue;
+
+ u64 rt_avg;
+ u64 age_stamp;
#endif
/* calc_load related fields */
@@ -927,6 +930,14 @@
unsigned int sysctl_sched_shares_thresh = 4;
/*
+ * period over which we average the RT time consumption, measured
+ * in ms.
+ *
+ * default: 1s
+ */
+const_debug unsigned int sysctl_sched_time_avg = MSEC_PER_SEC;
+
+/*
* period over which we measure -rt task cpu usage in us.
* default: 1s
*/
@@ -1370,12 +1381,37 @@
}
#endif /* CONFIG_NO_HZ */
+static u64 sched_avg_period(void)
+{
+ return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
+}
+
+static void sched_avg_update(struct rq *rq)
+{
+ s64 period = sched_avg_period();
+
+ while ((s64)(rq->clock - rq->age_stamp) > period) {
+ rq->age_stamp += period;
+ rq->rt_avg /= 2;
+ }
+}
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+ rq->rt_avg += rt_delta;
+ sched_avg_update(rq);
+}
+
#else /* !CONFIG_SMP */
static void resched_task(struct task_struct *p)
{
assert_atomic_spin_locked(&task_rq(p)->lock);
set_tsk_need_resched(p);
}
+
+static void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
+{
+}
#endif /* CONFIG_SMP */
#if BITS_PER_LONG == 32
@@ -3780,7 +3816,7 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-unsigned long __weak arch_smt_gain(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long smt_gain = sd->smt_gain;
@@ -3790,6 +3826,24 @@
return smt_gain;
}
+unsigned long scale_rt_power(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ u64 total, available;
+
+ sched_avg_update(rq);
+
+ total = sched_avg_period() + (rq->clock - rq->age_stamp);
+ available = total - rq->rt_avg;
+
+ if (unlikely((s64)total < SCHED_LOAD_SCALE))
+ total = SCHED_LOAD_SCALE;
+
+ total >>= SCHED_LOAD_SHIFT;
+
+ return div_u64(available, total);
+}
+
static void update_cpu_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
@@ -3800,11 +3854,15 @@
/* here we could scale based on cpufreq */
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
- power *= arch_smt_gain(sd, cpu);
+ power *= arch_scale_smt_power(sd, cpu);
power >>= SCHED_LOAD_SHIFT;
}
- /* here we could scale based on RT time */
+ power *= scale_rt_power(cpu);
+ power >>= SCHED_LOAD_SHIFT;
+
+ if (!power)
+ power = 1;
if (power != old) {
sdg->__cpu_power = power;
Index: linux-2.6.31.4-rt14/kernel/sched_rt.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched_rt.c 2009-10-16 09:15:15.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched_rt.c 2009-10-16 09:15:36.000000000 -0400
@@ -602,6 +602,8 @@
curr->se.exec_start = rq->clock;
cpuacct_charge(curr, delta_exec);
+ sched_rt_avg_update(rq, delta_exec);
+
if (!rt_bandwidth_enabled())
return;
@@ -926,8 +928,6 @@
if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
enqueue_pushable_task(rq, p);
-
- inc_cpu_load(rq, p->se.load.weight);
}
static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
@@ -942,8 +942,6 @@
dequeue_rt_entity(rt_se);
dequeue_pushable_task(rq, p);
-
- dec_cpu_load(rq, p->se.load.weight);
}
/*
Index: linux-2.6.31.4-rt14/kernel/sysctl.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sysctl.c 2009-10-16 09:15:15.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sysctl.c 2009-10-16 09:15:36.000000000 -0400
@@ -332,6 +332,14 @@
},
{
.ctl_name = CTL_UNNUMBERED,
+ .procname = "sched_time_avg",
+ .data = &sysctl_sched_time_avg,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
.procname = "timer_migration",
.data = &sysctl_timer_migration,
.maxlen = sizeof(unsigned int),
--
* [patch -rt 07/17] sched: try to deal with low capacity
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-6.patch --]
[-- Type: text/plain, Size: 2465 bytes --]
When the capacity drops low, we want to migrate load away. Allow the
load-balancer to remove all tasks when we hit rock bottom.
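To sketch the effect of the rounding change below (SCHED_LOAD_SCALE =
1024 assumed):

  __cpu_power = 589  ->  group_capacity = DIV_ROUND_CLOSEST(589, 1024) = 1
  __cpu_power = 400  ->  group_capacity = DIV_ROUND_CLOSEST(400, 1024) = 0

With a capacity of 0 the 'if (capacity && rq->nr_running == 1 && ...)'
check in find_busiest_queue() no longer skips the queue, so even a
single running task can be migrated off a cpu whose power has been
scaled away.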
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[ego@in.ibm.com: fix to update_sd_power_savings_stats]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:36.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:37.000000000 -0400
@@ -3749,7 +3749,7 @@
* capacity but still has some space to pick up some load
* from other group and save more power
*/
- if (sgs->sum_nr_running > sgs->group_capacity - 1)
+ if (sgs->sum_nr_running + 1 > sgs->group_capacity)
return;
if (sgs->sum_nr_running > sds->leader_nr_running ||
@@ -3989,8 +3989,8 @@
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
- sgs->group_capacity = group->__cpu_power / SCHED_LOAD_SCALE;
-
+ sgs->group_capacity =
+ DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
}
/**
@@ -4040,7 +4040,7 @@
* and move all the excess tasks away.
*/
if (prefer_sibling)
- sgs.group_capacity = 1;
+ sgs.group_capacity = min(sgs.group_capacity, 1UL);
if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4272,6 +4272,26 @@
return NULL;
}
+static struct sched_group *group_of(int cpu)
+{
+ struct sched_domain *sd = rcu_dereference(cpu_rq(cpu)->sd);
+
+ if (!sd)
+ return NULL;
+
+ return sd->groups;
+}
+
+static unsigned long power_of(int cpu)
+{
+ struct sched_group *group = group_of(cpu);
+
+ if (!group)
+ return SCHED_LOAD_SCALE;
+
+ return group->__cpu_power;
+}
+
/*
* find_busiest_queue - find the busiest runqueue among the cpus in group.
*/
@@ -4284,15 +4304,18 @@
int i;
for_each_cpu(i, sched_group_cpus(group)) {
+ unsigned long power = power_of(i);
+ unsigned long capacity = DIV_ROUND_CLOSEST(power, SCHED_LOAD_SCALE);
unsigned long wl;
if (!cpumask_test_cpu(i, cpus))
continue;
rq = cpu_rq(i);
- wl = weighted_cpuload(i);
+ wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
+ wl /= power;
- if (rq->nr_running == 1 && wl > imbalance)
+ if (capacity && rq->nr_running == 1 && wl > imbalance)
continue;
if (wl > max_load) {
--
* [patch -rt 08/17] sched: remove reciprocal for cpu_power
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-7.patch --]
[-- Type: text/plain, Size: 8756 bytes --]
It's a source of failures; also, now that cpu_power is dynamic, it's a
waste of time.
before:
<idle>-0 [000] 132.877936: find_busiest_group: avg_load: 0 group_load: 8241 power: 1
after:
bash-1689 [001] 137.862151: find_busiest_group: avg_load: 10636288 group_load: 10387 power: 1
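For context, a rough sketch of what the removed helpers did (see
include/linux/reciprocal_div.h; details from memory):

  reciprocal_value(k)      /* precomputes ~2^32 / k, costs one divide */
  reciprocal_divide(a, r)  /* ((u64)a * r) >> 32, i.e. ~a / k         */

The precomputation only pays off while k stays constant; once cpu_power
is refreshed on every balance pass, each update needs a fresh
reciprocal_value() - itself a divide - so a straight divide is no
slower and one less thing to keep in sync.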
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
[andreas.herrmann3@amd.com: remove include]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
include/linux/sched.h | 10 +----
kernel/sched.c | 100 +++++++++++++++++---------------------------------
2 files changed, 36 insertions(+), 74 deletions(-)
Index: linux-2.6.31.4-rt14/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14.orig/kernel/sched.c 2009-10-16 09:15:37.000000000 -0400
+++ linux-2.6.31.4-rt14/kernel/sched.c 2009-10-16 09:15:38.000000000 -0400
@@ -137,30 +137,8 @@
*/
#define RUNTIME_INF ((u64)~0ULL)
-#ifdef CONFIG_SMP
-
static void double_rq_lock(struct rq *rq1, struct rq *rq2);
-/*
- * Divide a load by a sched group cpu_power : (load / sg->__cpu_power)
- * Since cpu_power is a 'constant', we can use a reciprocal divide.
- */
-static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load)
-{
- return reciprocal_divide(load, sg->reciprocal_cpu_power);
-}
-
-/*
- * Each time a sched group cpu_power is changed,
- * we must compute its reciprocal value
- */
-static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val)
-{
- sg->__cpu_power += val;
- sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power);
-}
-#endif
-
#define TASK_PREEMPTS_CURR(p, rq) \
((p)->prio < (rq)->curr->prio)
@@ -2401,8 +2379,7 @@
}
/* Adjust by relative CPU power of the group */
- avg_load = sg_div_cpu_power(group,
- avg_load * SCHED_LOAD_SCALE);
+ avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;
if (local_group) {
this_load = avg_load;
@@ -3849,7 +3826,6 @@
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
- unsigned long old = sdg->__cpu_power;
/* here we could scale based on cpufreq */
@@ -3864,33 +3840,26 @@
if (!power)
power = 1;
- if (power != old) {
- sdg->__cpu_power = power;
- sdg->reciprocal_cpu_power = reciprocal_value(power);
- }
+ sdg->cpu_power = power;
}
static void update_group_power(struct sched_domain *sd, int cpu)
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
- unsigned long power = sdg->__cpu_power;
if (!child) {
update_cpu_power(sd, cpu);
return;
}
- sdg->__cpu_power = 0;
+ sdg->cpu_power = 0;
group = child->groups;
do {
- sdg->__cpu_power += group->__cpu_power;
+ sdg->cpu_power += group->cpu_power;
group = group->next;
} while (group != child->groups);
-
- if (power != sdg->__cpu_power)
- sdg->reciprocal_cpu_power = reciprocal_value(sdg->__cpu_power);
}
/**
@@ -3970,8 +3939,7 @@
}
/* Adjust by relative CPU power of the group */
- sgs->avg_load = sg_div_cpu_power(group,
- sgs->group_load * SCHED_LOAD_SCALE);
+ sgs->avg_load = (sgs->group_load * SCHED_LOAD_SCALE) / group->cpu_power;
/*
@@ -3983,14 +3951,14 @@
* normalized nr_running number somewhere that negates
* the hierarchy?
*/
- avg_load_per_task = sg_div_cpu_power(group,
- sum_avg_load_per_task * SCHED_LOAD_SCALE);
+ avg_load_per_task = (sum_avg_load_per_task * SCHED_LOAD_SCALE) /
+ group->cpu_power;
if ((max_cpu_load - min_cpu_load) > 2*avg_load_per_task)
sgs->group_imb = 1;
sgs->group_capacity =
- DIV_ROUND_CLOSEST(group->__cpu_power, SCHED_LOAD_SCALE);
+ DIV_ROUND_CLOSEST(group->cpu_power, SCHED_LOAD_SCALE);
}
/**
@@ -4032,7 +4000,7 @@
return;
sds->total_load += sgs.group_load;
- sds->total_pwr += group->__cpu_power;
+ sds->total_pwr += group->cpu_power;
/*
* In case the child domain prefers tasks go to siblings
@@ -4097,28 +4065,28 @@
* moving them.
*/
- pwr_now += sds->busiest->__cpu_power *
+ pwr_now += sds->busiest->cpu_power *
min(sds->busiest_load_per_task, sds->max_load);
- pwr_now += sds->this->__cpu_power *
+ pwr_now += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load);
pwr_now /= SCHED_LOAD_SCALE;
/* Amount of load we'd subtract */
- tmp = sg_div_cpu_power(sds->busiest,
- sds->busiest_load_per_task * SCHED_LOAD_SCALE);
+ tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ sds->busiest->cpu_power;
if (sds->max_load > tmp)
- pwr_move += sds->busiest->__cpu_power *
+ pwr_move += sds->busiest->cpu_power *
min(sds->busiest_load_per_task, sds->max_load - tmp);
/* Amount of load we'd add */
- if (sds->max_load * sds->busiest->__cpu_power <
+ if (sds->max_load * sds->busiest->cpu_power <
sds->busiest_load_per_task * SCHED_LOAD_SCALE)
- tmp = sg_div_cpu_power(sds->this,
- sds->max_load * sds->busiest->__cpu_power);
+ tmp = (sds->max_load * sds->busiest->cpu_power) /
+ sds->this->cpu_power;
else
- tmp = sg_div_cpu_power(sds->this,
- sds->busiest_load_per_task * SCHED_LOAD_SCALE);
- pwr_move += sds->this->__cpu_power *
+ tmp = (sds->busiest_load_per_task * SCHED_LOAD_SCALE) /
+ sds->this->cpu_power;
+ pwr_move += sds->this->cpu_power *
min(sds->this_load_per_task, sds->this_load + tmp);
pwr_move /= SCHED_LOAD_SCALE;
@@ -4153,8 +4121,8 @@
sds->max_load - sds->busiest_load_per_task);
/* How much load to actually move to equalise the imbalance */
- *imbalance = min(max_pull * sds->busiest->__cpu_power,
- (sds->avg_load - sds->this_load) * sds->this->__cpu_power)
+ *imbalance = min(max_pull * sds->busiest->cpu_power,
+ (sds->avg_load - sds->this_load) * sds->this->cpu_power)
/ SCHED_LOAD_SCALE;
/*
@@ -4289,7 +4257,7 @@
if (!group)
return SCHED_LOAD_SCALE;
- return group->__cpu_power;
+ return group->cpu_power;
}
/*
@@ -8226,7 +8194,7 @@
break;
}
- if (!group->__cpu_power) {
+ if (!group->cpu_power) {
printk(KERN_CONT "\n");
printk(KERN_ERR "ERROR: domain->cpu_power not "
"set\n");
@@ -8250,9 +8218,9 @@
cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
printk(KERN_CONT " %s", str);
- if (group->__cpu_power != SCHED_LOAD_SCALE) {
- printk(KERN_CONT " (__cpu_power = %d)",
- group->__cpu_power);
+ if (group->cpu_power != SCHED_LOAD_SCALE) {
+ printk(KERN_CONT " (cpu_power = %d)",
+ group->cpu_power);
}
group = group->next;
@@ -8537,7 +8505,7 @@
continue;
cpumask_clear(sched_group_cpus(sg));
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
for_each_cpu(j, span) {
if (group_fn(j, cpu_map, NULL, tmpmask) != group)
@@ -8762,7 +8730,7 @@
continue;
}
- sg_inc_cpu_power(sg, sd->groups->__cpu_power);
+ sg->cpu_power += sd->groups->cpu_power;
}
sg = sg->next;
} while (sg != group_head);
@@ -8835,7 +8803,7 @@
child = sd->child;
- sd->groups->__cpu_power = 0;
+ sd->groups->cpu_power = 0;
if (!child) {
power = SCHED_LOAD_SCALE;
@@ -8851,7 +8819,7 @@
power /= weight;
power >>= SCHED_LOAD_SHIFT;
}
- sg_inc_cpu_power(sd->groups, power);
+ sd->groups->cpu_power += power;
return;
}
@@ -8860,7 +8828,7 @@
*/
group = child->groups;
do {
- sg_inc_cpu_power(sd->groups, group->__cpu_power);
+ sd->groups->cpu_power += group->cpu_power;
group = group->next;
} while (group != child->groups);
}
@@ -9133,7 +9101,7 @@
sd = &per_cpu(node_domains, j).sd;
sd->groups = sg;
}
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
cpumask_copy(sched_group_cpus(sg), nodemask);
sg->next = sg;
cpumask_or(covered, covered, nodemask);
@@ -9160,7 +9128,7 @@
"Can not alloc domain group for node %d\n", j);
goto error;
}
- sg->__cpu_power = 0;
+ sg->cpu_power = 0;
cpumask_copy(sched_group_cpus(sg), tmpmask);
sg->next = prev->next;
cpumask_or(covered, covered, tmpmask);
Index: linux-2.6.31.4-rt14/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14.orig/include/linux/sched.h 2009-10-16 09:15:36.000000000 -0400
+++ linux-2.6.31.4-rt14/include/linux/sched.h 2009-10-16 09:15:38.000000000 -0400
@@ -905,15 +905,9 @@
/*
* CPU power of this group, SCHED_LOAD_SCALE being max power for a
- * single CPU. This is read only (except for setup, hotplug CPU).
- * Note : Never change cpu_power without recompute its reciprocal
+ * single CPU.
*/
- unsigned int __cpu_power;
- /*
- * reciprocal value of cpu_power to avoid expensive divides
- * (see include/linux/reciprocal_div.h)
- */
- u32 reciprocal_cpu_power;
+ unsigned int cpu_power;
/*
* The CPUs this group covers.
--
* [patch -rt 09/17] x86: move APERF/MPERF into a X86_FEATURE
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-8.patch --]
[-- Type: text/plain, Size: 2880 bytes --]
Move the APERF/MPERF capability into an X86_FEATURE flag so that it can
be used outside of the acpi-cpufreq driver.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 9 ++-------
arch/x86/kernel/cpu/intel.c | 6 ++++++
3 files changed, 9 insertions(+), 7 deletions(-)
Index: linux-2.6.31.4-rt14/arch/x86/include/asm/cpufeature.h
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/include/asm/cpufeature.h 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/include/asm/cpufeature.h 2009-10-16 09:15:39.000000000 -0400
@@ -95,6 +95,7 @@
#define X86_FEATURE_NONSTOP_TSC (3*32+24) /* TSC does not stop in C states */
#define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */
#define X86_FEATURE_EXTD_APICID (3*32+26) /* has extended APICID (8 bits) */
+#define X86_FEATURE_APERFMPERF (3*32+27) /* APERFMPERF */
/* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
#define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */
Index: linux-2.6.31.4-rt14/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-16 09:15:39.000000000 -0400
@@ -60,7 +60,6 @@
};
#define INTEL_MSR_RANGE (0xffff)
-#define CPUID_6_ECX_APERFMPERF_CAPABILITY (0x1)
struct acpi_cpufreq_data {
struct acpi_processor_performance *acpi_data;
@@ -731,12 +730,8 @@
acpi_processor_notify_smm(THIS_MODULE);
/* Check for APERF/MPERF support in hardware */
- if (c->x86_vendor == X86_VENDOR_INTEL && c->cpuid_level >= 6) {
- unsigned int ecx;
- ecx = cpuid_ecx(6);
- if (ecx & CPUID_6_ECX_APERFMPERF_CAPABILITY)
- acpi_cpufreq_driver.getavg = get_measured_perf;
- }
+ if (cpu_has(c, X86_FEATURE_APERFMPERF))
+ acpi_cpufreq_driver.getavg = get_measured_perf;
dprintk("CPU%u - ACPI performance management activated.\n", cpu);
for (i = 0; i < perf->state_count; i++)
Index: linux-2.6.31.4-rt14/arch/x86/kernel/cpu/intel.c
===================================================================
--- linux-2.6.31.4-rt14.orig/arch/x86/kernel/cpu/intel.c 2009-10-12 16:15:40.000000000 -0400
+++ linux-2.6.31.4-rt14/arch/x86/kernel/cpu/intel.c 2009-10-16 09:15:39.000000000 -0400
@@ -349,6 +349,12 @@
set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
}
+ if (c->cpuid_level > 6) {
+ unsigned ecx = cpuid_ecx(6);
+ if (ecx & 0x01)
+ set_cpu_cap(c, X86_FEATURE_APERFMPERF);
+ }
+
if (cpu_has_xmm2)
set_cpu_cap(c, X86_FEATURE_LFENCE_RDTSC);
if (cpu_has_ds) {
--
* [patch -rt 10/17] x86: Add generic aperf/mperf code
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-9-new.patch --]
[-- Type: text/plain, Size: 4532 bytes --]
Move some of the aperf/mperf code out of the acpi-cpufreq driver so
that other users can enjoy it too.
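A rough sketch of what the new calc_aperfmperf_ratio() helper computes
(MPERF counts at a constant reference rate, APERF at the rate the cpu
actually ran; the deltas below are made up):

  aperf delta = 1,000,000   mperf delta = 2,000,000   /* ran at half speed */

  ratio = 1,000,000 / (2,000,000 >> APERFMPERF_SHIFT)
        = 1,000,000 / 1953
        ~= 512                                        /* ~0.5 * 1024 */

so the ratio is a fixed-point (shift 10) measure of delivered versus
reference performance over the interval.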
Index: linux-2.6.31.4-rt14-lb1/arch/x86/include/asm/processor.h
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/include/asm/processor.h 2009-10-21 10:47:17.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/include/asm/processor.h 2009-10-21 10:48:41.000000000 -0400
@@ -27,6 +27,7 @@
#include <linux/cpumask.h>
#include <linux/cache.h>
#include <linux/threads.h>
+#include <linux/math64.h>
#include <linux/init.h>
/*
@@ -1010,4 +1011,33 @@
extern int get_tsc_mode(unsigned long adr);
extern int set_tsc_mode(unsigned int val);
+struct aperfmperf {
+ u64 aperf, mperf;
+};
+
+static inline void get_aperfmperf(struct aperfmperf *am)
+{
+ WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_APERFMPERF));
+
+ rdmsrl(MSR_IA32_APERF, am->aperf);
+ rdmsrl(MSR_IA32_MPERF, am->mperf);
+}
+
+#define APERFMPERF_SHIFT 10
+
+static inline
+unsigned long calc_aperfmperf_ratio(struct aperfmperf *old,
+ struct aperfmperf *new)
+{
+ u64 aperf = new->aperf - old->aperf;
+ u64 mperf = new->mperf - old->mperf;
+ unsigned long ratio = aperf;
+
+ mperf >>= APERFMPERF_SHIFT;
+ if (mperf)
+ ratio = div64_u64(aperf, mperf);
+
+ return ratio;
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-21 10:47:17.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c 2009-10-21 10:48:41.000000000 -0400
@@ -70,11 +70,7 @@
static DEFINE_PER_CPU(struct acpi_cpufreq_data *, drv_data);
-struct acpi_msr_data {
- u64 saved_aperf, saved_mperf;
-};
-
-static DEFINE_PER_CPU(struct acpi_msr_data, msr_data);
+static DEFINE_PER_CPU(struct aperfmperf, old_perf);
DEFINE_TRACE(power_mark);
@@ -243,23 +239,12 @@
return cmd.val;
}
-struct perf_pair {
- union {
- struct {
- u32 lo;
- u32 hi;
- } split;
- u64 whole;
- } aperf, mperf;
-};
-
/* Called via smp_call_function_single(), on the target CPU */
static void read_measured_perf_ctrs(void *_cur)
{
- struct perf_pair *cur = _cur;
+ struct aperfmperf *am = _cur;
- rdmsr(MSR_IA32_APERF, cur->aperf.split.lo, cur->aperf.split.hi);
- rdmsr(MSR_IA32_MPERF, cur->mperf.split.lo, cur->mperf.split.hi);
+ get_aperfmperf(am);
}
/*
@@ -278,63 +263,17 @@
static unsigned int get_measured_perf(struct cpufreq_policy *policy,
unsigned int cpu)
{
- struct perf_pair readin, cur;
- unsigned int perf_percent;
+ struct aperfmperf perf;
+ unsigned long ratio;
unsigned int retval;
- if (smp_call_function_single(cpu, read_measured_perf_ctrs, &readin, 1))
+ if (smp_call_function_single(cpu, read_measured_perf_ctrs, &perf, 1))
return 0;
- cur.aperf.whole = readin.aperf.whole -
- per_cpu(msr_data, cpu).saved_aperf;
- cur.mperf.whole = readin.mperf.whole -
- per_cpu(msr_data, cpu).saved_mperf;
- per_cpu(msr_data, cpu).saved_aperf = readin.aperf.whole;
- per_cpu(msr_data, cpu).saved_mperf = readin.mperf.whole;
-
-#ifdef __i386__
- /*
- * We dont want to do 64 bit divide with 32 bit kernel
- * Get an approximate value. Return failure in case we cannot get
- * an approximate value.
- */
- if (unlikely(cur.aperf.split.hi || cur.mperf.split.hi)) {
- int shift_count;
- u32 h;
-
- h = max_t(u32, cur.aperf.split.hi, cur.mperf.split.hi);
- shift_count = fls(h);
-
- cur.aperf.whole >>= shift_count;
- cur.mperf.whole >>= shift_count;
- }
-
- if (((unsigned long)(-1) / 100) < cur.aperf.split.lo) {
- int shift_count = 7;
- cur.aperf.split.lo >>= shift_count;
- cur.mperf.split.lo >>= shift_count;
- }
-
- if (cur.aperf.split.lo && cur.mperf.split.lo)
- perf_percent = (cur.aperf.split.lo * 100) / cur.mperf.split.lo;
- else
- perf_percent = 0;
-
-#else
- if (unlikely(((unsigned long)(-1) / 100) < cur.aperf.whole)) {
- int shift_count = 7;
- cur.aperf.whole >>= shift_count;
- cur.mperf.whole >>= shift_count;
- }
-
- if (cur.aperf.whole && cur.mperf.whole)
- perf_percent = (cur.aperf.whole * 100) / cur.mperf.whole;
- else
- perf_percent = 0;
-
-#endif
+ ratio = calc_aperfmperf_ratio(&per_cpu(old_perf, cpu), &perf);
+ per_cpu(old_perf, cpu) = perf;
- retval = (policy->cpuinfo.max_freq * perf_percent) / 100;
+ retval = (policy->cpuinfo.max_freq * ratio) >> APERFMPERF_SHIFT;
return retval;
}
--
* [patch -rt 11/17] Provide an arch specific hook for cpufreq based scaling of cpu_power.
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-10.patch --]
[-- Type: text/plain, Size: 1637 bytes --]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-21 10:48:58.000000000 -0400
@@ -3793,7 +3793,18 @@
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
-unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ return SCHED_LOAD_SCALE;
+}
+
+unsigned long __weak arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu)
{
unsigned long weight = cpumask_weight(sched_domain_span(sd));
unsigned long smt_gain = sd->smt_gain;
@@ -3803,6 +3814,11 @@
return smt_gain;
}
+unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+ return default_scale_smt_power(sd, cpu);
+}
+
unsigned long scale_rt_power(int cpu)
{
struct rq *rq = cpu_rq(cpu);
@@ -3827,7 +3843,8 @@
unsigned long power = SCHED_LOAD_SCALE;
struct sched_group *sdg = sd->groups;
- /* here we could scale based on cpufreq */
+ power *= arch_scale_freq_power(sd, cpu);
+ power >> SCHED_LOAD_SHIFT;
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
power *= arch_scale_smt_power(sd, cpu);
--
* [patch -rt 12/17] x86: sched: provide arch implementations using aperf/mperf
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-11.patch --]
[-- Type: text/plain, Size: 3637 bytes --]
APERF/MPERF support for cpu_power.
APERF/MPERF is arch defined to be a relative scale of work capacity
per logical cpu; this is assumed to include SMT and Turbo mode.
APERF/MPERF are specified to both reset to 0 when either counter
wraps, which is highly inconvenient, since that'll give a blip when
that happens. The manual specifies writing 0 to the counters after
each read, but that's 1) too expensive, and 2) destroys the
possibility of sharing these counters with other users, so we live
with the blip - the other existing user does too.
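As a rough illustration of the effect on cpu_power (numbers invented):

  cpu running in turbo at ~1.2x the reference clock:
      scale_aperfmperf() ~= 1229
      cpu_power = 1024 * 1229 >> SCHED_LOAD_SHIFT = 1229

  cpu throttled to half the reference clock:
      scale_aperfmperf() ~= 512
      cpu_power = 1024 * 512 >> SCHED_LOAD_SHIFT = 512

When APERF/MPERF is available the SMT gain is folded into this ratio
as well, which is why arch_scale_smt_power() below just returns
SCHED_LOAD_SCALE in that case.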
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
arch/x86/kernel/cpu/Makefile | 2 -
arch/x86/kernel/cpu/sched.c | 58 +++++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 4 ++
3 files changed, 63 insertions(+), 1 deletion(-)
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/Makefile
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/arch/x86/kernel/cpu/Makefile 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/Makefile 2009-10-21 10:49:00.000000000 -0400
@@ -13,7 +13,7 @@
obj-y := intel_cacheinfo.o addon_cpuid_features.o
obj-y += proc.o capflags.o powerflags.o common.o
-obj-y += vmware.o hypervisor.o
+obj-y += vmware.o hypervisor.o sched.o
obj-$(CONFIG_X86_32) += bugs.o cmpxchg.o
obj-$(CONFIG_X86_64) += bugs_64.o
Index: linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/sched.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.31.4-rt14-lb1/arch/x86/kernel/cpu/sched.c 2009-10-21 10:49:00.000000000 -0400
@@ -0,0 +1,58 @@
+#include <linux/sched.h>
+#include <linux/math64.h>
+#include <linux/percpu.h>
+#include <linux/irqflags.h>
+
+#include <asm/cpufeature.h>
+#include <asm/processor.h>
+
+static DEFINE_PER_CPU(struct aperfmperf, old_aperfmperf);
+
+static unsigned long scale_aperfmperf(void)
+{
+ struct aperfmperf cur, val, *old = &__get_cpu_var(old_aperfmperf);
+ unsigned long ratio = SCHED_LOAD_SCALE;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ get_aperfmperf(&val);
+ local_irq_restore(flags);
+
+ cur = val;
+ cur.aperf -= old->aperf;
+ cur.mperf -= old->mperf;
+ *old = val;
+
+ cur.mperf >>= SCHED_LOAD_SHIFT;
+ if (cur.mperf)
+ ratio = div_u64(cur.aperf, cur.mperf);
+
+ return ratio;
+}
+
+unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+{
+ /*
+ * do aperf/mperf on the cpu level because it includes things
+ * like turbo mode, which are relevant to full cores.
+ */
+ if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return scale_aperfmperf();
+
+ /*
+ * maybe have something cpufreq here
+ */
+
+ return default_scale_freq_power(sd, cpu);
+}
+
+unsigned long arch_scale_smt_power(struct sched_domain *sd, int cpu)
+{
+ /*
+ * aperf/mperf already includes the smt gain
+ */
+ if (boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return SCHED_LOAD_SCALE;
+
+ return default_scale_smt_power(sd, cpu);
+}
Index: linux-2.6.31.4-rt14-lb1/include/linux/sched.h
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/include/linux/sched.h 2009-10-21 10:47:15.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/include/linux/sched.h 2009-10-21 10:49:00.000000000 -0400
@@ -1047,6 +1047,10 @@
}
#endif /* !CONFIG_SMP */
+
+unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu);
+unsigned long default_scale_smt_power(struct sched_domain *sd, int cpu);
+
struct io_context; /* See blkdev.h */
--
* [patch -rt 13/17] sched: cleanup wake_idle power saving
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-12.patch --]
[-- Type: text/plain, Size: 3094 bytes --]
Hopefully a more readable version of the same.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched_fair.c | 58 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 39 insertions(+), 19 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:47:14.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:01.000000000 -0400
@@ -1040,6 +1040,41 @@
se->vruntime = rightmost->vruntime + 1;
}
+#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
+/*
+ * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
+ * are idle and this is not a kernel thread and this task's affinity
+ * allows it to be moved to preferred cpu, then just move!
+ *
+ * XXX - can generate significant overload on perferred_wakeup_cpu
+ * with plenty of idle cpus, leading to a significant loss in
+ * throughput.
+ *
+ * Returns: < 0 - no placement decision made
+ * >= 0 - place on cpu
+ */
+static int wake_idle_power_save(int cpu, struct task_struct *p)
+{
+ int this_cpu = smp_processor_id();
+ int wakeup_cpu;
+
+ if (sched_mc_power_savings < POWERSAVINGS_BALANCE_WAKEUP)
+ return -1;
+
+ if (!idle_cpu(cpu) || !idle_cpu(this_cpu))
+ return -1;
+
+ if (!p->mm || (p->flags & PF_KTHREAD))
+ return -1;
+
+ wakeup_cpu = cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
+
+ if (!cpu_isset(wakeup_cpu, p->cpus_allowed))
+ return -1;
+
+ return wakeup_cpu;
+}
+
/*
* wake_idle() will wake a task on an idle cpu if task->cpu is
* not idle and an idle cpu is available. The span of cpus to
@@ -1050,29 +1085,14 @@
*
* Returns the CPU we should wake onto.
*/
-#if defined(ARCH_HAS_SCHED_WAKE_IDLE)
static int wake_idle(int cpu, struct task_struct *p)
{
struct sched_domain *sd;
int i;
- unsigned int chosen_wakeup_cpu;
- int this_cpu;
-
- /*
- * At POWERSAVINGS_BALANCE_WAKEUP level, if both this_cpu and prev_cpu
- * are idle and this is not a kernel thread and this task's affinity
- * allows it to be moved to preferred cpu, then just move!
- */
-
- this_cpu = smp_processor_id();
- chosen_wakeup_cpu =
- cpu_rq(this_cpu)->rd->sched_mc_preferred_wakeup_cpu;
- if (sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
- idle_cpu(cpu) && idle_cpu(this_cpu) &&
- p->mm && !(p->flags & PF_KTHREAD) &&
- cpu_isset(chosen_wakeup_cpu, p->cpus_allowed))
- return chosen_wakeup_cpu;
+ i = wake_idle_power_save(cpu, p);
+ if (i >= 0)
+ return i;
/*
* If it is idle, then it is the best cpu to run this task.
@@ -1081,7 +1101,7 @@
* Siblings must be also busy(in most cases) as they didn't already
* pickup the extra load from this cpu and hence we need not check
* sibling runqueue info. This will avoid the checks and cache miss
- * penalities associated with that.
+ * penalties associated with that.
*/
if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
return cpu;
--
* [patch -rt 14/17] sched: cleanup wake_idle
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched-lb-13.patch --]
[-- Type: text/plain, Size: 2490 bytes --]
A more readable version, with a few differences:
- don't check against the root domain, but instead check
SD_LOAD_BALANCE
- don't re-iterate the cpus already iterated on the previous SD
- use rcu_read_lock() around the sd iteration
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
---
kernel/sched_fair.c | 45 +++++++++++++++++++++++++--------------------
1 file changed, 25 insertions(+), 20 deletions(-)
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:49:01.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:02.000000000 -0400
@@ -1080,14 +1080,13 @@
* not idle and an idle cpu is available. The span of cpus to
* search starts with cpus closest then further out as needed,
* so we always favor a closer, idle cpu.
- * Domains may include CPUs that are not usable for migration,
- * hence we need to mask them out (cpu_active_mask)
*
* Returns the CPU we should wake onto.
*/
static int wake_idle(int cpu, struct task_struct *p)
{
- struct sched_domain *sd;
+ struct rq *task_rq = task_rq(p);
+ struct sched_domain *sd, *child = NULL;
int i;
i = wake_idle_power_save(cpu, p);
@@ -1106,24 +1105,34 @@
if (idle_cpu(cpu) || cpu_rq(cpu)->cfs.nr_running > 1)
return cpu;
- for_each_domain(cpu, sd) {
- if ((sd->flags & SD_WAKE_IDLE)
- || ((sd->flags & SD_WAKE_IDLE_FAR)
- && !task_hot(p, task_rq(p)->clock, sd))) {
- for_each_cpu_and(i, sched_domain_span(sd),
- &p->cpus_allowed) {
- if (cpu_active(i) && idle_cpu(i)) {
- if (i != task_cpu(p)) {
- schedstat_inc(p,
- se.nr_wakeups_idle);
- }
- return i;
- }
- }
- } else {
+ rcu_read_lock();
+ for_each_domain(cpu, sd) {
+ if (!(sd->flags & SD_LOAD_BALANCE))
+ break;
+
+ if (!(sd->flags & SD_WAKE_IDLE) &&
+ (task_hot(p, task_rq->clock, sd) || !(sd->flags & SD_WAKE_IDLE_FAR)))
break;
- }
- }
+
+ for_each_cpu_and(i, sched_domain_span(sd), &p->cpus_allowed) {
+ if (child && cpumask_test_cpu(i, sched_domain_span(child)))
+ continue;
+
+ if (!idle_cpu(i))
+ continue;
+
+ if (task_cpu(p) != i)
+ schedstat_inc(p, se.nr_wakeups_idle);
+
+ cpu = i;
+ goto unlock;
+ }
+
+ child = sd;
+ }
+unlock:
+ rcu_read_unlock();
+
return cpu;
}
#else /* !ARCH_HAS_SCHED_WAKE_IDLE*/
--
* [patch -rt 15/17] sched: Add a missing =
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: sched_fixes.patch --]
[-- Type: text/plain, Size: 601 bytes --]
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:48:58.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-21 10:49:03.000000000 -0400
@@ -3844,7 +3844,7 @@
struct sched_group *sdg = sd->groups;
power *= arch_scale_freq_power(sd, cpu);
- power >> SCHED_LOAD_SHIFT;
+ power >>= SCHED_LOAD_SHIFT;
if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
power *= arch_scale_smt_power(sd, cpu);
--
* [patch -rt 16/17] sched: Deal with low-load in wake_affine()
From: dino @ 2009-10-22 12:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: wake_affine_low_load.patch --]
[-- Type: text/plain, Size: 1403 bytes --]
wake_affine() would always fail under low-load situations where
both prev and this were idle, because adding a single task will
always be a significant imbalance, even if there's nothing
around that could balance it.
Deal with this by allowing imbalance when there's nothing you
can do about it.
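To make the failure mode concrete (a sketch assuming no group
scheduling, so effective_load() simply returns its wl argument, and a
nice-0 task with load weight 1024):

  prev and this both idle, the sync wakeup has dropped tl to 0:

    old:  balanced = 100 * (0 + 1024) <= imbalance * (0 + 0)   -> always false
    new:  balanced = !tl || ...                                -> true

so the affine wakeup is now allowed when there is no load to compare
against.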
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched_fair.c 2009-10-21 10:49:02.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched_fair.c 2009-10-21 10:49:04.000000000 -0400
@@ -1264,7 +1264,17 @@
tg = task_group(p);
weight = p->se.load.weight;
- balanced = 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
+ /*
+ * In low-load situations, where prev_cpu is idle and this_cpu is idle
+ * due to the sync cause above having dropped tl to 0, we'll always have
+ * an imbalance, but there's really nothing you can do about that, so
+ * that's good too.
+ *
+ * Otherwise check if either cpus are near enough in load to allow this
+ * task to be woken on this_cpu.
+ */
+ balanced = !tl ||
+ 100*(tl + effective_load(tg, this_cpu, weight, weight)) <=
imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
/*
--
* [patch -rt 17/17] sched: Fix dynamic power-balancing crash
From: dino @ 2009-10-22 12:38 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
Cc: linux-kernel, linux-rt-users, John Stultz, Darren Hart,
John Kacur
[-- Attachment #1: fix_power_bal_crash.patch --]
[-- Type: text/plain, Size: 1520 bytes --]
This crash:
[ 1774.088275] divide error: 0000 [#1] SMP
[ 1774.100355] CPU 13
[ 1774.102498] Modules linked in:
[ 1774.105631] Pid: 30881, comm: hackbench Not tainted 2.6.31-rc8-tip-01308-g484d664-dirty #1629 X8DTN
[ 1774.114807] RIP: 0010:[<ffffffff81041c38>] [<ffffffff81041c38>]
sched_balance_self+0x19b/0x2d4
Triggers because update_group_power() modifies the sd tree and does
temporary calculations there - not considering that other CPUs
could observe intermediate values, such as the zero initial value.
Calculate it in a temporary variable instead. (we need no memory
barrier as these are all statistical values anyway)
Got the same oops with the backport to -rt.
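The zero is observable because group->cpu_power is used as a divisor on
the task-placement path; one such site (visible in the patch 08 hunk
above, in the find_idlest_group() area reached from
sched_balance_self()) is:

  avg_load = (avg_load * SCHED_LOAD_SCALE) / group->cpu_power;

If another cpu executes that between the 'sdg->cpu_power = 0' store and
the final sum, it divides by zero; keeping the sum in a local and
writing cpu_power once closes the window.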
Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
Index: linux-2.6.31.4-rt14-lb1/kernel/sched.c
===================================================================
--- linux-2.6.31.4-rt14-lb1.orig/kernel/sched.c 2009-10-21 10:49:03.000000000 -0400
+++ linux-2.6.31.4-rt14-lb1/kernel/sched.c 2009-10-22 01:48:41.000000000 -0400
@@ -3864,19 +3864,22 @@
{
struct sched_domain *child = sd->child;
struct sched_group *group, *sdg = sd->groups;
+ unsigned long power;
if (!child) {
update_cpu_power(sd, cpu);
return;
}
- sdg->cpu_power = 0;
+ power = 0;
group = child->groups;
do {
- sdg->cpu_power += group->cpu_power;
+ power += group->cpu_power;
group = group->next;
} while (group != child->groups);
+
+ sdg->cpu_power = power;
}
/**
--