* [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package.
2009-04-14 4:55 [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Gautham R Shenoy
@ 2009-04-14 4:55 ` Gautham R Shenoy
2009-04-14 10:01 ` [tip:sched/core] " tip-bot for Gautham R Shenoy
2009-04-14 17:32 ` [RFC PATCH 2 1/2] " Randy Dunlap
2009-04-14 4:55 ` [RFC PATCH 2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer() Gautham R Shenoy
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: Gautham R Shenoy @ 2009-04-14 4:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vaidyanathan Srinivasan
Cc: linux-kernel, Balbir Singh, Suresh Siddha, Andi Kleen,
Randy Dunlap, Gautham R Shenoy
Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.
For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.
Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.
The algorithm is activated only when sched_mc/smt_power_savings != 0.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
---
kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 118 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 6cc1fd5..d702be7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4228,10 +4228,126 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
static struct {
atomic_t load_balancer;
cpumask_var_t cpu_mask;
+ cpumask_var_t ilb_grp_nohz_mask;
} nohz ____cacheline_aligned = {
.load_balancer = ATOMIC_INIT(-1),
};
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+/**
+ * lowest_flag_domain - Return lowest sched_domain containing flag.
+ * @cpu: The cpu whose lowest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the lowest sched_domain
+ * for the given cpu.
+ *
+ * Returns the lowest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd;
+
+ for_each_domain(cpu, sd)
+ if (sd && (sd->flags & flag))
+ break;
+
+ return sd;
+}
+
+/**
+ * for_each_flag_domain - Iterates over sched_domains containing the flag.
+ * @cpu: The cpu whose domains we're iterating over.
+ * @sd: variable holding the value of the power_savings_sd
+ * for cpu.
+ * @flag: The flag to filter the sched_domains to be iterated.
+ *
+ * Iterates over all the scheduler domains for a given cpu that has the 'flag'
+ * set, starting from the lowest sched_domain to the highest.
+ */
+#define for_each_flag_domain(cpu, sd, flag) \
+ for (sd = lowest_flag_domain(cpu, flag); \
+ (sd && (sd->flags & flag)); sd = sd->parent)
+
+/**
+ * is_semi_idle_group - Checks if the given sched_group is semi-idle.
+ * @ilb_group: group to be checked for semi-idleness
+ *
+ * Returns: 1 if the group is semi-idle. 0 otherwise.
+ *
+ * We define a sched_group to be semi idle if it has atleast one idle-CPU
+ * and atleast one non-idle CPU. This helper function checks if the given
+ * sched_group is semi-idle or not.
+ */
+static inline int is_semi_idle_group(struct sched_group *ilb_group)
+{
+ cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+ sched_group_cpus(ilb_group));
+
+ /*
+ * A sched_group is semi-idle when it has atleast one busy cpu
+ * and atleast one idle cpu.
+ */
+ if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+ return 0;
+
+ if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+ return 0;
+
+ return 1;
+}
+/**
+ * find_new_ilb - Finds the optimum idle load balancer for nomination.
+ * @cpu: The cpu which is nominating a new idle_load_balancer.
+ *
+ * Returns: Returns the id of the idle load balancer if it exists,
+ * Else, returns >= nr_cpu_ids.
+ *
+ * This algorithm picks the idle load balancer such that it belongs to a
+ * semi-idle powersavings sched_domain. The idea is to try and avoid
+ * completely idle packages/cores just for the purpose of idle load balancing
+ * when there are other idle cpu's which are better suited for that job.
+ */
+static int find_new_ilb(int cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *ilb_group;
+
+ /*
+ * Have idle load balancer selection from semi-idle packages only
+ * when power-aware load balancing is enabled
+ */
+ if (!(sched_smt_power_savings || sched_mc_power_savings))
+ goto out_done;
+
+ /*
+ * Optimize for the case when we have no idle CPUs or only one
+ * idle CPU. Don't walk the sched_domain hierarchy in such cases
+ */
+ if (cpumask_weight(nohz.cpu_mask) < 2)
+ goto out_done;
+
+ for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
+ ilb_group = sd->groups;
+
+ do {
+ if (is_semi_idle_group(ilb_group))
+ return cpumask_first(nohz.ilb_grp_nohz_mask);
+
+ ilb_group = ilb_group->next;
+
+ } while (ilb_group != sd->groups);
+ }
+
+out_done:
+ return cpumask_first(nohz.cpu_mask);
+}
+#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
+static inline int find_new_ilb(int call_cpu)
+{
+ return first_cpu(nohz.cpu_mask);
+}
+#endif
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4456,15 +4572,7 @@ static inline void trigger_load_balance(struct rq *rq, int cpu)
}
if (atomic_read(&nohz.load_balancer) == -1) {
- /*
- * simple selection for now: Nominate the
- * first cpu in the nohz list to be the next
- * ilb owner.
- *
- * TBD: Traverse the sched domains and nominate
- * the nearest cpu in the nohz.cpu_mask.
- */
- int ilb = cpumask_first(nohz.cpu_mask);
+ int ilb = find_new_ilb(cpu);
if (ilb < nr_cpu_ids)
resched_cpu(ilb);
@@ -8985,6 +9093,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
alloc_bootmem_cpumask_var(&nohz.cpu_mask);
+ alloc_bootmem_cpumask_var(&nohz.ilb_grp_nohz_mask);
#endif
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */
^ permalink raw reply related [flat|nested] 10+ messages in thread* [tip:sched/core] sched: Nominate idle load balancer from a semi-idle package.
2009-04-14 4:55 ` [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package Gautham R Shenoy
@ 2009-04-14 10:01 ` tip-bot for Gautham R Shenoy
2009-04-14 17:32 ` [RFC PATCH 2 1/2] " Randy Dunlap
1 sibling, 0 replies; 10+ messages in thread
From: tip-bot for Gautham R Shenoy @ 2009-04-14 10:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, ego, hpa, mingo, a.p.zijlstra, tglx, mingo
Commit-ID: f711f6090a81cbd396b63de90f415d33f563af9b
Gitweb: http://git.kernel.org/tip/f711f6090a81cbd396b63de90f415d33f563af9b
Author: Gautham R Shenoy <ego@in.ibm.com>
AuthorDate: Tue, 14 Apr 2009 10:25:30 +0530
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 14 Apr 2009 11:49:19 +0200
sched: Nominate idle load balancer from a semi-idle package.
Currently the nomination of idle-load balancer is done by choosing the first
idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
such an idle cpu could come from a completely idle core/package thereby
preventing the whole core/package from being in a low-power state.
For eg, consider a quad-core dual package system. The cpu numbering need
not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
as fewer Packages/Cores active. But the current idle load balancer logic
goes against this by choosing the first_cpu in the nohz.cpu_mask and not
taking the system topology into consideration.
Improve the algorithm to nominate the idle load balancer from a semi idle
cores/packages thereby increasing the probability of the cores/packages being
in deeper sleep states for longer duration.
The algorithm is activated only when sched_mc/smt_power_savings != 0.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045530.7645.12175.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 118 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 5724508..b0fefa3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4240,10 +4240,126 @@ static void active_load_balance(struct rq *busiest_rq, int busiest_cpu)
static struct {
atomic_t load_balancer;
cpumask_var_t cpu_mask;
+ cpumask_var_t ilb_grp_nohz_mask;
} nohz ____cacheline_aligned = {
.load_balancer = ATOMIC_INIT(-1),
};
+#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+/**
+ * lowest_flag_domain - Return lowest sched_domain containing flag.
+ * @cpu: The cpu whose lowest level of sched domain is to
+ * be returned.
+ * @flag: The flag to check for the lowest sched_domain
+ * for the given cpu.
+ *
+ * Returns the lowest sched_domain of a cpu which contains the given flag.
+ */
+static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
+{
+ struct sched_domain *sd;
+
+ for_each_domain(cpu, sd)
+ if (sd && (sd->flags & flag))
+ break;
+
+ return sd;
+}
+
+/**
+ * for_each_flag_domain - Iterates over sched_domains containing the flag.
+ * @cpu: The cpu whose domains we're iterating over.
+ * @sd: variable holding the value of the power_savings_sd
+ * for cpu.
+ * @flag: The flag to filter the sched_domains to be iterated.
+ *
+ * Iterates over all the scheduler domains for a given cpu that has the 'flag'
+ * set, starting from the lowest sched_domain to the highest.
+ */
+#define for_each_flag_domain(cpu, sd, flag) \
+ for (sd = lowest_flag_domain(cpu, flag); \
+ (sd && (sd->flags & flag)); sd = sd->parent)
+
+/**
+ * is_semi_idle_group - Checks if the given sched_group is semi-idle.
+ * @ilb_group: group to be checked for semi-idleness
+ *
+ * Returns: 1 if the group is semi-idle. 0 otherwise.
+ *
+ * We define a sched_group to be semi idle if it has atleast one idle-CPU
+ * and atleast one non-idle CPU. This helper function checks if the given
+ * sched_group is semi-idle or not.
+ */
+static inline int is_semi_idle_group(struct sched_group *ilb_group)
+{
+ cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
+ sched_group_cpus(ilb_group));
+
+ /*
+ * A sched_group is semi-idle when it has atleast one busy cpu
+ * and atleast one idle cpu.
+ */
+ if (cpumask_empty(nohz.ilb_grp_nohz_mask))
+ return 0;
+
+ if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
+ return 0;
+
+ return 1;
+}
+/**
+ * find_new_ilb - Finds the optimum idle load balancer for nomination.
+ * @cpu: The cpu which is nominating a new idle_load_balancer.
+ *
+ * Returns: Returns the id of the idle load balancer if it exists,
+ * Else, returns >= nr_cpu_ids.
+ *
+ * This algorithm picks the idle load balancer such that it belongs to a
+ * semi-idle powersavings sched_domain. The idea is to try and avoid
+ * completely idle packages/cores just for the purpose of idle load balancing
+ * when there are other idle cpu's which are better suited for that job.
+ */
+static int find_new_ilb(int cpu)
+{
+ struct sched_domain *sd;
+ struct sched_group *ilb_group;
+
+ /*
+ * Have idle load balancer selection from semi-idle packages only
+ * when power-aware load balancing is enabled
+ */
+ if (!(sched_smt_power_savings || sched_mc_power_savings))
+ goto out_done;
+
+ /*
+ * Optimize for the case when we have no idle CPUs or only one
+ * idle CPU. Don't walk the sched_domain hierarchy in such cases
+ */
+ if (cpumask_weight(nohz.cpu_mask) < 2)
+ goto out_done;
+
+ for_each_flag_domain(cpu, sd, SD_POWERSAVINGS_BALANCE) {
+ ilb_group = sd->groups;
+
+ do {
+ if (is_semi_idle_group(ilb_group))
+ return cpumask_first(nohz.ilb_grp_nohz_mask);
+
+ ilb_group = ilb_group->next;
+
+ } while (ilb_group != sd->groups);
+ }
+
+out_done:
+ return cpumask_first(nohz.cpu_mask);
+}
+#else /* (CONFIG_SCHED_MC || CONFIG_SCHED_SMT) */
+static inline int find_new_ilb(int call_cpu)
+{
+ return first_cpu(nohz.cpu_mask);
+}
+#endif
+
/*
* This routine will try to nominate the ilb (idle load balancing)
* owner among the cpus whose ticks are stopped. ilb owner will do the idle
@@ -4468,15 +4584,7 @@ static inline void trigger_load_balance(struct rq *rq, int cpu)
}
if (atomic_read(&nohz.load_balancer) == -1) {
- /*
- * simple selection for now: Nominate the
- * first cpu in the nohz list to be the next
- * ilb owner.
- *
- * TBD: Traverse the sched domains and nominate
- * the nearest cpu in the nohz.cpu_mask.
- */
- int ilb = cpumask_first(nohz.cpu_mask);
+ int ilb = find_new_ilb(cpu);
if (ilb < nr_cpu_ids)
resched_cpu(ilb);
@@ -9051,6 +9159,7 @@ void __init sched_init(void)
#ifdef CONFIG_SMP
#ifdef CONFIG_NO_HZ
alloc_bootmem_cpumask_var(&nohz.cpu_mask);
+ alloc_bootmem_cpumask_var(&nohz.ilb_grp_nohz_mask);
#endif
alloc_bootmem_cpumask_var(&cpu_isolated_map);
#endif /* SMP */
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package.
2009-04-14 4:55 ` [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package Gautham R Shenoy
2009-04-14 10:01 ` [tip:sched/core] " tip-bot for Gautham R Shenoy
@ 2009-04-14 17:32 ` Randy Dunlap
1 sibling, 0 replies; 10+ messages in thread
From: Randy Dunlap @ 2009-04-14 17:32 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Ingo Molnar, Peter Zijlstra, Vaidyanathan Srinivasan,
linux-kernel, Balbir Singh, Suresh Siddha, Andi Kleen
Gautham R Shenoy wrote:
> kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 files changed, 118 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 6cc1fd5..d702be7 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
>
> +/**
> + * is_semi_idle_group - Checks if the given sched_group is semi-idle.
> + * @ilb_group: group to be checked for semi-idleness
> + *
> + * Returns: 1 if the group is semi-idle. 0 otherwise.
> + *
> + * We define a sched_group to be semi idle if it has atleast one idle-CPU
> + * and atleast one non-idle CPU. This helper function checks if the given
"at least" (2x)
> + * sched_group is semi-idle or not.
> + */
> +static inline int is_semi_idle_group(struct sched_group *ilb_group)
> +{
> + cpumask_and(nohz.ilb_grp_nohz_mask, nohz.cpu_mask,
> + sched_group_cpus(ilb_group));
> +
> + /*
> + * A sched_group is semi-idle when it has atleast one busy cpu
> + * and atleast one idle cpu.
"at least" (2x)
> + */
> + if (cpumask_empty(nohz.ilb_grp_nohz_mask))
> + return 0;
> +
> + if (cpumask_equal(nohz.ilb_grp_nohz_mask, sched_group_cpus(ilb_group)))
> + return 0;
> +
> + return 1;
> +}
--
~Randy
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH 2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer()
2009-04-14 4:55 [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Gautham R Shenoy
2009-04-14 4:55 ` [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package Gautham R Shenoy
@ 2009-04-14 4:55 ` Gautham R Shenoy
2009-04-14 10:01 ` [tip:sched/core] " tip-bot for Gautham R Shenoy
2009-04-14 9:48 ` [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Peter Zijlstra
2009-04-22 1:05 ` Suresh Siddha
3 siblings, 1 reply; 10+ messages in thread
From: Gautham R Shenoy @ 2009-04-14 4:55 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vaidyanathan Srinivasan
Cc: linux-kernel, Balbir Singh, Suresh Siddha, Andi Kleen,
Randy Dunlap, Gautham R Shenoy
The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.
Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
---
kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index d702be7..bbf367d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4402,8 +4402,24 @@ int select_nohz_load_balancer(int stop_tick)
/* make me the ilb owner */
if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
return 1;
- } else if (atomic_read(&nohz.load_balancer) == cpu)
+ } else if (atomic_read(&nohz.load_balancer) == cpu) {
+ int new_ilb;
+
+ if (!(sched_smt_power_savings ||
+ sched_mc_power_savings))
+ return 1;
+ /*
+ * Check to see if there is a more power-efficient
+ * ilb.
+ */
+ new_ilb = find_new_ilb(cpu);
+ if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
+ atomic_set(&nohz.load_balancer, -1);
+ resched_cpu(new_ilb);
+ return 0;
+ }
return 1;
+ }
} else {
if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
return 0;
^ permalink raw reply related [flat|nested] 10+ messages in thread* [tip:sched/core] sched: Nominate a power-efficient ilb in select_nohz_balancer()
2009-04-14 4:55 ` [RFC PATCH 2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer() Gautham R Shenoy
@ 2009-04-14 10:01 ` tip-bot for Gautham R Shenoy
0 siblings, 0 replies; 10+ messages in thread
From: tip-bot for Gautham R Shenoy @ 2009-04-14 10:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, ego, hpa, mingo, a.p.zijlstra, tglx, mingo
Commit-ID: e790fb0ba64bfec158e1219d899cb588275d12ab
Gitweb: http://git.kernel.org/tip/e790fb0ba64bfec158e1219d899cb588275d12ab
Author: Gautham R Shenoy <ego@in.ibm.com>
AuthorDate: Tue, 14 Apr 2009 10:25:35 +0530
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 14 Apr 2009 11:49:19 +0200
sched: Nominate a power-efficient ilb in select_nohz_balancer()
The CPU that first goes idle becomes the idle-load-balancer and remains
that until either it picks up a task or till all the CPUs of the system
goes idle.
Optimize this further to allow it to relinquish it's post
once all it's siblings in the power-aware sched_domain go idle, thereby
allowing the whole package-core to go idle. While relinquising the post,
nominate another an idle-load balancer from a semi-idle core/package.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090414045535.7645.31641.stgit@sofia.in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index b0fefa3..36d213b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4414,8 +4414,24 @@ int select_nohz_load_balancer(int stop_tick)
/* make me the ilb owner */
if (atomic_cmpxchg(&nohz.load_balancer, -1, cpu) == -1)
return 1;
- } else if (atomic_read(&nohz.load_balancer) == cpu)
+ } else if (atomic_read(&nohz.load_balancer) == cpu) {
+ int new_ilb;
+
+ if (!(sched_smt_power_savings ||
+ sched_mc_power_savings))
+ return 1;
+ /*
+ * Check to see if there is a more power-efficient
+ * ilb.
+ */
+ new_ilb = find_new_ilb(cpu);
+ if (new_ilb < nr_cpu_ids && new_ilb != cpu) {
+ atomic_set(&nohz.load_balancer, -1);
+ resched_cpu(new_ilb);
+ return 0;
+ }
return 1;
+ }
} else {
if (!cpumask_test_cpu(cpu, nohz.cpu_mask))
return 0;
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB
2009-04-14 4:55 [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Gautham R Shenoy
2009-04-14 4:55 ` [RFC PATCH 2 1/2] sched: Nominate idle load balancer from a semi-idle package Gautham R Shenoy
2009-04-14 4:55 ` [RFC PATCH 2 2/2] sched: Nominate a power-efficient ilb in select_nohz_balancer() Gautham R Shenoy
@ 2009-04-14 9:48 ` Peter Zijlstra
2009-04-14 9:58 ` Gautham R Shenoy
2009-04-22 1:05 ` Suresh Siddha
3 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2009-04-14 9:48 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Ingo Molnar, Vaidyanathan Srinivasan, linux-kernel, Balbir Singh,
Suresh Siddha, Andi Kleen, Randy Dunlap
On Tue, 2009-04-14 at 10:25 +0530, Gautham R Shenoy wrote:
> Hi,
>
> This is the second iteration of the patchset which aims at improving
> the idle-load balancer nomination logic, by taking the system topology
> into consideration.
>
> Changes from v1 (found here: http://lkml.org/lkml/2009/4/2/246)
> o Fixed the kernel-doc style comments.
> o Renamed a variable to better reflect it's usage.
>
> Background
> ======================================
> An idle-load balancer is an idle-cpu which does not turn off it's sched_ticks
> and performs load-balancing on behalf of the other idle CPUs. Currently,
> this idle load balancer is nominated as the first_cpu(nohz.cpu_mask)
>
> The drawback of the current method is that the CPU numbering in the
> cores/packages need not necessarily be sequential. For example, on a
> two-socket, Quad core system, the CPU numbering can be as follows:
>
> |-------------------------------| |-------------------------------|
> | | | | | |
> | 0 | 2 | | 1 | 3 |
> |-------------------------------| |-------------------------------|
> | | | | | |
> | 4 | 6 | | 5 | 7 |
> |-------------------------------| |-------------------------------|
>
> Now, the other power-savings settings such as the sched_mc/smt_power_savings
> and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> the system topology into consideration, with the intention of keeping
> as many "power-domains" (cores/packages) in the low-power state.
>
> The current idle-load-balancer nomination does not necessarily align towards
> this policy. For eg, we could be having tasks and interrupts largely running
> on the first package with the intention of keeping the second package idle.
> Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> the 2nd package, would in turn prevent the 2nd package from going into a
> deeper sleep state.
>
> Instead the role of the idle-load balancer could have been assumed by an
> idle CPU from the first package, thereby helping the second package go
> completely idle.
>
> This patchset has been tested with 2.6.30-rc1 on a Two-Socket
> Quad core system with the topology as mentioned above.
>
> |----------------------------------------------------------------------------|
> | With Patchset + sched_mc_power_savings = 1 |
> |----------------------------------------------------------------------------|
> |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> | | | on Package 0 | on Package 1 |
> |----------------------------------------------------------------------------|
> |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 227.234s | 56969 | 57080 | 1003 | 588 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 55995 | 703 | 583 | 600 |
> |----------------------------------------------------------------------------|
> |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 227.136s | 1109 | 611 | 57074 | 57091 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 709 | 637 | 56133 | 587 |
> |----------------------------------------------------------------------------|
>
> We see here that the idle load balancer is chosen from the package which is
> busy. In the first case, it's CPU4 and in the second case it's CPU5.
>
> |----------------------------------------------------------------------------|
> | With Patchset + sched_mc_power_savings = 1 |
> |----------------------------------------------------------------------------|
> |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> | | | on Package 0 | on Package 1 |
> |----------------------------------------------------------------------------|
> |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 228.786s | 59094 | 61994 | 13984 | 43652 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 1827 | 734 | 748 | 760 |
> |----------------------------------------------------------------------------|
> |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> | | 228.435s | 57013 | 876 | 58596 | 61633 |
> | | |----------------------------------------------|
> | | | CPU4 | CPU6 | CPU5 | CPU7 |
> | | | 772 | 1133 | 850 | 910 |
> |----------------------------------------------------------------------------|
>
> Here, we see that the idle load balancer is chosen from the other package,
> despite choosing sched_mc_power_savings = 1. In the first case, we have
> CPU1 and CPU3 sharing the responsibility among themselves. In the second case,
> it's CPU0 and CPU6, which assume that role.
Both tables above claim to be _with_ the pathes :-), from the
accompanying text one can deduce its the bottom one that is without.
Patches look straight-forward enough, seems good stuff.
Thanks!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB
2009-04-14 9:48 ` [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Peter Zijlstra
@ 2009-04-14 9:58 ` Gautham R Shenoy
0 siblings, 0 replies; 10+ messages in thread
From: Gautham R Shenoy @ 2009-04-14 9:58 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Vaidyanathan Srinivasan, linux-kernel, Balbir Singh,
Suresh Siddha, Andi Kleen, Randy Dunlap
On Tue, Apr 14, 2009 at 11:48:04AM +0200, Peter Zijlstra wrote:
> On Tue, 2009-04-14 at 10:25 +0530, Gautham R Shenoy wrote:
> > Hi,
> >
> > This is the second iteration of the patchset which aims at improving
> > the idle-load balancer nomination logic, by taking the system topology
> > into consideration.
> >
> > Changes from v1 (found here: http://lkml.org/lkml/2009/4/2/246)
> > o Fixed the kernel-doc style comments.
> > o Renamed a variable to better reflect it's usage.
> >
> > Background
> > ======================================
> > An idle-load balancer is an idle-cpu which does not turn off it's sched_ticks
> > and performs load-balancing on behalf of the other idle CPUs. Currently,
> > this idle load balancer is nominated as the first_cpu(nohz.cpu_mask)
> >
> > The drawback of the current method is that the CPU numbering in the
> > cores/packages need not necessarily be sequential. For example, on a
> > two-socket, Quad core system, the CPU numbering can be as follows:
> >
> > |-------------------------------| |-------------------------------|
> > | | | | | |
> > | 0 | 2 | | 1 | 3 |
> > |-------------------------------| |-------------------------------|
> > | | | | | |
> > | 4 | 6 | | 5 | 7 |
> > |-------------------------------| |-------------------------------|
> >
> > Now, the other power-savings settings such as the sched_mc/smt_power_savings
> > and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> > the system topology into consideration, with the intention of keeping
> > as many "power-domains" (cores/packages) in the low-power state.
> >
> > The current idle-load-balancer nomination does not necessarily align towards
> > this policy. For eg, we could be having tasks and interrupts largely running
> > on the first package with the intention of keeping the second package idle.
> > Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> > which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> > the 2nd package, would in turn prevent the 2nd package from going into a
> > deeper sleep state.
> >
> > Instead the role of the idle-load balancer could have been assumed by an
> > idle CPU from the first package, thereby helping the second package go
> > completely idle.
> >
> > This patchset has been tested with 2.6.30-rc1 on a Two-Socket
> > Quad core system with the topology as mentioned above.
> >
> > |----------------------------------------------------------------------------|
> > | With Patchset + sched_mc_power_savings = 1 |
> > |----------------------------------------------------------------------------|
> > |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> > | | | on Package 0 | on Package 1 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 227.234s | 56969 | 57080 | 1003 | 588 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 55995 | 703 | 583 | 600 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 227.136s | 1109 | 611 | 57074 | 57091 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 709 | 637 | 56133 | 587 |
> > |----------------------------------------------------------------------------|
> >
> > We see here that the idle load balancer is chosen from the package which is
> > busy. In the first case, it's CPU4 and in the second case it's CPU5.
> >
> > |----------------------------------------------------------------------------|
> > | With Patchset + sched_mc_power_savings = 1 |
^^^^
Without
> > |----------------------------------------------------------------------------|
> > |make -j2 options| time taken | LOC timer interrupts | LOC timer interrupts|
> > | | | on Package 0 | on Package 1 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 0,2 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 228.786s | 59094 | 61994 | 13984 | 43652 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 1827 | 734 | 748 | 760 |
> > |----------------------------------------------------------------------------|
> > |taskset -c 1,3 | | CPU0 | CPU2 | CPU1 | CPU3 |
> > | | 228.435s | 57013 | 876 | 58596 | 61633 |
> > | | |----------------------------------------------|
> > | | | CPU4 | CPU6 | CPU5 | CPU7 |
> > | | | 772 | 1133 | 850 | 910 |
> > |----------------------------------------------------------------------------|
> >
> > Here, we see that the idle load balancer is chosen from the other package,
> > despite choosing sched_mc_power_savings = 1. In the first case, we have
> > CPU1 and CPU3 sharing the responsibility among themselves. In the second case,
> > it's CPU0 and CPU6, which assume that role.
>
> Both tables above claim to be _with_ the pathes :-), from the
> accompanying text one can deduce its the bottom one that is without.
Sorry, copy pasted the 2nd table from the first, and updated only the
values.
>
> Patches look straight-forward enough, seems good stuff.
Thanks for the review!
>
> Thanks!
--
Thanks and Regards
gautham
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB
2009-04-14 4:55 [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Gautham R Shenoy
` (2 preceding siblings ...)
2009-04-14 9:48 ` [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB Peter Zijlstra
@ 2009-04-22 1:05 ` Suresh Siddha
2009-04-26 16:56 ` Vaidyanathan Srinivasan
3 siblings, 1 reply; 10+ messages in thread
From: Suresh Siddha @ 2009-04-22 1:05 UTC (permalink / raw)
To: Gautham R Shenoy
Cc: Ingo Molnar, Peter Zijlstra, Vaidyanathan Srinivasan,
linux-kernel@vger.kernel.org, Balbir Singh, Andi Kleen,
Randy Dunlap
On Mon, 2009-04-13 at 21:55 -0700, Gautham R Shenoy wrote:
> Now, the other power-savings settings such as the sched_mc/smt_power_savings
> and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> the system topology into consideration, with the intention of keeping
> as many "power-domains" (cores/packages) in the low-power state.
>
> The current idle-load-balancer nomination does not necessarily align towards
> this policy. For eg, we could be having tasks and interrupts largely running
> on the first package with the intention of keeping the second package idle.
> Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> the 2nd package, would in turn prevent the 2nd package from going into a
> deeper sleep state.
>
> Instead the role of the idle-load balancer could have been assumed by an
> idle CPU from the first package, thereby helping the second package go
> completely idle.
Can we also do this by default? i.e., even when no power-savings policy
is selected.
I don't see anything wrong by enabling this logic for all the cases.
thanks,
suresh
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [RFC PATCH v2 0/2] sched: Nominate a power-efficient ILB
2009-04-22 1:05 ` Suresh Siddha
@ 2009-04-26 16:56 ` Vaidyanathan Srinivasan
0 siblings, 0 replies; 10+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-04-26 16:56 UTC (permalink / raw)
To: Suresh Siddha
Cc: Gautham R Shenoy, Ingo Molnar, Peter Zijlstra,
linux-kernel@vger.kernel.org, Balbir Singh, Andi Kleen,
Randy Dunlap
* Suresh Siddha <suresh.b.siddha@intel.com> [2009-04-21 18:05:22]:
> On Mon, 2009-04-13 at 21:55 -0700, Gautham R Shenoy wrote:
> > Now, the other power-savings settings such as the sched_mc/smt_power_savings
> > and the power-aware IRQ balancer try to balance tasks/IRQs by taking
> > the system topology into consideration, with the intention of keeping
> > as many "power-domains" (cores/packages) in the low-power state.
> >
> > The current idle-load-balancer nomination does not necessarily align towards
> > this policy. For eg, we could be having tasks and interrupts largely running
> > on the first package with the intention of keeping the second package idle.
> > Hence, CPU 0 may be busy. The first_cpu in the nohz.cpu_mask happens to be CPU1,
> > which in-turn becomes nominated as the idle-load balancer. CPU1 being from
> > the 2nd package, would in turn prevent the 2nd package from going into a
> > deeper sleep state.
> >
> > Instead the role of the idle-load balancer could have been assumed by an
> > idle CPU from the first package, thereby helping the second package go
> > completely idle.
>
> Can we also do this by default? i.e., even when no power-savings policy
> is selected.
>
> I don't see anything wrong by enabling this logic for all the cases.
Hi Suresh,
Thanks for the review. The selection of the idle-load-balancer
depends on the sched domain level with SD_POWERSAVINGS_BALANCE flag
set that indicates the cpu-package boundary. Hence this scheme
depends on sched_mc_powersavings being non zero so that the sched
domains (cpu level) are build based on cpu-package information.
The logical-cpu to cpu-package grouping information is not readily
available (in the sched domain) when power-saving policies are off.
We may need to look at selection of idle-load-balancer for maximum
performance (bandwidth utilisation) when power save policies are off.
--Vaidy
^ permalink raw reply [flat|nested] 10+ messages in thread