* [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct
2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
@ 2009-05-13 13:11 ` Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks Vaidyanathan Srinivasan
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:11 UTC (permalink / raw)
To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
Peter Zijlstra, Arjan van de Ven
Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Andi Kleen, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj, Vaidyanathan Srinivasan
Add a new sysfs variable that user space can use to pass
the number of cores to evacuate or force idle.
/sys/devices/system/cpu/sched_max_capacity_pct defaults to 100.
This is a percentage value that can be used to force cores idle.
The percentage shall be in steps corresponding to the number
of cores in the system.
On an 8-core system (dual socket, quad core), each core step will
be 12.5%, rounded down to 12%.
Echoing 88 will use 7 cores in the system:
  %    No of cores
100    8
 87    7
 75    6
 62    5
 50    4
...
This patch will evacuate only one package (50%) in this case.
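As a rough illustration of the table above, the following user-space sketch
mirrors the integer arithmetic used by the show and store handlers in the
diff that follows. The function names pct_to_evacuate_cores() and
evacuate_cores_to_pct() and the nr_cpus parameter are made up for this
example (the patch itself works on sched_evacuate_cores and nr_cpu_ids
inside the kernel). It is not kernel code, only a way to check the rounding
by hand.

```c
/*
 * User-space sketch only; names are hypothetical and chosen for
 * illustration. The expressions follow the ones in the posted patch.
 */

/* Store path: user-supplied percentage -> number of cores to evacuate. */
int pct_to_evacuate_cores(int capacity_pct, int nr_cpus)
{
	return (101 - capacity_pct) * nr_cpus / 100;
}

/* Show path: number of evacuated cores -> reported capacity percentage. */
int evacuate_cores_to_pct(int evacuate_cores, int nr_cpus)
{
	return 100 - evacuate_cores * 100 / nr_cpus;
}
```

With nr_cpus = 8, writing either 88 or 87 maps to one evacuated core (7 cores
in use), and reading the value back reports 88 because 100 / 8 truncates
to 12. Writing 62 maps to three evacuated cores and reads back as 63.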
*** This is an RFC patch for discussion ***
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 37 +++++++++++++++++++++++++++++++++++++
1 files changed, 37 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..f22b9f6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3291,6 +3291,9 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+
+int sched_evacuate_cores; /* No of forced-idle cores */
+
/**
* init_sd_power_savings_stats - Initialize power savings statistics for
* the given sched_domain, during load balancing.
@@ -8604,6 +8607,37 @@ static ssize_t sched_mc_power_savings_store(struct sysdev_class *class,
static SYSDEV_CLASS_ATTR(sched_mc_power_savings, 0644,
sched_mc_power_savings_show,
sched_mc_power_savings_store);
+
+static ssize_t sched_max_capacity_pct_show(struct sysdev_class *class,
+					   char *page)
+{
+	int capacity;
+	/* Convert no of cores to system capacity percentage */
+	/* FIXME: Will work only for non-threaded systems */
+	capacity = 100 - sched_evacuate_cores * 100 / nr_cpu_ids;
+	return sprintf(page, "%d\n", capacity);
+}
+static ssize_t sched_max_capacity_pct_store(struct sysdev_class *class,
+					    const char *buf, size_t count)
+{
+	int capacity;
+	if (sscanf(buf, "%d", &capacity) != 1)
+		return -EINVAL;
+
+	if (capacity < 1 || capacity > 100)
+		return -EINVAL;
+
+	/* Convert user provided percentage into no-of-cores to evacuate */
+	/* FIXME: Will work only for non-threaded systems */
+	sched_evacuate_cores = (101 - capacity) * nr_cpu_ids / 100;
+	return count;
+}
+
+
+static SYSDEV_CLASS_ATTR(sched_max_capacity_pct, 0644,
+ sched_max_capacity_pct_show,
+ sched_max_capacity_pct_store);
+
#endif
#ifdef CONFIG_SCHED_SMT
@@ -8635,6 +8669,9 @@ int __init sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
if (!err && mc_capable())
err = sysfs_create_file(&cls->kset.kobj,
&attr_sched_mc_power_savings.attr);
+ if (!err)
+ err = sysfs_create_file(&cls->kset.kobj,
+ &attr_sched_max_capacity_pct.attr);
#endif
return err;
}
* [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks
2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
@ 2009-05-13 13:11 ` Vaidyanathan Srinivasan
2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
3 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:11 UTC (permalink / raw)
To: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
Peter Zijlstra, Arjan van de Ven
Cc: Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Andi Kleen, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj, Vaidyanathan Srinivasan
Pack more tasks in a group so as to reduce the number of CPUs
used to run the work in the system.
Just for load balancing purposes, assume the group capacity
has been increased by group_capacity_bump().
Hacks:
o Make non-idle cpus also perform powersave balance so
that we can pull more tasks into the group
o Increase group capacity for calculation
o Increase load-balancing threshold so that even if a
group is overloaded by group_capacity_bump(), consider
it balanced
Basically if we want to evacuate 2 cores, the group capacity
is increased by 2 (*SCHED_LOAD_SCALE) and the power save
balancer will accommodate the tasks after selecting the group
leader.
This will not work if the system is overloaded. Even
after pulling 2 extra tasks, there could be tasks to fill the
other package. At this point we are not yet reducing the
group capacity of the other group.
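To make the raised overload limit described above more concrete, here is a
small user-space sketch of the arithmetic. It assumes SCHED_LOAD_SCALE is
1024 and that a quad-core package has a __cpu_power of 4 * SCHED_LOAD_SCALE;
both values, the helper names and the parameter names are assumptions for
illustration only. overload_limit_as_written() follows the operand order of
the check added to find_busiest_group() in the diff that follows, where the
division happens before the multiplication and therefore truncates.
overload_limit_scaled_first() is a hypothetical variant that multiplies
first; it is shown only for comparison and is not part of the patch.

```c
/* Assumed value of the scheduler's fixed-point load unit. */
#define SCHED_LOAD_SCALE 1024UL

/* Same operand order as the check in the patch: divide, then scale. */
unsigned long overload_limit_as_written(unsigned long cpu_power,
					unsigned long bump)
{
	return (cpu_power + bump * SCHED_LOAD_SCALE) / cpu_power *
		SCHED_LOAD_SCALE;
}

/* Hypothetical variant: scale first so the fraction is not truncated. */
unsigned long overload_limit_scaled_first(unsigned long cpu_power,
					  unsigned long bump)
{
	return (cpu_power + bump * SCHED_LOAD_SCALE) * SCHED_LOAD_SCALE /
		cpu_power;
}
```

For cpu_power = 4096 and a bump of 2, the first helper yields 1024 (the same
as with no bump) while the second yields 1536; for a bump of 4 both yield
2048. Whether the truncation is intended is not stated in the posting.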
*** RFC patch for discussion ***
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
---
kernel/sched.c | 28 +++++++++++++++++++++++++++-
1 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index f22b9f6..186b0ec 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3234,6 +3234,7 @@ struct sd_lb_stats {
int group_imb; /* Is there imbalance in this sd */
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
int power_savings_balance; /* Is powersave balance needed for this sd */
+ unsigned int group_capacity_bump; /* % increase in group capacity */
struct sched_group *group_min; /* Least loaded group in sd */
struct sched_group *group_leader; /* Group which relieves group_min */
unsigned long min_load_per_task; /* load_per_task in group_min */
@@ -3294,6 +3295,15 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
int sched_evacuate_cores; /* No of forced-idle cores */
+static inline unsigned int group_capacity_bump(struct sched_domain *sd)
+{
+
+	if (sd->flags & SD_POWERSAVINGS_BALANCE)
+		return sched_evacuate_cores;
+
+	return 0;
+}
+
/**
* init_sd_power_savings_stats - Initialize power savings statistics for
* the given sched_domain, during load balancing.
@@ -3309,12 +3319,14 @@ static inline void init_sd_power_savings_stats(struct sched_domain *sd,
* Busy processors will not participate in power savings
* balance.
*/
- if (idle == CPU_NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE))
+ if ((idle == CPU_NOT_IDLE && !sched_evacuate_cores) ||
+ !(sd->flags & SD_POWERSAVINGS_BALANCE))
sds->power_savings_balance = 0;
else {
sds->power_savings_balance = 1;
sds->min_nr_running = ULONG_MAX;
sds->leader_nr_running = 0;
+ sds->group_capacity_bump = group_capacity_bump(sd);
}
}
@@ -3436,6 +3448,12 @@ static inline int check_power_save_busiest_group(struct sd_lb_stats *sds,
{
return 0;
}
+
+static inline unsigned int group_capacity_bump(struct sched_domain *sd)
+{
+	return 0;
+}
+
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
@@ -3568,6 +3586,8 @@ static inline void update_sd_lb_stats(struct sched_domain *sd, int this_cpu,
if (local_group && balance && !(*balance))
return;
+ /* Bump up group capacity for forced packing of tasks */
+ sgs.group_capacity += sds->group_capacity_bump;
sds->total_load += sgs.group_load;
sds->total_pwr += group->__cpu_power;
@@ -3768,6 +3788,12 @@ find_busiest_group(struct sched_domain *sd, int this_cpu,
if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
goto out_balanced;
+ /* Push the upper limits for overload */
+ if (sds.max_load <= (sds.busiest->__cpu_power +
+ sds.group_capacity_bump * SCHED_LOAD_SCALE) /
+ sds.busiest->__cpu_power * SCHED_LOAD_SCALE)
+ goto out_balanced;
+
sds.busiest_load_per_task /= sds.busiest_nr_running;
if (sds.group_imb)
sds.busiest_load_per_task =
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 1/2] sched: add sched_max_capacity_pct Vaidyanathan Srinivasan
2009-05-13 13:11 ` [RFC PATCH v2 2/2] sched: loadbalancer hacks for forced packing of tasks Vaidyanathan Srinivasan
@ 2009-05-13 13:14 ` Peter Zijlstra
2009-05-13 13:42 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
2009-05-13 13:45 ` Balbir Singh
2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
3 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 13:14 UTC (permalink / raw)
To: Vaidyanathan Srinivasan
Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
Arjan van de Ven, Ingo Molnar, Dipankar Sarma, Balbir Singh,
Vatsa, Gautham R Shenoy, Andi Kleen, Gregory Haskins,
Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> * Peter Zijlstra wanted more justifications for throttling at the core
> level. Throttling may be a resource management problem rather than
> scheduler/load balancer
No, I mandate that it be thermal management. Any other reason and you've
got a NAK.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
@ 2009-05-13 13:42 ` Vaidyanathan Srinivasan
2009-05-13 13:45 ` Balbir Singh
1 sibling, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-13 13:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
Arjan van de Ven, Ingo Molnar, Dipankar Sarma, Balbir Singh,
Vatsa, Gautham R Shenoy, Andi Kleen, Gregory Haskins,
Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
> On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
>
> > * Peter Zijlstra wanted more justifications for throttling at the core
> > level. Throttling may be a resource management problem rather than
> > scheduler/load balancer
>
> No, I mandate that it be thermal management. Any other reason and you've
> got a NAK.
Hi Peter,
Yes, I understand your objection. You want throttling to be done for
the purpose of thermal management only. The primary purpose of
throttling should be thermal management (power savings may be
a side effect).
What I meant in the above comment was that throttling could be
implemented using a resource management framework such as
cpuset/cgroup, rather than by biasing the load balancer to avoid work
on a particular core.
I am open to ideas for a clean and easy framework for core level
throttling.
Thanks,
Vaidy
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
2009-05-13 13:42 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
@ 2009-05-13 13:45 ` Balbir Singh
2009-05-13 13:47 ` Peter Zijlstra
1 sibling, 1 reply; 22+ messages in thread
From: Balbir Singh @ 2009-05-13 13:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
> On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
>
> > * Peter Zijlstra wanted more justifications for throttling at the core
> > level. Throttling may be a resource management problem rather than
> > scheduler/load balancer
>
> No, I mandate that it be thermal management. Any other reason and you've
> got a NAK.
We've been discussing hard limits from the resource management
viewpoint. Bharata is working on an RFC that we should be able to share
soon.
--
Balbir
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:45 ` Balbir Singh
@ 2009-05-13 13:47 ` Peter Zijlstra
2009-05-13 14:42 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Balbir Singh
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 13:47 UTC (permalink / raw)
To: balbir
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, 2009-05-13 at 19:15 +0530, Balbir Singh wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
>
> > On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> >
> > > * Peter Zijlstra wanted more justifications for throttling at the core
> > > level. Throttling may be a resource management problem rather than
> > > scheduler/load balancer
> >
> > No, I mandate that it be thermal management. Any other reason and you've
> > got a NAK.
>
> We've been discussing hard limits from the resource management view
> point. Bharata is working on a RFC that we should be able to share
> soon
Right, and you well know I dislike them.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:47 ` Peter Zijlstra
@ 2009-05-13 14:42 ` Balbir Singh
0 siblings, 0 replies; 22+ messages in thread
From: Balbir Singh @ 2009-05-13 14:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Vatsa, Gautham R Shenoy, Andi Kleen,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:47:55]:
> On Wed, 2009-05-13 at 19:15 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2009-05-13 15:14:57]:
> >
> > > On Wed, 2009-05-13 at 18:41 +0530, Vaidyanathan Srinivasan wrote:
> > >
> > > > * Peter Zijlstra wanted more justifications for throttling at the core
> > > > level. Throttling may be a resource management problem rather than
> > > > scheduler/load balancer
> > >
> > > No, I mandate that it be thermal management. Any other reason and you've
> > > got a NAK.
> >
> > We've been discussing hard limits from the resource management view
> > point. Bharata is working on a RFC that we should be able to share
> > soon
>
> Right, and you well know I dislike them.
>
Yes, but we hope to convince you with use cases and examples.
--
Balbir
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 13:11 [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Vaidyanathan Srinivasan
` (2 preceding siblings ...)
2009-05-13 13:14 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Peter Zijlstra
@ 2009-05-13 14:35 ` Andi Kleen
2009-05-13 14:36 ` Peter Zijlstra
3 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 14:35 UTC (permalink / raw)
To: Vaidyanathan Srinivasan
Cc: Linux Kernel, Suresh B Siddha, Venkatesh Pallipadi,
Peter Zijlstra, Arjan van de Ven, Ingo Molnar, Dipankar Sarma,
Balbir Singh, Vatsa, Gautham R Shenoy, Andi Kleen,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> non-intuitive and broken interface. Ingo wanted to see if we can
> model a global percentile tunable that would map to core throttling.
I have one request. CPU throttling is already a very well established
term in the x86 world, referring to thermal throttling when the CPU
overheats. This is implemented by ACPI and the CPU. It's always
a very bad thing that should be avoided at all costs.
You seem to use it for something else. Can you please use a different
term for that? Reusing the same word for something else is confusing.
Thanks,
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 14:35 ` [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n Andi Kleen
@ 2009-05-13 14:36 ` Peter Zijlstra
2009-05-13 14:46 ` Andi Kleen
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 14:36 UTC (permalink / raw)
To: Andi Kleen
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> > non-intuitive and broken interface. Ingo wanted to see if we can
> > model a global percentile tunable that would map to core throttling.
>
> I have one request. CPU throttling is already a very well established
> term in the x86 world, referring to thermal throttling when the CPU
> overheats. This is implemented by ACPI and the CPU. It's always
> a very bad thing that should be avoided at all costs.
It's about avoiding that.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 14:36 ` Peter Zijlstra
@ 2009-05-13 14:46 ` Andi Kleen
2009-05-13 14:50 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 14:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj
On Wed, May 13, 2009 at 04:36:42PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> > On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> > > non-intuitive and broken interface. Ingo wanted to see if we can
> > > model a global percentile tunable that would map to core throttling.
> >
> > I have one request. CPU throttling is already a very well established
> > term in the x86 world, referring to thermal throttling when the CPU
> > overheats. This is implemented by ACPI and the CPU. It's always
> > a very bad thing that should be avoided at all costs.
>
> It's about avoiding that.
Hmm? Can you explain please? CPU throttling should only happen when your
cooling system is broken in some way.
It's not a power saving feature, just a "don't make CPU melt" feature.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 14:46 ` Andi Kleen
@ 2009-05-13 14:50 ` Peter Zijlstra
2009-05-13 15:01 ` Andi Kleen
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 14:50 UTC (permalink / raw)
To: Andi Kleen
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, 2009-05-13 at 16:46 +0200, Andi Kleen wrote:
> On Wed, May 13, 2009 at 04:36:42PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-05-13 at 16:35 +0200, Andi Kleen wrote:
> > > On Wed, May 13, 2009 at 06:41:00PM +0530, Vaidyanathan Srinivasan wrote:
> > > > * Using sched_mc=3,4,5 to evacuate 1,2,4 cores is completely
> > > > non-intuitive and broken interface. Ingo wanted to see if we can
> > > > model a global percentile tunable that would map to core throttling.
> > >
> > > I have one request. CPU throttling is already a very well established
> > > term in the x86 world, referring to thermal throttling when the CPU
> > > overheats. This is implemented by ACPI and the CPU. It's always
> > > a very bad thing that should be avoided at all costs.
> >
> > It's about avoiding that.
>
> Hmm? Can you explain please? CPU throttling should only happen when your
> cooling system is broken in some way.
>
> It's not a power saving feature, just a "don't make CPU melt" feature.
From what I've been told it's popular to over-commit the cooling capacity
in a rack, so that a number of servers can run at full thermal capacity
but not all.
I've also been told that hardware sucks at throttling, therefore people
want to fix the OS so as to limit the thermal capacity and avoid the
hardware throttle from kicking in, whilst still not exceeding the rack
capacity or similar nonsense.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 14:50 ` Peter Zijlstra
@ 2009-05-13 15:01 ` Andi Kleen
2009-05-13 15:02 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 15:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj
> From what I've been told it's popular to over-commit the cooling capacity
> in a rack, so that a number of servers can run at full thermal capacity
> but not all.
Yes. But in this case you don't want to use throttling, you want
to use p-states which actually save power unlike throttling.
> I've also been told that hardware sucks at throttling,
Throttling is not really something you should use in normal
operation, it's just an emergency measure. For that it works
quite well, but you really don't want it in normal operation.
> therefore people
> want to fix the OS so as to limit the thermal capacity and avoid the
> hardware throttle from kicking in, whilst still not exceeding the rack
> capacity or similar nonsense.
Yes that's fine and common, but you actually need to save power for this,
which throttling doesn't do.
My understanding is that this work is an extension of the existing
sched_mc_power_savings features that tries to be optionally more
aggressive to keep complete package idle so that package level
power saving kicks in.
I'm just requesting that they don't call that throttling.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 15:01 ` Andi Kleen
@ 2009-05-13 15:02 ` Peter Zijlstra
2009-05-13 15:10 ` Andi Kleen
2009-05-14 15:13 ` Vaidyanathan Srinivasan
2009-05-19 20:40 ` Pavel Machek
2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2009-05-13 15:02 UTC (permalink / raw)
To: Andi Kleen
Cc: Vaidyanathan Srinivasan, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
On Wed, 2009-05-13 at 17:01 +0200, Andi Kleen wrote:
> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
>
> Yes. But in this case you don't want to use throttling, you want
> to use p-states which actually save power unlike throttling.
>
> > I've also been told that hardware sucks at throttling,
>
> Throttling is not really something you should use in normal
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
>
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
>
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.
>
> My understanding is that this work is an extension of the existing
> sched_mc_power_savings features that tries to be optionally more
> aggressive to keep complete package idle so that package level
> power saving kicks in.
>
> I'm just requesting that they don't call that throttling.
Ah no, this work differs in that regard in that it actually 'generates'
idle time, instead of optimizing idle time.
Therefore it takes actual cpu time away from real work, which is
throttling. Granted, one could call it limiting or similar, but
throttling is a correct name.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 15:02 ` Peter Zijlstra
@ 2009-05-13 15:10 ` Andi Kleen
2009-05-14 14:58 ` Vaidyanathan Srinivasan
0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-13 15:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Vaidyanathan Srinivasan, Linux Kernel,
Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj
> > Yes that's fine and common, but you actually need to save power for this,
> > which throttling doesn't do.
> >
> > My understanding is that this work is an extension of the existing
> > sched_mc_power_savings features that tries to be optionally more
> > aggressive to keep complete package idle so that package level
> > power saving kicks in.
> >
> > I'm just requesting that they don't call that throttling.
>
> Ah no, this work differs in that regard in that it actually 'generates'
> idle time, instead of optimizing idle time.
That is what I meant by "more aggressive to keep complete packages idle"
above.
>
> Therefore it takes actual cpu time away from real work, which is
> throttling. Granted, one could call it limiting or similar, but
> throttling is a correct name.
That will always cause ongoing confusion with the existing established
term.
If you really need to call it throttling use "scheduler throttling"
or something like that, but a different word would be better.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 15:10 ` Andi Kleen
@ 2009-05-14 14:58 ` Vaidyanathan Srinivasan
2009-05-14 15:06 ` Andi Kleen
0 siblings, 1 reply; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 14:58 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Andi Kleen <andi@firstfloor.org> [2009-05-13 17:10:54]:
> > > Yes that's fine and common, but you actually need to save power for this,
> > > which throttling doesn't do.
> > >
> > > My understanding is that this work is an extension of the existing
> > > sched_mc_power_savings features that tries to be optionally more
> > > aggressive to keep complete package idle so that package level
> > > power saving kicks in.
> > >
> > > I'm just requesting that they don't call that throttling.
> >
> > Ah no, this work differs in that regard in that it actually 'generates'
> > idle time, instead of optimizing idle time.
>
> That is what I meant by "more aggressive to keep complete packages idle"
> above.
Hi Andi,
There is a difference in the framework, as Peter has mentioned: we are
trying to create idle time by forcefully reducing work. From an
end-user point of view, this can be seen as a logical extension of
sched_mc_power_savings, and v1 of the RFC extended that framework.
However, Ingo suggested that the knob is not intuitive, and hence I have
tried to switch to a percentage knob, sched_max_capacity_pct.
I am interested in an easy, simple and intuitive framework to evacuate
cores, which may imply forcefully reducing (throttling) work.
> > Therefore it takes actual cpu time away from real work, which is
> > throttling. Granted, one could call it limiting or similar, but
> > throttling is a correct name.
>
> That will always cause ongoing confusion with the existing established
> term.
>
> If you really need to call it throttling use "scheduler throttling"
> or something like that, but a different word would be better.
I think 'scheduler throttling' is good so that we avoid the term 'CPU
throttling' or core throttling. I had named this cpu evacuation or
core evacuation just to avoid confusion with hardware throttling.
--Vaidy
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-14 14:58 ` Vaidyanathan Srinivasan
@ 2009-05-14 15:06 ` Andi Kleen
2009-05-14 15:43 ` Vaidyanathan Srinivasan
0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2009-05-14 15:06 UTC (permalink / raw)
To: Vaidyanathan Srinivasan
Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
> I think 'scheduler throttling' is good so that we avoid the term 'CPU
> throttling' or core throttling. I had named this cpu evacuation or
> core evacuation just to avoid confusion with hardware throttling.
Evacuation sounds good, although shouldn't it be package or
socket evacuation?
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-14 15:06 ` Andi Kleen
@ 2009-05-14 15:43 ` Vaidyanathan Srinivasan
0 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 15:43 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Andi Kleen <andi@firstfloor.org> [2009-05-14 17:06:32]:
> > I think 'scheduler throttling' is good so that we avoid the term 'CPU
> > throttling' or core throttling. I had named this cpu evacuation or
> > core evacuation just to avoid confusion with hardware throttling.
>
> Evacuation sounds good, although shouldn't it be package or
> socket evacuation?
Let's start with 'core evacuation' since that seems to be the lowest
granularity on a threaded system. We certainly want the framework to
provide socket/package and node evacuation on much larger systems.
--Vaidy
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 15:01 ` Andi Kleen
2009-05-13 15:02 ` Peter Zijlstra
@ 2009-05-14 15:13 ` Vaidyanathan Srinivasan
2009-05-19 20:40 ` Pavel Machek
2 siblings, 0 replies; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-14 15:13 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Andi Kleen <andi@firstfloor.org> [2009-05-13 17:01:00]:
> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
>
> Yes. But in this case you don't want to use throttling, you want
> to use p-states which actually save power unlike throttling.
One of the design points for this discussion is to bring C-States
into the equation. As you mentioned, today we can effectively use
P-States to reduce core frequency and thereby reduce average power
and heat. With the introduction of very low power deep sleep states
in the processor, C-States can provide substantial power savings
beyond P-State based methods alone. Forcibly idling cores lets the
system exploit C-States and their power saving benefits.
As mentioned earlier, cpu throttling as it exists today should not
be used in normal operating conditions. However, by using P-States
and C-States as two control variables, the system can be made to
operate at various power (thermal) and performance points.
>
> > I've also been told that hardware sucks at throttling,
>
> Throttling is not really something you should use in normal
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
>
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
>
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.
Reducing work, or scheduling it smartly in the OS, can save
considerably more power than throttling in hardware to reduce power
or heat.
> My understanding is that this work is an extension of the existing
> sched_mc_power_savings features that tries to be optionally more
> aggressive to keep complete packages idle so that package level
> power saving kicks in.
Scheduling work smartly (power efficiently) is part of the
sched_mc_power_savings framework, while this RFC/discussion is around
reducing work or forcing idle times but at a granularity of
cores/packages to provide maximum power/thermal benefits.
--Vaidy
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-13 15:01 ` Andi Kleen
2009-05-13 15:02 ` Peter Zijlstra
2009-05-14 15:13 ` Vaidyanathan Srinivasan
@ 2009-05-19 20:40 ` Pavel Machek
2009-05-22 9:14 ` Vaidyanathan Srinivasan
2 siblings, 1 reply; 22+ messages in thread
From: Pavel Machek @ 2009-05-19 20:40 UTC (permalink / raw)
To: Andi Kleen
Cc: Peter Zijlstra, Vaidyanathan Srinivasan, Linux Kernel,
Suresh B Siddha, Venkatesh Pallipadi, Arjan van de Ven,
Ingo Molnar, Dipankar Sarma, Balbir Singh, Vatsa,
Gautham R Shenoy, Gregory Haskins, Mike Galbraith,
Thomas Gleixner, Arun Bharadwaj
On Wed 2009-05-13 17:01:00, Andi Kleen wrote:
> > From what I've been told it's popular to over-commit the cooling capacity
> > in a rack, so that a number of servers can run at full thermal capacity
> > but not all.
>
> Yes. But in this case you don't want to use throttling, you want
> to use p-states which actually save power unlike throttling.
>
> > I've also been told that hardware sucks at throttling,
>
> Throttling is not really something you should use in normal
> operation, it's just an emergency measure. For that it works
> quite well, but you really don't want it in normal operation.
>
> > therefore people
> > want to fix the OS so as to limit the thermal capacity and avoid the
> > hardware throttle from kicking in, whilst still not exceeding the rack
> > capacity or similar nonsense.
>
> Yes that's fine and common, but you actually need to save power for this,
> which throttling doesn't do.
Actually throttling will lower power consumption at any given moment
(not power consumption for any given task!) and will keep your rack
from melting.
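The distinction between power at a given moment and energy per task
can be made explicit with a simple duty-cycle model. This is only an
illustrative sketch: the symbols d, P_run, P_stop and T are assumptions
introduced here, not measurements or terms used elsewhere in this thread.

```latex
% d      : fraction of time the core is allowed to run, 0 < d <= 1
% P_run  : power drawn while running
% P_stop : power drawn while stopped by throttling, 0 < P_stop < P_run
% T      : run time of the task without throttling
\begin{align*}
P_{\mathrm{avg}}(d) &= d\,P_{\mathrm{run}} + (1-d)\,P_{\mathrm{stop}}
                     \le P_{\mathrm{run}} \\
E(d) &= \frac{T}{d}\,P_{\mathrm{avg}}(d)
      = T\,P_{\mathrm{run}} + T\,\frac{1-d}{d}\,P_{\mathrm{stop}}
      \ge T\,P_{\mathrm{run}}
\end{align*}
```

Under these assumptions, lowering d lowers the average draw, which is
what protects the rack, while the energy spent per task does not go down.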
But I don't see why it is necessary to evacuate cores for this. Why
not just schedule a special task that enters C3 instead of computing?
That was what I planned to do on athlon 900 (1 core) with broken
fan...
For what you are doing, cpu hotplug seems more suitable. Can you
enhance it so that it is fast enough for you?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-19 20:40 ` Pavel Machek
@ 2009-05-22 9:14 ` Vaidyanathan Srinivasan
2009-05-28 20:36 ` Pavel Machek
0 siblings, 1 reply; 22+ messages in thread
From: Vaidyanathan Srinivasan @ 2009-05-22 9:14 UTC (permalink / raw)
To: Pavel Machek
Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
* Pavel Machek <pavel@ucw.cz> [2009-05-19 22:40:15]:
> On Wed 2009-05-13 17:01:00, Andi Kleen wrote:
> > > From what I've been told it's popular to over-commit the cooling capacity
> > > in a rack, so that a number of servers can run at full thermal capacity
> > > but not all.
> >
> > Yes. But in this case you don't want to use throttling, you want
> > to use p-states which actually save power unlike throttling.
> >
> > > I've also been told that hardware sucks at throttling,
> >
> > Throttling is not really something you should use in normal
> > operation, it's just an emergency measure. For that it works
> > quite well, but you really don't want it in normal operation.
> >
> > > therefore people
> > > want to fix the OS so as to limit the thermal capacity and avoid the
> > > hardware throttle from kicking in, whilst still not exceeding the rack
> > > capacity or similar nonsense.
> >
> > Yes that's fine and common, but you actually need to save power for this,
> > which throttling doesn't do.
>
> Actually throttling will lower power consumption at any given moment
> (not power consumption for any given task!) and will keep your rack
> from melting.
Yes, we want to reduce overall power consumption.
> But I don't see why it is necessary to evacuate cores for this. Why
> not just schedule a special task that enters C3 instead of computing?
This is essentially what happens in the load balancer approach. Not
scheduling on a particular core will run the scheduler's idle task,
which will transition the core to the lowest power state. Pinning a
user space task and using a special driver to hold the core in C3 state
will break scheduling fairness. In that case the application decides
when to give the core back to the scheduler.
> That was what I planned to do on athlon 900 (1 core) with broken
> fan...
>
> For what you are doing, cpu hotplug seems more suitable. Can you
> enhance it so that it is fast enough for you?
Yes, the cpu hotplug framework can be used. That is definitely an
alternative to this approach. However, in the case of cpu hotplug, the
evacuation is directed at a particular core, which may affect user
space affinity and cpusets. With this approach we can limit the overall
system capacity, for example run at most 7 cores at a time in an 8 core
system, without needing to care which particular core is
'forced to idle' at any given point in time.
Further discussion regarding this can be found in the following
thread: http://lkml.org/lkml/2009/5/19/54
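To make the "7 cores out of 8" arithmetic concrete, here is a minimal
sketch of how a capacity percentage could map to a core count. The
function names and the round-to-nearest rule are assumptions, chosen
only because they reproduce the table in the patch 1/2 description
(100 -> 8, 87 -> 7, 75 -> 6, 62 -> 5, 50 -> 4, and 88 -> 7); the patch
itself may compute this differently.

```c
#include <assert.h>

/*
 * Hypothetical helper, not taken from the patch: number of cores left
 * in use for a given sched_max_capacity_pct value, rounded to the
 * nearest whole core (an assumed rounding rule).
 */
unsigned int active_cores(unsigned int pct, unsigned int ncores)
{
	assert(pct <= 100 && ncores > 0);
	return (pct * ncores + 50) / 100;
}

/*
 * Hypothetical helper: number of cores to keep idle. It deliberately
 * returns only a count; which cores end up idle is left open.
 */
unsigned int evacuated_cores(unsigned int pct, unsigned int ncores)
{
	return ncores - active_cores(pct, ncores);
}
```

Only a count comes out of this; nothing names a specific cpu, which is
the difference from a hotplug request for a particular core.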
--Vaidy
* Re: [RFC PATCH v2 0/2] Saving power by cpu evacuation sched_max_capacity_pct=n
2009-05-22 9:14 ` Vaidyanathan Srinivasan
@ 2009-05-28 20:36 ` Pavel Machek
0 siblings, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2009-05-28 20:36 UTC (permalink / raw)
To: Vaidyanathan Srinivasan
Cc: Andi Kleen, Peter Zijlstra, Linux Kernel, Suresh B Siddha,
Venkatesh Pallipadi, Arjan van de Ven, Ingo Molnar,
Dipankar Sarma, Balbir Singh, Vatsa, Gautham R Shenoy,
Gregory Haskins, Mike Galbraith, Thomas Gleixner, Arun Bharadwaj
Hi!
> > But I don't see why it is necessary to evacuate cores for this. Why
> > not just schedule a special task that enters C3 instead of computing?
>
> This is essentially what happens in the load balancer approach. Not
> scheduling on a particular core will run the scheduler's idle task,
> which will transition the core to the lowest power state. Pinning a
> user space task and using a special driver to hold the core in C3 state
> will break scheduling fairness. In that case the application decides
> when to give the core back to the scheduler.
Why would it break scheduling fairness? You just schedule a "realtime"
task that does C3 instead of computation. The behaviour is very
similar to a "normal" realtime task. Why would it break the scheduler?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html