* [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
@ 2010-10-19 18:36 Trinabh Gupta
2010-10-19 18:38 ` Arjan van de Ven
0 siblings, 1 reply; 20+ messages in thread
From: Trinabh Gupta @ 2010-10-19 18:36 UTC (permalink / raw)
To: peterz, arjan, lenb, suresh.b.siddha, benh, venki; +Cc: linux-kernel, g.trinabh
The core of the kernel's idle routine on x86 presently depends on an
exported pm_idle function pointer that is unmanaged and causing
hazard to various subsystems when they save and restore it.
The first problem is that this exported pointer can be modified/flipped
by any subsystem. There is no tracking or notification mechanism.
Secondly and more importantly, various subsystems save the value of
this pointer, flip it and later restore to the saved value. There is
no guarantee that the saved value is still valid. The problem has
been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
and implementing a list based registration [1].
This patch is an initial RFC implementation for x86 architecture
only. This framework can be generalised for other archs and also
include the current cpuidle framework for managing multiple idle
routines.
Tests done with the patch:
------------------------
1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
state and current_idle was selected to be mwait_idle.
2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
deepest sleep state. The current_idle was selected to be
cpuidle_idle_call which is the cpuidle subsystem that will further
select idle routines from {C1,C2,C3}.
Future implementation will try to eliminate this hirearchy and have
a single registration and menu/idle cpuidle governor for selection
of idle routine.
Goals:
-----
1. Registration and unregistration for idle routines from kernel and
modules
2. Single registration mechanism for single idle routine (ex C1E) and
multi-idle routines (ex C1, C2,, C3)
3. Use governors from current cpuidle framework/module if multi-idle
routine is registered and we need to pick the best at each idle
entry.
4. Minimal overhead for arch with following use cases
a) Singe compile time defined idle routine, no need for
runtime/boottime selection
b) Single idle routine, but selectable during boot/runtime.
No need for cpuidle governors
c) Runtime selection of single or multi-idle routines and
demand loading/usage of governors to select one among the
set of idle routines. (Current x86 model)
5. Allow per-cpu registration for platforms that could have asymmetric
C-States. Use global registration for platforms that do not
need that support.
6. Port the framework for other new archs like POWER
To Do:
------
1. Move this registration mechanism to a place where all architectures
E.g. Arm, pSeries can use it. Current implementation is for x86
architecture only.
2. Make ACPI, Intel_Idle register C-states directly to this instead
of CPUIdle. CPUIdle should just select the idle routine for each
device (CPU) using this list. struct cpu_idle_routine may have cpumask,
power, latency etc. to facilitate this selection and per cpu registration.
Selection of appropriate idle routine may use these values.
3. Add config option to build this into the kernel. This would ensure
that there is no overhead for architectures that have a single idle
routine. They do not need to register any idle routine and should
call default_idle() directly.
References:
----------
1. Peter Zijlstra's suggestion of list based implementation
http://lkml.org/lkml/2009/8/27/159
2. Problems caused by save/restore of exported pm_idle
http://lkml.org/lkml/2009/8/28/43
http://lkml.org/lkml/2009/8/28/50
Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com>
---
arch/x86/kernel/process.c | 165 +++++++++++++++++++++++++++++++++++--
arch/x86/kernel/process_32.c | 3 -
arch/x86/kernel/process_64.c | 3 -
arch/x86/xen/setup.c | 9 ++
drivers/cpuidle/cpuidle.c | 31 +++++--
include/linux/cpu_idle_routines.h | 37 ++++++++
6 files changed, 224 insertions(+), 24 deletions(-)
create mode 100644 include/linux/cpu_idle_routines.h
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..2242b5a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -14,6 +14,7 @@
#include <linux/utsname.h>
#include <trace/events/power.h>
#include <linux/hw_breakpoint.h>
+#include <linux/cpu_idle_routines.h>
#include <asm/system.h>
#include <asm/apic.h>
#include <asm/syscalls.h>
@@ -332,10 +333,128 @@ unsigned long boot_option_idle_override = 0;
EXPORT_SYMBOL(boot_option_idle_override);
/*
- * Powermanagement idle function, if any..
+ * Idle Routines Management Framework
*/
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
+LIST_HEAD(registered_cpu_idle_routines);
+DEFINE_MUTEX(cpu_idle_routines_lock);
+static struct cpu_idle_routine *current_cpu_idle_routine;
+static void (*current_idle) (void);
+
+/*
+ * Calls current idle routine
+ */
+void __cpu_idle(void)
+{
+ if (current_idle)
+ current_idle();
+ else
+#if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
+ default_idle();
+#else
+ local_irq_enable();
+#endif
+ return;
+}
+
+/*
+ * get_current_cpu_idle_routine - Returns current idle routine
+ */
+struct cpu_idle_routine *get_current_cpu_idle_routine(void)
+{
+ return current_cpu_idle_routine;
+}
+EXPORT_SYMBOL_GPL(get_current_cpu_idle_routine);
+
+/*
+ * select_cpu_idle_routine - Returns best idle routine based on priority
+ * Should be called only after taking cpu_idle_routines_lock.
+ */
+static struct cpu_idle_routine *select_cpu_idle_routine(void)
+{
+ struct cpu_idle_routine *item = NULL, *next_routine = NULL;
+ unsigned int min_priority = CPU_IDLE_MAX_PRIORITY;
+
+ list_for_each_entry(item, ®istered_cpu_idle_routines,
+ cpu_idle_routine_list) {
+ if (item->priority <= min_priority) {
+ next_routine = item;
+ min_priority = item->priority;
+ }
+ }
+ return next_routine;
+}
+
+/*
+ * set_current_cpu_idle_routine - Changes the current idle routine
+ * @idle_routine: The new idle routine
+ * Should be called only after taking cpu_idle_routines_lock.
+ */
+static void set_current_cpu_idle_routine(struct cpu_idle_routine *idle_routine)
+{
+ if (idle_routine == current_cpu_idle_routine)
+ return;
+ current_cpu_idle_routine = idle_routine;
+ if (idle_routine)
+ current_idle = idle_routine->routine;
+ else
+ current_idle = NULL;
+ /* Call cpu_idle_wait() to change pointer to current idle routine.
+ * cpu_idle_wait() may require redesign in future as interrupt
+ * may not be complete exit from idle.
+ */
+ cpu_idle_wait();
+}
+
+/*
+ * __register_cpu_idle_routine - internal/helper function for registration.
+ * @idle_routine: Routine to be registered
+ *
+ * Should be called only after taking cpu_idle_routines_lock
+ */
+static int __register_cpu_idle_routine(struct cpu_idle_routine *idle_routine)
+{
+ if (!idle_routine || !idle_routine->routine)
+ return -EINVAL;
+ if (idle_routine->priority > CPU_IDLE_MAX_PRIORITY)
+ return -EINVAL;
+
+ list_add(&idle_routine->cpu_idle_routine_list,
+ ®istered_cpu_idle_routines);
+
+ return 0;
+}
+
+/* register_cpu_idle_routine - Registers a cpu idle routine
+ * @routine: Routine to be registered
+ */
+int register_cpu_idle_routine(struct cpu_idle_routine *idle_routine)
+{
+ int ret;
+ mutex_lock(&cpu_idle_routines_lock);
+ ret = __register_cpu_idle_routine(idle_routine);
+
+ if (ret) {
+ mutex_unlock(&cpu_idle_routines_lock);
+ return ret;
+ }
+
+ set_current_cpu_idle_routine(select_cpu_idle_routine());
+ mutex_unlock(&cpu_idle_routines_lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(register_cpu_idle_routine);
+
+/* unregister_cpu_idle_routine - Unregisters a registered cpu idle routine
+ * @routine: Routine to be unregistered
+ */
+void unregister_cpu_idle_routine(struct cpu_idle_routine *idle_routine)
+{
+ mutex_lock(&cpu_idle_routines_lock);
+ list_del(&idle_routine->cpu_idle_routine_list);
+ set_current_cpu_idle_routine(select_cpu_idle_routine());
+ mutex_unlock(&cpu_idle_routines_lock);
+}
+EXPORT_SYMBOL_GPL(unregister_cpu_idle_routine);
#ifdef CONFIG_X86_32
/*
@@ -591,13 +710,20 @@ static void c1e_idle(void)
void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
{
+ struct cpu_idle_routine *idle_routine;
#ifdef CONFIG_SMP
- if (pm_idle == poll_idle && smp_num_siblings > 1) {
+ if (get_current_cpu_idle_routine() &&
+ get_current_cpu_idle_routine()->routine == poll_idle
+ && smp_num_siblings > 1) {
printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
" performance may degrade.\n");
}
#endif
- if (pm_idle)
+ if (get_current_cpu_idle_routine())
+ return;
+
+ idle_routine = kmalloc(sizeof(struct cpu_idle_routine), GFP_KERNEL);
+ if (!idle_routine)
return;
if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
@@ -605,30 +731,41 @@ void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
* One CPU supports mwait => All CPUs supports mwait
*/
printk(KERN_INFO "using mwait in idle threads.\n");
- pm_idle = mwait_idle;
+ init_cpu_idle_routine(idle_routine, "mwait_idle",
+ mwait_idle, 2);
+
} else if (cpu_has_amd_erratum(amd_erratum_400)) {
/* E400: APIC timer interrupt does not wake up CPU from C1e */
printk(KERN_INFO "using C1E aware idle routine\n");
- pm_idle = c1e_idle;
+ init_cpu_idle_routine(idle_routine, "c1e_idle", c1e_idle, 2);
} else
- pm_idle = default_idle;
+ init_cpu_idle_routine(idle_routine, "default_idle",
+ default_idle, 2);
+
+ register_cpu_idle_routine(idle_routine);
}
void __init init_c1e_mask(void)
{
/* If we're using c1e_idle, we need to allocate c1e_mask. */
- if (pm_idle == c1e_idle)
+ if (get_current_cpu_idle_routine() &&
+ get_current_cpu_idle_routine()->routine == c1e_idle)
zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
}
static int __init idle_setup(char *str)
{
+ struct cpu_idle_routine *idle_routine;
if (!str)
return -EINVAL;
+ idle_routine = kmalloc(sizeof(struct cpu_idle_routine), GFP_KERNEL);
+ if (!idle_routine)
+ return -ENOMEM;
+
if (!strcmp(str, "poll")) {
printk("using polling idle threads.\n");
- pm_idle = poll_idle;
+ init_cpu_idle_routine(idle_routine, "poll_idle", poll_idle, 0);
} else if (!strcmp(str, "mwait"))
force_mwait = 1;
else if (!strcmp(str, "halt")) {
@@ -639,7 +776,11 @@ static int __init idle_setup(char *str)
* To continue to load the CPU idle driver, don't touch
* the boot_option_idle_override.
*/
- pm_idle = default_idle;
+
+ init_cpu_idle_routine(idle_routine, "default_idle",
+ default_idle, 0);
+ register_cpu_idle_routine(idle_routine);
+
idle_halt = 1;
return 0;
} else if (!strcmp(str, "nomwait")) {
@@ -654,6 +795,8 @@ static int __init idle_setup(char *str)
} else
return -1;
+ register_cpu_idle_routine(idle_routine);
+
boot_option_idle_override = 1;
return 0;
}
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..7f3d23c 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -38,6 +38,7 @@
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/kdebug.h>
+#include <linux/cpu_idle_routines.h>
#include <asm/pgtable.h>
#include <asm/system.h>
@@ -111,7 +112,7 @@ void cpu_idle(void)
local_irq_disable();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ __cpu_idle();
start_critical_timings();
trace_power_end(smp_processor_id());
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3d9ea53..c323a1b 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -37,6 +37,7 @@
#include <linux/uaccess.h>
#include <linux/io.h>
#include <linux/ftrace.h>
+#include <linux/cpu_idle_routines.h>
#include <asm/pgtable.h>
#include <asm/system.h>
@@ -138,7 +139,7 @@ void cpu_idle(void)
enter_idle();
/* Don't trace irqs off for idle */
stop_critical_timings();
- pm_idle();
+ __cpu_idle();
start_critical_timings();
trace_power_end(smp_processor_id());
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index 328b003..4015ce8 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -8,6 +8,7 @@
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/pm.h>
+#include <linux/cpu_idle_routines.h>
#include <asm/elf.h>
#include <asm/vdso.h>
@@ -224,6 +225,7 @@ void __cpuinit xen_enable_syscall(void)
void __init xen_arch_setup(void)
{
struct physdev_set_iopl set_iopl;
+ struct cpu_idle_routine *idle_routine;
int rc;
xen_panic_handler_init();
@@ -258,7 +260,12 @@ void __init xen_arch_setup(void)
MAX_GUEST_CMDLINE > COMMAND_LINE_SIZE ?
COMMAND_LINE_SIZE : MAX_GUEST_CMDLINE);
- pm_idle = xen_idle;
+ idle_routine = kmalloc(sizeof(struct cpu_idle_routine),
+ GFP_KERNEL);
+ if (!idle_routine)
+ return;
+ init_cpu_idle_routine(idle_routine, "xen_idle", xen_idle, 0);
+ register_cpu_idle_routine(idle_routine);
paravirt_disable_iospace();
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..145f7e3 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -17,6 +17,8 @@
#include <linux/cpuidle.h>
#include <linux/ktime.h>
#include <linux/hrtimer.h>
+#include <linux/slab.h>
+#include <linux/cpu_idle_routines.h>
#include <trace/events/power.h>
#include "cpuidle.h"
@@ -25,9 +27,10 @@ DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
DEFINE_MUTEX(cpuidle_lock);
LIST_HEAD(cpuidle_detected_devices);
-static void (*pm_idle_old)(void);
static int enabled_devices;
+static int registered_idle_call;
+static struct cpu_idle_routine *idle_routine;
#if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
static void cpuidle_kick_cpus(void)
@@ -55,9 +58,6 @@ static void cpuidle_idle_call(void)
/* check if the device is ready */
if (!dev || !dev->enabled) {
- if (pm_idle_old)
- pm_idle_old();
- else
#if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
default_idle();
#else
@@ -114,10 +114,19 @@ static void cpuidle_idle_call(void)
*/
void cpuidle_install_idle_handler(void)
{
- if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
+ if (enabled_devices && !registered_idle_call) {
/* Make sure all changes finished before we switch to new idle */
smp_wmb();
- pm_idle = cpuidle_idle_call;
+
+ idle_routine = kmalloc(sizeof(struct cpu_idle_routine),
+ GFP_KERNEL);
+ if (!idle_routine)
+ return;
+ init_cpu_idle_routine(idle_routine, "cpuidle_idle_call",
+ cpuidle_idle_call, 0);
+ register_cpu_idle_routine(idle_routine);
+
+ registered_idle_call = 1;
}
}
@@ -126,8 +135,12 @@ void cpuidle_install_idle_handler(void)
*/
void cpuidle_uninstall_idle_handler(void)
{
- if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
- pm_idle = pm_idle_old;
+ if (enabled_devices && registered_idle_call) {
+
+ unregister_cpu_idle_routine(idle_routine);
+ kfree(idle_routine);
+ registered_idle_call = 0;
+
cpuidle_kick_cpus();
}
}
@@ -420,8 +433,6 @@ static int __init cpuidle_init(void)
{
int ret;
- pm_idle_old = pm_idle;
-
ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
if (ret)
return ret;
diff --git a/include/linux/cpu_idle_routines.h b/include/linux/cpu_idle_routines.h
new file mode 100644
index 0000000..4faaf17
--- /dev/null
+++ b/include/linux/cpu_idle_routines.h
@@ -0,0 +1,37 @@
+/*
+ * cpu_idle_routines.h - a framework for registering various CPU idle routines
+ *
+ * This code is licenced under the GPL.
+ */
+
+#ifndef _LINUX_CPU_IDLE_ROUTINES_H
+#define _LINUX_CPU_IDLE_ROUTINES_H
+
+#include <linux/cpumask.h>
+
+#define CPU_IDLE_MAX_PRIORITY 1024
+#define CPU_IDLE_ROUTINE_DESC_LEN 32
+
+struct cpu_idle_routine {
+ char desc[CPU_IDLE_ROUTINE_DESC_LEN];
+ void (*routine)(void);
+ unsigned int priority;
+ /* cpumask_t cpus_allowed; Support asymetric C states */
+ /* int count; Ref count for per cpu registration */
+ struct list_head cpu_idle_routine_list;
+};
+
+static inline void init_cpu_idle_routine(struct cpu_idle_routine *ptr,
+ char *desc, void (*func)(void), unsigned int prio)
+{
+ strncpy(ptr->desc, desc, CPU_IDLE_ROUTINE_DESC_LEN);
+ ptr->routine = func;
+ ptr->priority = prio;
+}
+
+struct cpu_idle_routine *get_current_cpu_idle_routine(void);
+int register_cpu_idle_routine(struct cpu_idle_routine *routine);
+void unregister_cpu_idle_routine(struct cpu_idle_routine *routine);
+void __cpu_idle(void);
+
+#endif /* _LINUX_CPU_IDLE_ROUTINES_H */
^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-19 18:36 [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer Trinabh Gupta
@ 2010-10-19 18:38 ` Arjan van de Ven
2010-10-19 18:49 ` Venkatesh Pallipadi
0 siblings, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-19 18:38 UTC (permalink / raw)
To: Trinabh Gupta
Cc: peterz, lenb, suresh.b.siddha, benh, venki, linux-kernel,
g.trinabh
On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
> The core of the kernel's idle routine on x86 presently depends on an
> exported pm_idle function pointer that is unmanaged and causing
> hazard to various subsystems when they save and restore it.
> The first problem is that this exported pointer can be modified/flipped
> by any subsystem. There is no tracking or notification mechanism.
> Secondly and more importantly, various subsystems save the value of
> this pointer, flip it and later restore to the saved value. There is
> no guarantee that the saved value is still valid. The problem has
> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
> and implementing a list based registration [1].
>
> This patch is an initial RFC implementation for x86 architecture
> only. This framework can be generalised for other archs and also
> include the current cpuidle framework for managing multiple idle
> routines.
>
> Tests done with the patch:
> ------------------------
> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
> state and current_idle was selected to be mwait_idle.
>
> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
> deepest sleep state. The current_idle was selected to be
> cpuidle_idle_call which is the cpuidle subsystem that will further
> select idle routines from {C1,C2,C3}.
>
> Future implementation will try to eliminate this hirearchy and have
> a single registration and menu/idle cpuidle governor for selection
> of idle routine.
looks like you're duplicating the cpuidle subsystem
how about biting the bullet and just always and only use the cpuidle
subsystem for all idle on x86 ?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-19 18:38 ` Arjan van de Ven
@ 2010-10-19 18:49 ` Venkatesh Pallipadi
2010-10-19 19:01 ` Trinabh Gupta
0 siblings, 1 reply; 20+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-19 18:49 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Trinabh Gupta, peterz, lenb, suresh.b.siddha, benh, linux-kernel,
g.trinabh
On Tue, Oct 19, 2010 at 11:38 AM, Arjan van de Ven
<arjan@linux.intel.com> wrote:
> On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
>>
>> The core of the kernel's idle routine on x86 presently depends on an
>> exported pm_idle function pointer that is unmanaged and causing
>> hazard to various subsystems when they save and restore it.
>> The first problem is that this exported pointer can be modified/flipped
>> by any subsystem. There is no tracking or notification mechanism.
>> Secondly and more importantly, various subsystems save the value of
>> this pointer, flip it and later restore to the saved value. There is
>> no guarantee that the saved value is still valid. The problem has
>> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
>> and implementing a list based registration [1].
>>
>> This patch is an initial RFC implementation for x86 architecture
>> only. This framework can be generalised for other archs and also
>> include the current cpuidle framework for managing multiple idle
>> routines.
>>
>> Tests done with the patch:
>> ------------------------
>> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
>> state and current_idle was selected to be mwait_idle.
>>
>> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
>> deepest sleep state. The current_idle was selected to be
>> cpuidle_idle_call which is the cpuidle subsystem that will further
>> select idle routines from {C1,C2,C3}.
>>
>> Future implementation will try to eliminate this hirearchy and have
>> a single registration and menu/idle cpuidle governor for selection
>> of idle routine.
>
>
> looks like you're duplicating the cpuidle subsystem
>
> how about biting the bullet and just always and only use the cpuidle
> subsystem for all idle on x86 ?
>
I agree with Arjan.
If we have a default_cpuidle driver which parses idle= params, handles
various mwait quirks in x86 process*.c and registers with cpuidle, we
can then always call cpuidle idle routine on x86.
Thanks,
Venki
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-19 18:49 ` Venkatesh Pallipadi
@ 2010-10-19 19:01 ` Trinabh Gupta
2010-10-20 15:12 ` Trinabh Gupta
0 siblings, 1 reply; 20+ messages in thread
From: Trinabh Gupta @ 2010-10-19 19:01 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Arjan van de Ven, peterz, lenb, suresh.b.siddha, benh,
linux-kernel
On 10/20/2010 12:19 AM, Venkatesh Pallipadi wrote:
> On Tue, Oct 19, 2010 at 11:38 AM, Arjan van de Ven
> <arjan@linux.intel.com> wrote:
>> On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
>>>
>>> The core of the kernel's idle routine on x86 presently depends on an
>>> exported pm_idle function pointer that is unmanaged and causing
>>> hazard to various subsystems when they save and restore it.
>>> The first problem is that this exported pointer can be modified/flipped
>>> by any subsystem. There is no tracking or notification mechanism.
>>> Secondly and more importantly, various subsystems save the value of
>>> this pointer, flip it and later restore to the saved value. There is
>>> no guarantee that the saved value is still valid. The problem has
>>> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
>>> and implementing a list based registration [1].
>>>
>>> This patch is an initial RFC implementation for x86 architecture
>>> only. This framework can be generalised for other archs and also
>>> include the current cpuidle framework for managing multiple idle
>>> routines.
>>>
>>> Tests done with the patch:
>>> ------------------------
>>> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
>>> state and current_idle was selected to be mwait_idle.
>>>
>>> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
>>> deepest sleep state. The current_idle was selected to be
>>> cpuidle_idle_call which is the cpuidle subsystem that will further
>>> select idle routines from {C1,C2,C3}.
>>>
>>> Future implementation will try to eliminate this hirearchy and have
>>> a single registration and menu/idle cpuidle governor for selection
>>> of idle routine.
>>
>>
>> looks like you're duplicating the cpuidle subsystem
>>
>> how about biting the bullet and just always and only use the cpuidle
>> subsystem for all idle on x86 ?
>>
>
> I agree with Arjan.
> If we have a default_cpuidle driver which parses idle= params, handles
> various mwait quirks in x86 process*.c and registers with cpuidle, we
> can then always call cpuidle idle routine on x86.
This wouldn't duplicate code. It would move parts/functionality of
cpuidle into the kernel, keeping governors alone as modules.
If we directly call cpuidle_idle_call() then this may be too much
overhead for architectures that have single idle routine i.e cases where
cpuidle is not used and will be seen as a bloat. (c.f goals 4a,b).
>
> Thanks,
> Venki
Regards,
-Trinabh
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-19 19:01 ` Trinabh Gupta
@ 2010-10-20 15:12 ` Trinabh Gupta
2010-10-20 15:18 ` Arjan van de Ven
0 siblings, 1 reply; 20+ messages in thread
From: Trinabh Gupta @ 2010-10-20 15:12 UTC (permalink / raw)
To: Venkatesh Pallipadi, Arjan van de Ven, peterz
Cc: lenb, suresh.b.siddha, benh, linux-kernel, ak
On 10/20/2010 12:31 AM, Trinabh Gupta wrote:
>
>
> On 10/20/2010 12:19 AM, Venkatesh Pallipadi wrote:
>> On Tue, Oct 19, 2010 at 11:38 AM, Arjan van de Ven
>> <arjan@linux.intel.com> wrote:
>>> On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
>>>>
>>>> The core of the kernel's idle routine on x86 presently depends on an
>>>> exported pm_idle function pointer that is unmanaged and causing
>>>> hazard to various subsystems when they save and restore it.
>>>> The first problem is that this exported pointer can be modified/flipped
>>>> by any subsystem. There is no tracking or notification mechanism.
>>>> Secondly and more importantly, various subsystems save the value of
>>>> this pointer, flip it and later restore to the saved value. There is
>>>> no guarantee that the saved value is still valid. The problem has
>>>> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
>>>> and implementing a list based registration [1].
>>>>
>>>> This patch is an initial RFC implementation for x86 architecture
>>>> only. This framework can be generalised for other archs and also
>>>> include the current cpuidle framework for managing multiple idle
>>>> routines.
>>>>
>>>> Tests done with the patch:
>>>> ------------------------
>>>> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
>>>> state and current_idle was selected to be mwait_idle.
>>>>
>>>> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
>>>> deepest sleep state. The current_idle was selected to be
>>>> cpuidle_idle_call which is the cpuidle subsystem that will further
>>>> select idle routines from {C1,C2,C3}.
>>>>
>>>> Future implementation will try to eliminate this hirearchy and have
>>>> a single registration and menu/idle cpuidle governor for selection
>>>> of idle routine.
>>>
>>>
>>> looks like you're duplicating the cpuidle subsystem
>>>
>>> how about biting the bullet and just always and only use the cpuidle
>>> subsystem for all idle on x86 ?
>>>
>>
>> I agree with Arjan.
>> If we have a default_cpuidle driver which parses idle= params, handles
>> various mwait quirks in x86 process*.c and registers with cpuidle, we
>> can then always call cpuidle idle routine on x86.
>
> This wouldn't duplicate code. It would move parts/functionality of
> cpuidle into the kernel, keeping governors alone as modules.
>
> If we directly call cpuidle_idle_call() then this may be too much
> overhead for architectures that have single idle routine i.e cases where
> cpuidle is not used and will be seen as a bloat. (c.f goals 4a,b).
Hi Venki, Arjan
Building cpuidle into the kernel would add ~7KB for everyone, even
x86 architectures having only single idle state/routine. Andi Kleen had
pointed this out before. Please see http://lkml.org/lkml/2009/10/12/421.
% size drivers/cpuidle/built-in.o
text data bss dec hex filename
6054 1288 44 7386 1cda drivers/cpuidle/built-in.o
One solution is to move some of the functionality into the
kernel and keep governors as module. Currently this prototype
adds ~ 0.5KB and also does more robust management of idle routines.
The bottom line is that pm_idle needs to be removed one way or
the other.
% size arch/x86/kernel/built-in.o
ext data bss dec hex filename
Without this patch:
341328 445821 507713 1294862 13c20e arch/x86/kernel/built-in.o
With this patch:
341732 445885 507713 1295330 13c3e2 arch/x86/kernel/built-in.o
Increase: 468 Bytes
Thanks,
-Trinabh
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 15:12 ` Trinabh Gupta
@ 2010-10-20 15:18 ` Arjan van de Ven
2010-10-20 15:34 ` Andi Kleen
2010-10-20 15:57 ` Trinabh Gupta
0 siblings, 2 replies; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 15:18 UTC (permalink / raw)
To: Trinabh Gupta
Cc: Venkatesh Pallipadi, peterz, lenb, suresh.b.siddha, benh,
linux-kernel, ak
On 10/20/2010 8:12 AM, Trinabh Gupta wrote:
>
>
> On 10/20/2010 12:31 AM, Trinabh Gupta wrote:
>>
>>
>> On 10/20/2010 12:19 AM, Venkatesh Pallipadi wrote:
>>> On Tue, Oct 19, 2010 at 11:38 AM, Arjan van de Ven
>>> <arjan@linux.intel.com> wrote:
>>>> On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
>>>>>
>>>>> The core of the kernel's idle routine on x86 presently depends on an
>>>>> exported pm_idle function pointer that is unmanaged and causing
>>>>> hazard to various subsystems when they save and restore it.
>>>>> The first problem is that this exported pointer can be
>>>>> modified/flipped
>>>>> by any subsystem. There is no tracking or notification mechanism.
>>>>> Secondly and more importantly, various subsystems save the value of
>>>>> this pointer, flip it and later restore to the saved value. There is
>>>>> no guarantee that the saved value is still valid. The problem has
>>>>> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
>>>>> and implementing a list based registration [1].
>>>>>
>>>>> This patch is an initial RFC implementation for x86 architecture
>>>>> only. This framework can be generalised for other archs and also
>>>>> include the current cpuidle framework for managing multiple idle
>>>>> routines.
>>>>>
>>>>> Tests done with the patch:
>>>>> ------------------------
>>>>> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
>>>>> state and current_idle was selected to be mwait_idle.
>>>>>
>>>>> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
>>>>> deepest sleep state. The current_idle was selected to be
>>>>> cpuidle_idle_call which is the cpuidle subsystem that will further
>>>>> select idle routines from {C1,C2,C3}.
>>>>>
>>>>> Future implementation will try to eliminate this hirearchy and have
>>>>> a single registration and menu/idle cpuidle governor for selection
>>>>> of idle routine.
>>>>
>>>>
>>>> looks like you're duplicating the cpuidle subsystem
>>>>
>>>> how about biting the bullet and just always and only use the cpuidle
>>>> subsystem for all idle on x86 ?
>>>>
>>>
>>> I agree with Arjan.
>>> If we have a default_cpuidle driver which parses idle= params, handles
>>> various mwait quirks in x86 process*.c and registers with cpuidle, we
>>> can then always call cpuidle idle routine on x86.
>>
>> This wouldn't duplicate code. It would move parts/functionality of
>> cpuidle into the kernel, keeping governors alone as modules.
>>
>> If we directly call cpuidle_idle_call() then this may be too much
>> overhead for architectures that have single idle routine i.e cases where
>> cpuidle is not used and will be seen as a bloat. (c.f goals 4a,b).
>
> Hi Venki, Arjan
>
> Building cpuidle into the kernel would add ~7KB for everyone, even
> x86 architectures having only single idle state/routine.
but now you're duplicating this functionality adding code for everyone.
99.999% of the people today run cpuidle... (especially embedded x86
where they really care about power)
all x86 going forward also has > 1 idle option anyway.
and you're adding and extra layer in the middle that just duplicates the
layer that's in use in practice above it.
seriously, this sounds like the wrong tradeoff to make.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 15:18 ` Arjan van de Ven
@ 2010-10-20 15:34 ` Andi Kleen
2010-10-20 16:03 ` Arjan van de Ven
2010-10-20 15:57 ` Trinabh Gupta
1 sibling, 1 reply; 20+ messages in thread
From: Andi Kleen @ 2010-10-20 15:34 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb, suresh.b.siddha,
benh, linux-kernel
> but now you're duplicating this functionality adding code for everyone.
>
> 99.999% of the people today run cpuidle... (especially embedded x86
> where they really care about power)
> all x86 going forward also has > 1 idle option anyway.
>
> and you're adding and extra layer in the middle that just duplicates
> the layer that's in use in practice above it.
>
> seriously, this sounds like the wrong tradeoff to make.
I think the right option is still to put cpuidle on a diet.
There's no reason an idle handler needs to be that bloated.
If it was 2K or so just including it into the core would be fine.
Ignoring code size completely is generally a wrong trade off imho.
-Andi
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 15:18 ` Arjan van de Ven
2010-10-20 15:34 ` Andi Kleen
@ 2010-10-20 15:57 ` Trinabh Gupta
1 sibling, 0 replies; 20+ messages in thread
From: Trinabh Gupta @ 2010-10-20 15:57 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Venkatesh Pallipadi, peterz, lenb, suresh.b.siddha, benh,
linux-kernel, ak
On 10/20/2010 08:48 PM, Arjan van de Ven wrote:
> On 10/20/2010 8:12 AM, Trinabh Gupta wrote:
>>
>>
>> On 10/20/2010 12:31 AM, Trinabh Gupta wrote:
>>>
>>>
>>> On 10/20/2010 12:19 AM, Venkatesh Pallipadi wrote:
>>>> On Tue, Oct 19, 2010 at 11:38 AM, Arjan van de Ven
>>>> <arjan@linux.intel.com> wrote:
>>>>> On 10/19/2010 11:36 AM, Trinabh Gupta wrote:
>>>>>>
>>>>>> The core of the kernel's idle routine on x86 presently depends on an
>>>>>> exported pm_idle function pointer that is unmanaged and causing
>>>>>> hazard to various subsystems when they save and restore it.
>>>>>> The first problem is that this exported pointer can be
>>>>>> modified/flipped
>>>>>> by any subsystem. There is no tracking or notification mechanism.
>>>>>> Secondly and more importantly, various subsystems save the value of
>>>>>> this pointer, flip it and later restore to the saved value. There is
>>>>>> no guarantee that the saved value is still valid. The problem has
>>>>>> been discussed in [2,3] and Peter Zijlstra suggested removing pm_idle
>>>>>> and implementing a list based registration [1].
>>>>>>
>>>>>> This patch is an initial RFC implementation for x86 architecture
>>>>>> only. This framework can be generalised for other archs and also
>>>>>> include the current cpuidle framework for managing multiple idle
>>>>>> routines.
>>>>>>
>>>>>> Tests done with the patch:
>>>>>> ------------------------
>>>>>> 1. Build (on 2.6.36-rc7) and booted on x86 with C1E as deepest idle
>>>>>> state and current_idle was selected to be mwait_idle.
>>>>>>
>>>>>> 2. Build (on 2.5.36-rc8) and booted on x86 (Nehalem) with ACPI C3 as
>>>>>> deepest sleep state. The current_idle was selected to be
>>>>>> cpuidle_idle_call which is the cpuidle subsystem that will further
>>>>>> select idle routines from {C1,C2,C3}.
>>>>>>
>>>>>> Future implementation will try to eliminate this hirearchy and have
>>>>>> a single registration and menu/idle cpuidle governor for selection
>>>>>> of idle routine.
>>>>>
>>>>>
>>>>> looks like you're duplicating the cpuidle subsystem
>>>>>
>>>>> how about biting the bullet and just always and only use the cpuidle
>>>>> subsystem for all idle on x86 ?
>>>>>
>>>>
>>>> I agree with Arjan.
>>>> If we have a default_cpuidle driver which parses idle= params, handles
>>>> various mwait quirks in x86 process*.c and registers with cpuidle, we
>>>> can then always call cpuidle idle routine on x86.
>>>
>>> This wouldn't duplicate code. It would move parts/functionality of
>>> cpuidle into the kernel, keeping governors alone as modules.
>>>
>>> If we directly call cpuidle_idle_call() then this may be too much
>>> overhead for architectures that have single idle routine i.e cases where
>>> cpuidle is not used and will be seen as a bloat. (c.f goals 4a,b).
>>
>> Hi Venki, Arjan
>>
>> Building cpuidle into the kernel would add ~7KB for everyone, even
>> x86 architectures having only single idle state/routine.
>
> but now you're duplicating this functionality adding code for everyone.
I do not intend to duplicate code. I am trying to do what Andi
suggested i.e make the registration part lightweight and *move*
it into the kernel so that we can do away with problems such as
exported pm_idle pointer. This implementation is an *intermediate*
step.
>
> 99.999% of the people today run cpuidle... (especially embedded x86
> where they really care about power)
> all x86 going forward also has > 1 idle option anyway.
>
> and you're adding and extra layer in the middle that just duplicates the
> layer that's in use in practice above it.
As I said, this would not duplicate rather split things - minimal
registration part within the kernel (so that problems like
pm_idle are resolved) and governors as module.
-Trinabh
>
> seriously, this sounds like the wrong tradeoff to make.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 15:34 ` Andi Kleen
@ 2010-10-20 16:03 ` Arjan van de Ven
2010-10-20 19:19 ` Vaidyanathan Srinivasan
2010-10-20 20:55 ` Dipankar Sarma
0 siblings, 2 replies; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 16:03 UTC (permalink / raw)
To: Andi Kleen
Cc: Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb, suresh.b.siddha,
benh, linux-kernel
On 10/20/2010 8:34 AM, Andi Kleen wrote:
>
>> but now you're duplicating this functionality adding code for everyone.
>>
>> 99.999% of the people today run cpuidle... (especially embedded x86
>> where they really care about power)
>> all x86 going forward also has > 1 idle option anyway.
>>
>> and you're adding and extra layer in the middle that just duplicates
>> the layer that's in use in practice above it.
>>
>> seriously, this sounds like the wrong tradeoff to make.
>
> I think the right option is still to put cpuidle on a diet.
> There's no reason an idle handler needs to be that bloated.
>
> If it was 2K or so just including it into the core would be fine.
>
> Ignoring code size completely is generally a wrong trade off imho.
I'm not ignoring code size.
I'm saying that a 7Kb component that everyone on this architecture uses
in practice versus adding 0.5Kb in ADDITION to that for everyone for the
theoretical case
of someone NOT using cpuidle is the wrong tradeoff.
having it go on a diet? I'm all for it. Killing off the ladder governor
for example is a step.
But really. 7Kb. There's lots of lower hanging fruit as well. 7Kb is not
a reason to make such a bad tradeoff.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 16:03 ` Arjan van de Ven
@ 2010-10-20 19:19 ` Vaidyanathan Srinivasan
2010-10-20 19:25 ` Arjan van de Ven
2010-10-20 20:55 ` Dipankar Sarma
1 sibling, 1 reply; 20+ messages in thread
From: Vaidyanathan Srinivasan @ 2010-10-20 19:19 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
* Arjan van de Ven <arjan@linux.intel.com> [2010-10-20 09:03:23]:
> On 10/20/2010 8:34 AM, Andi Kleen wrote:
> >
> >>but now you're duplicating this functionality adding code for everyone.
> >>
> >>99.999% of the people today run cpuidle... (especially embedded
> >>x86 where they really care about power)
> >>all x86 going forward also has > 1 idle option anyway.
> >>
> >>and you're adding and extra layer in the middle that just
> >>duplicates the layer that's in use in practice above it.
> >>
> >>seriously, this sounds like the wrong tradeoff to make.
> >
> >I think the right option is still to put cpuidle on a diet.
> >There's no reason an idle handler needs to be that bloated.
> >
> >If it was 2K or so just including it into the core would be fine.
> >
> >Ignoring code size completely is generally a wrong trade off imho.
>
> I'm not ignoring code size.
> I'm saying that a 7Kb component that everyone on this architecture
> uses in practice versus adding 0.5Kb in ADDITION to that for
> everyone for the theoretical case
> of someone NOT using cpuidle is the wrong tradeoff.
Hi Arjan,
I agree with you that we need not add 0.5K extra code to x86 cpuidle
framework that is in use by most systems. However, this is only an
intermediate step (RFC) before we move/merge registration parts of
cpuidle into the kernel and leave only the governors as pluggable.
> having it go on a diet? I'm all for it. Killing off the ladder
> governor for example is a step.
> But really. 7Kb. There's lots of lower hanging fruit as well. 7Kb is
> not a reason to make such a bad tradeoff.
I see this RFC as an incremental step to move all idle routine
registration functionality into the kernel and keep governors and low
level drivers as modules. This will allow non x86 archs with just one
idle routine to keep minimal overhead. (Though this is becoming very
rare).
As stated in the goal the solution should satisfy the following
requirements:
4. Minimal overhead for arch with following use cases
a) Single compile time defined idle routine, no need for
runtime/boot time selection
b) Single idle routine, but selectable during boot/runtime.
No need for cpuidle governors
c) Runtime selection of single or multi-idle routines and
demand loading/usage of governors to select one among the
set of idle routines. (Current x86 model)
Making current cpuidle as default in kernel and make everybody else
register into cpuidle will satisfy (c), but we will need to slim down
the framework and keep parts of it as module to satisfy (b).
I think we agree on the goal, but need some discussion on what should
be the steps to reach the goal with incremental code changes and
without breaking multiple architectures at the same time.
--Vaidy
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:19 ` Vaidyanathan Srinivasan
@ 2010-10-20 19:25 ` Arjan van de Ven
2010-10-20 19:28 ` Peter Zijlstra
2010-10-20 19:40 ` Vaidyanathan Srinivasan
0 siblings, 2 replies; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 19:25 UTC (permalink / raw)
To: svaidy
Cc: Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
On 10/20/2010 12:19 PM, Vaidyanathan Srinivasan wrote:
> * Arjan van de Ven<arj
> I see this RFC as an incremental step to move all idle routine
> registration functionality into the kernel and keep governors and low
> level drivers as modules. This will allow non x86 archs with just one
> idle routine to keep minimal overhead. (Though this is becoming very
> rare).
yes pretty much all embedded really has more idle states, esp arm and co
> As stated in the goal the solution should satisfy the following
> requirements:
>
> 4. Minimal overhead for arch with following use cases
> a) Single compile time defined idle routine, no need for
> runtime/boot time selection
you ALWAYS have at least 2 idle handling states. The platform idle one
and the generic busy waiting one.
the later is needed for "I want absolutely 0 latency" cases.
> Making current cpuidle as default in kernel
not "in the kernel" but "for x86".
You're solving an x86 problem here, right?
(the pm_idle is an x86 only problem. other architectures should be able
to keep doing what they are doing)
For x86, lets solve it by going to cpuidle period... and if Andi can
find some bloat in cpuidle, lets see if the fat can be trimmed.
other architectures can either follow, or if they have nothing special
and only one idle routine, can do whatever they want.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:25 ` Arjan van de Ven
@ 2010-10-20 19:28 ` Peter Zijlstra
2010-10-20 19:29 ` Arjan van de Ven
2010-10-20 19:40 ` Vaidyanathan Srinivasan
1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-10-20 19:28 UTC (permalink / raw)
To: Arjan van de Ven
Cc: svaidy, Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, lenb,
suresh.b.siddha, benh, linux-kernel
On Wed, 2010-10-20 at 12:25 -0700, Arjan van de Ven wrote:
> (the pm_idle is an x86 only problem. other architectures should be able
> to keep doing what they are doing)
sadly no, some architectures copied this brilliant piece of design.
A quick git grep suggests:
arm/blackfin/cris/ia64/m32r/m68knommu/microblaze/mn10300/sh/sparc all
have pm_idle.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:28 ` Peter Zijlstra
@ 2010-10-20 19:29 ` Arjan van de Ven
0 siblings, 0 replies; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 19:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: svaidy, Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, lenb,
suresh.b.siddha, benh, linux-kernel
On 10/20/2010 12:28 PM, Peter Zijlstra wrote:
> On Wed, 2010-10-20 at 12:25 -0700, Arjan van de Ven wrote:
>> (the pm_idle is an x86 only problem. other architectures should be able
>> to keep doing what they are doing)
> sadly no, some architectures copied this brilliant piece of design.
>
> A quick git grep suggests:
> arm/blackfin/cris/ia64/m32r/m68knommu/microblaze/mn10300/sh/sparc all
> have pm_idle.
any of those that only every use one are the easy case.
the ones that use multiple can convert to the same solution x86 has over
time.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:25 ` Arjan van de Ven
2010-10-20 19:28 ` Peter Zijlstra
@ 2010-10-20 19:40 ` Vaidyanathan Srinivasan
2010-10-20 19:44 ` Arjan van de Ven
1 sibling, 1 reply; 20+ messages in thread
From: Vaidyanathan Srinivasan @ 2010-10-20 19:40 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
* Arjan van de Ven <arjan@linux.intel.com> [2010-10-20 12:25:55]:
> On 10/20/2010 12:19 PM, Vaidyanathan Srinivasan wrote:
> >* Arjan van de Ven<arj
> >I see this RFC as an incremental step to move all idle routine
> >registration functionality into the kernel and keep governors and low
> >level drivers as modules. This will allow non x86 archs with just one
> >idle routine to keep minimal overhead. (Though this is becoming very
> >rare).
>
> yes pretty much all embedded really has more idle states, esp arm and co
>
> >As stated in the goal the solution should satisfy the following
> >requirements:
> >
> >4. Minimal overhead for arch with following use cases
> > a) Single compile time defined idle routine, no need for
> > runtime/boot time selection
>
> you ALWAYS have at least 2 idle handling states. The platform idle
> one and the generic busy waiting one.
> the later is needed for "I want absolutely 0 latency" cases.
Some special overrides like idle=poll should handle this case even if
cpuidle and related registration mechanism is compiled out. The point
is that we need some flexibility even if the full framework is not
included.
> > Making current cpuidle as default in kernel
>
> not "in the kernel" but "for x86".
> You're solving an x86 problem here, right?
> (the pm_idle is an x86 only problem. other architectures should be
> able to keep doing what they are doing)
> For x86, lets solve it by going to cpuidle period... and if Andi can
> find some bloat in cpuidle, lets see if the fat can be trimmed.
Ok, you are suggesting that for x86 lets move cpuidle in kernel
always, while it can be an optional module for other archs as it
stands today. We can slim down the cpuidle from current 7K or atleast
split some parts like governors as modules if needed.
> other architectures can either follow, or if they have nothing
> special and only one idle routine, can do whatever they want.
It will be good to have other archs also follow the same cpuidle
framework and call it from their kernel/idle routines so that we need
not have a hierarchy of idle routines there.
--Vaidy
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:40 ` Vaidyanathan Srinivasan
@ 2010-10-20 19:44 ` Arjan van de Ven
2010-10-20 19:47 ` Venkatesh Pallipadi
0 siblings, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 19:44 UTC (permalink / raw)
To: svaidy
Cc: Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
On 10/20/2010 12:40 PM, Vaid
>> you ALWAYS have at least 2 idle handling states. The platform idle
>> one and the generic busy waiting one.
>> the later is needed for "I want absolutely 0 latency" cases.
> Some special overrides like idle=poll should handle this case even if
> cpuidle and related registration mechanism is compiled out. The point
> is that we need some flexibility even if the full framework is not
> included.
this is not idle=poll
this is an (privileged) app or driver, at runtime, requesting a 0 usec
max latency for a short or long period of time.
>>> Making current cpuidle as default in kernel
>> not "in the kernel" but "for x86".
>> You're solving an x86 problem here, right?
>> (the pm_idle is an x86 only problem. other architectures should be
>> able to keep doing what they are doing)
>> For x86, lets solve it by going to cpuidle period... and if Andi can
>> find some bloat in cpuidle, lets see if the fat can be trimmed.
> Ok, you are suggesting that for x86 lets move cpuidle in kernel
> always, while it can be an optional module for other archs as it
> stands today. We can slim down the cpuidle from current 7K or atleast
> split some parts like governors as modules if needed.
governors as modules is a total pain. modules don't solve the problem.
really. it's still code you need.
we have two governors today, menu and ladder
menu is best on anything that is tickless
ladder is useless on any tickless kernel, and likely not better than
menu on non-tickless.
that's it.
It will be good to have other archs also follow the same cpuidle
> framework and call it from their kernel/idle routines so that we need
> not have a hierarchy of idle routines there.
yes. but we don't have to force that to happen at the exact same time.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:44 ` Arjan van de Ven
@ 2010-10-20 19:47 ` Venkatesh Pallipadi
2010-10-20 20:03 ` Vaidyanathan Srinivasan
2010-10-20 20:47 ` Arjan van de Ven
0 siblings, 2 replies; 20+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-20 19:47 UTC (permalink / raw)
To: Arjan van de Ven
Cc: svaidy, Andi Kleen, Trinabh Gupta, peterz, lenb, suresh.b.siddha,
benh, linux-kernel
On Wed, Oct 20, 2010 at 12:44 PM, Arjan van de Ven
<arjan@linux.intel.com> wrote:
> On 10/20/2010 12:40 PM, Vaid
>>>
>>> you ALWAYS have at least 2 idle handling states. The platform idle
>>> one and the generic busy waiting one.
>>> the later is needed for "I want absolutely 0 latency" cases.
>>
>> Some special overrides like idle=poll should handle this case even if
>> cpuidle and related registration mechanism is compiled out. The point
>> is that we need some flexibility even if the full framework is not
>> included.
>
> this is not idle=poll
>
> this is an (privileged) app or driver, at runtime, requesting a 0 usec max
> latency for a short or long period of time.
>
>
>>>> Making current cpuidle as default in kernel
>>>
>>> not "in the kernel" but "for x86".
>>> You're solving an x86 problem here, right?
>>> (the pm_idle is an x86 only problem. other architectures should be
>>> able to keep doing what they are doing)
>>> For x86, lets solve it by going to cpuidle period... and if Andi can
>>> find some bloat in cpuidle, lets see if the fat can be trimmed.
>>
>> Ok, you are suggesting that for x86 lets move cpuidle in kernel
>> always, while it can be an optional module for other archs as it
>> stands today. We can slim down the cpuidle from current 7K or atleast
>> split some parts like governors as modules if needed.
>
> governors as modules is a total pain. modules don't solve the problem.
> really. it's still code you need.
> we have two governors today, menu and ladder
> menu is best on anything that is tickless
> ladder is useless on any tickless kernel, and likely not better than menu on
> non-tickless.
> that's it.
> It will be good to have other archs also follow the same cpuidle
I don't think they have to be modules. There can be a CPUIDLE_LITE
which deCONFIG entire governor.c and individual governors (and
probably sysfs stuff as well) for archs that can only use one state at
any time, but still want to do config or runtime detection of one
state and register that state.
Thanks,
Venki
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:47 ` Venkatesh Pallipadi
@ 2010-10-20 20:03 ` Vaidyanathan Srinivasan
2010-10-20 20:47 ` Arjan van de Ven
1 sibling, 0 replies; 20+ messages in thread
From: Vaidyanathan Srinivasan @ 2010-10-20 20:03 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: Arjan van de Ven, Andi Kleen, Trinabh Gupta, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
* Venkatesh Pallipadi <venki@google.com> [2010-10-20 12:47:22]:
> >> Ok, you are suggesting that for x86 lets move cpuidle in kernel
> >> always, while it can be an optional module for other archs as it
> >> stands today. We can slim down the cpuidle from current 7K or atleast
> >> split some parts like governors as modules if needed.
> >
> > governors as modules is a total pain. modules don't solve the problem.
> > really. it's still code you need.
> > we have two governors today, menu and ladder
> > menu is best on anything that is tickless
> > ladder is useless on any tickless kernel, and likely not better than menu on
> > non-tickless.
> > that's it.
> > It will be good to have other archs also follow the same cpuidle
>
> I don't think they have to be modules. There can be a CPUIDLE_LITE
> which deCONFIG entire governor.c and individual governors (and
> probably sysfs stuff as well) for archs that can only use one state at
> any time, but still want to do config or runtime detection of one
> state and register that state.
hmm, CPUIDLE_LITE should resemble this patch, where there is only
simple single idle routine registration. If the full CPUIDLE is
picked at compile time (default for x86), we could have all the
features.
Sounds like a good option to try.
--Vaidy
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 19:47 ` Venkatesh Pallipadi
2010-10-20 20:03 ` Vaidyanathan Srinivasan
@ 2010-10-20 20:47 ` Arjan van de Ven
2010-10-20 21:19 ` Venkatesh Pallipadi
1 sibling, 1 reply; 20+ messages in thread
From: Arjan van de Ven @ 2010-10-20 20:47 UTC (permalink / raw)
To: Venkatesh Pallipadi
Cc: svaidy, Andi Kleen, Trinabh Gupta, peterz, lenb, suresh.b.siddha,
benh, linux-kernel
On 10/20/2010 12:47 PM, Venkatesh Pallipadi wrote:
> On Wed, Oct 20, 2010 at 12:44 PM, Arjan van de Ven
> <arjan@linux.intel.com> wrote:
>> On 10/20/2010 12:40 PM, Vaid
>>>> you ALWAYS have at least 2 idle handling states. The platform idle
>>>> one and the generic busy waiting one.
>>>> the later is needed for "I want absolutely 0 latency" cases.
>>> Some special overrides like idle=poll should handle this case even if
>>> cpuidle and related registration mechanism is compiled out. The point
>>> is that we need some flexibility even if the full framework is not
>>> included.
>> this is not idle=poll
>>
>> this is an (privileged) app or driver, at runtime, requesting a 0 usec max
>> latency for a short or long period of time.
>>
>>
>>>>> Making current cpuidle as default in kernel
>>>> not "in the kernel" but "for x86".
>>>> You're solving an x86 problem here, right?
>>>> (the pm_idle is an x86 only problem. other architectures should be
>>>> able to keep doing what they are doing)
>>>> For x86, lets solve it by going to cpuidle period... and if Andi can
>>>> find some bloat in cpuidle, lets see if the fat can be trimmed.
>>> Ok, you are suggesting that for x86 lets move cpuidle in kernel
>>> always, while it can be an optional module for other archs as it
>>> stands today. We can slim down the cpuidle from current 7K or atleast
>>> split some parts like governors as modules if needed.
>> governors as modules is a total pain. modules don't solve the problem.
>> really. it's still code you need.
>> we have two governors today, menu and ladder
>> menu is best on anything that is tickless
>> ladder is useless on any tickless kernel, and likely not better than menu on
>> non-tickless.
>> that's it.
>> It will be good to have other archs also follow the same cpuidle
> I don't think they have to be modules. There can be a CPUIDLE_LITE
> which deCONFIG entire governor.c and individual governors (and
> probably sysfs stuff as well) for archs that can only use one state at
> any time, but still want to do config or runtime detection of one
> state and register that state.
there is no such case though; you always have at least 2 options!
(the hardware equivalent of "hlt" and the polling option)
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 16:03 ` Arjan van de Ven
2010-10-20 19:19 ` Vaidyanathan Srinivasan
@ 2010-10-20 20:55 ` Dipankar Sarma
1 sibling, 0 replies; 20+ messages in thread
From: Dipankar Sarma @ 2010-10-20 20:55 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Andi Kleen, Trinabh Gupta, Venkatesh Pallipadi, peterz, lenb,
suresh.b.siddha, benh, linux-kernel
On Wed, Oct 20, 2010 at 09:03:23AM -0700, Arjan van de Ven wrote:
> On 10/20/2010 8:34 AM, Andi Kleen wrote:
> >
> >I think the right option is still to put cpuidle on a diet.
> >There's no reason an idle handler needs to be that bloated.
> >
> >If it was 2K or so just including it into the core would be fine.
> >
> >Ignoring code size completely is generally a wrong trade off imho.
>
> I'm not ignoring code size.
> I'm saying that a 7Kb component that everyone on this architecture
> uses in practice versus adding 0.5Kb in ADDITION to that for
> everyone for the theoretical case
> of someone NOT using cpuidle is the wrong tradeoff.
The 0.5kb is necessary because we want to move from dangling
pm_idle to a simple registration mechanism.
> having it go on a diet? I'm all for it. Killing off the ladder
> governor for example is a step.
> But really. 7Kb. There's lots of lower hanging fruit as well. 7Kb is
> not a reason to make such a bad tradeoff.
Given the number of archs using this, doing this incrementally
seems to be the best way to go. The registration part first,
trimming cpuidle, moving other archs to the registration
mechanism later eventually deprecating pm_idle.
Thanks
Dipankar
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer
2010-10-20 20:47 ` Arjan van de Ven
@ 2010-10-20 21:19 ` Venkatesh Pallipadi
0 siblings, 0 replies; 20+ messages in thread
From: Venkatesh Pallipadi @ 2010-10-20 21:19 UTC (permalink / raw)
To: Arjan van de Ven
Cc: svaidy, Andi Kleen, Trinabh Gupta, peterz, lenb, suresh.b.siddha,
benh, linux-kernel
On Wed, Oct 20, 2010 at 1:47 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> On 10/20/2010 12:47 PM, Venkatesh Pallipadi wrote:
>>
>> On Wed, Oct 20, 2010 at 12:44 PM, Arjan van de Ven
>> <arjan@linux.intel.com> wrote:
>>>
>>> On 10/20/2010 12:40 PM, Vaid
>>>>>
>>>>> you ALWAYS have at least 2 idle handling states. The platform idle
>>>>> one and the generic busy waiting one.
>>>>> the later is needed for "I want absolutely 0 latency" cases.
>>>>
>>>> Some special overrides like idle=poll should handle this case even if
>>>> cpuidle and related registration mechanism is compiled out. The point
>>>> is that we need some flexibility even if the full framework is not
>>>> included.
>>>
>>> this is not idle=poll
>>>
>>> this is an (privileged) app or driver, at runtime, requesting a 0 usec
>>> max
>>> latency for a short or long period of time.
>>>
>>>
>>>>>> Making current cpuidle as default in kernel
>>>>>
>>>>> not "in the kernel" but "for x86".
>>>>> You're solving an x86 problem here, right?
>>>>> (the pm_idle is an x86 only problem. other architectures should be
>>>>> able to keep doing what they are doing)
>>>>> For x86, lets solve it by going to cpuidle period... and if Andi can
>>>>> find some bloat in cpuidle, lets see if the fat can be trimmed.
>>>>
>>>> Ok, you are suggesting that for x86 lets move cpuidle in kernel
>>>> always, while it can be an optional module for other archs as it
>>>> stands today. We can slim down the cpuidle from current 7K or atleast
>>>> split some parts like governors as modules if needed.
>>>
>>> governors as modules is a total pain. modules don't solve the problem.
>>> really. it's still code you need.
>>> we have two governors today, menu and ladder
>>> menu is best on anything that is tickless
>>> ladder is useless on any tickless kernel, and likely not better than menu
>>> on
>>> non-tickless.
>>> that's it.
>>> It will be good to have other archs also follow the same cpuidle
>>
>> I don't think they have to be modules. There can be a CPUIDLE_LITE
>> which deCONFIG entire governor.c and individual governors (and
>> probably sysfs stuff as well) for archs that can only use one state at
>> any time, but still want to do config or runtime detection of one
>> state and register that state.
>
> there is no such case though; you always have at least 2 options!
> (the hardware equivalent of "hlt" and the polling option)
>
Agree. But, from arch perspective there is only hlt and CPUIDLE_LITE
can add poll and still deconfigure all but one governor, sysfs stuff
and code dealing with registering of multiple states etc for example.
I am not saying doing all this is a priority. Just saying that, if
~7KB CPUIDLE is seen as a problem for some arch which does not have
more than one kind of halt, then there are ways to trim CPUIDLE down.
Thanks,
Venki
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2010-10-20 21:19 UTC | newest]
Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-19 18:36 [RFC V1] cpuidle: add idle routine registration and cleanup pm_idle pointer Trinabh Gupta
2010-10-19 18:38 ` Arjan van de Ven
2010-10-19 18:49 ` Venkatesh Pallipadi
2010-10-19 19:01 ` Trinabh Gupta
2010-10-20 15:12 ` Trinabh Gupta
2010-10-20 15:18 ` Arjan van de Ven
2010-10-20 15:34 ` Andi Kleen
2010-10-20 16:03 ` Arjan van de Ven
2010-10-20 19:19 ` Vaidyanathan Srinivasan
2010-10-20 19:25 ` Arjan van de Ven
2010-10-20 19:28 ` Peter Zijlstra
2010-10-20 19:29 ` Arjan van de Ven
2010-10-20 19:40 ` Vaidyanathan Srinivasan
2010-10-20 19:44 ` Arjan van de Ven
2010-10-20 19:47 ` Venkatesh Pallipadi
2010-10-20 20:03 ` Vaidyanathan Srinivasan
2010-10-20 20:47 ` Arjan van de Ven
2010-10-20 21:19 ` Venkatesh Pallipadi
2010-10-20 20:55 ` Dipankar Sarma
2010-10-20 15:57 ` Trinabh Gupta
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).