* [RFC V2 0/3] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel
@ 2011-01-13 12:51 Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 1/3] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Trinabh Gupta @ 2011-01-13 12:51 UTC (permalink / raw)
  To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel

The core of the kernel's idle routine on x86 presently depends
on an exported pm_idle function pointer that is unmanaged.
The first problem is that this exported pointer can be modified
by any subsystem. There is no tracking or notification mechanism.
Secondly and more importantly, various subsystems save the value of
this pointer, flip it and later restore to the saved value. There is
no guarantee that the saved value is still valid. The problem has
been discussed at http://lkml.org/lkml/2009/8/28/43 and
http://lkml.org/lkml/2009/8/28/50.
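
To make the second problem concrete, here is a small userspace sketch of
the save/flip/restore pattern. It is only a model under stated
assumptions: module_a_idle, subsys_b_idle, module_a_loaded and the
helper names are hypothetical and do not come from the kernel tree; only
the pm_idle pointer itself corresponds to the real symbol.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*idle_fn)(void);

/* Stand-in for the exported, unmanaged pm_idle pointer. */
static idle_fn pm_idle;

/* Hypothetical idle routines owned by two unrelated subsystems. */
static void module_a_idle(void) { }
static void subsys_b_idle(void) { }

/* Set while "module A" is loaded; its routine is only valid then. */
static int module_a_loaded;

/* Subsystem B: save the current pointer, flip it, restore it later. */
static idle_fn b_saved;

static void subsys_b_start(void)
{
	b_saved = pm_idle;
	pm_idle = subsys_b_idle;
}

static void subsys_b_stop(void)
{
	/* B assumes nobody else touched the pointer in the meantime. */
	assert(pm_idle == subsys_b_idle);
	pm_idle = b_saved;
}

/* Returns 1 if pm_idle points at a routine whose owner is gone. */
static int pm_idle_is_stale(void)
{
	return pm_idle == module_a_idle && !module_a_loaded;
}

/* Replays the problematic ordering and reports whether it went wrong. */
int replay_save_flip_restore(void)
{
	module_a_loaded = 1;
	pm_idle = module_a_idle;	/* A installs its routine */
	subsys_b_start();		/* B saves A's routine and flips */
	module_a_loaded = 0;		/* A goes away; nobody tells B */
	subsys_b_stop();		/* B restores the saved value */
	return pm_idle_is_stale();
}
```

Calling replay_save_flip_restore() leaves pm_idle pointing at A's routine
after A is gone. Nothing in the pointer itself lets B detect that, which
is the lack of tracking described above.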

This patch series aims to have cpuidle manage all idle routines
by making the following changes:

* Remove pm_idle pointer from process_xx.c and implicitly call
  cpuidle.c:cpuidle_idle_call().

* Change select_idle_routine() in x86 bootup code to create
  a cpuidle_driver structure and register the routines with the
  cpuidle subsystem using a default_idle driver in 
  drivers/idle/default_driver.c

* The 'best' (current) cpuidle_driver is selected based
  on the priority that is supplied at the time of registration/
  unregistration.

* Arch supplied default idle (hlt or poll) is used during transitions.  

* cpuidle_drivers and their (per-cpu) cpuidle_devices are kept in
  doubly linked lists so that a driver can be selected, and
  unregistered when it is no longer used. Previously the cpuidle
  subsystem had only one cpuidle_driver and its corresponding
  cpuidle_devices.

* Ideally we could have a single list of idle routines supplied by
  different modules and use a priority+governor to select the best.
  This would have to be done incrementally because of a dependency:
  all cpus should use idle routines from the same set (e.g. ACPI),
  not a combination of, say, C1e on some cpus and ACPI routines on
  other cpus.
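
The priority based selection in the list above can be sketched in plain
C as follows. This is a simplified model, not the patch itself: it uses
a singly linked registry instead of the kernel's doubly linked
list_head, the names idle_driver, idle_register, idle_unregister and
idle_select are hypothetical, and it assumes (as
select_cpuidle_driver() in patch 2/3 appears to do) that the smallest
priority value wins.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PRIORITY 1000

struct idle_driver {
	const char *name;
	unsigned int priority;		/* smaller value is preferred here */
	struct idle_driver *next;	/* simple singly linked registry */
};

static struct idle_driver *registry;

/* Add a driver at the head of the registry. */
void idle_register(struct idle_driver *drv)
{
	assert(drv != NULL);
	drv->next = registry;
	registry = drv;
}

/* Unlink a driver; does nothing if it was never registered. */
void idle_unregister(struct idle_driver *drv)
{
	struct idle_driver **pp;

	for (pp = &registry; *pp; pp = &(*pp)->next) {
		if (*pp == drv) {
			*pp = drv->next;
			drv->next = NULL;
			return;
		}
	}
}

/* Pick the registered driver with the smallest priority value. */
struct idle_driver *idle_select(void)
{
	struct idle_driver *it, *best = NULL;
	unsigned int min_priority = MAX_PRIORITY;

	for (it = registry; it; it = it->next) {
		if (it->priority <= min_priority) {
			best = it;
			min_priority = it->priority;
		}
	}
	return best;
}
```

With a default driver at a (hypothetical) priority of 100 and acpi_idle
at 10, idle_select() returns acpi_idle while both are registered and
falls back to the default driver once acpi_idle is unregistered.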

This patch series applies on 2.6.37 and was tested on an x86 system
with multiple sleep states.

Comments on the general approach are requested, based on the
discussion in v1 (https://lkml.org/lkml/2010/10/19/449).

Changes from v1 ("cpuidle: add idle routine registration and cleanup
pm_idle pointer"):

* v1 had a registration mechanism for single idle routines and
  expected cpuidle.c:cpuidle_idle_call() to take over on systems with
  multiple idle routines.

  This series instead uses cpuidle.c:cpuidle_idle_call() to invoke
  single idle routines as well as multiple idle routines. The
  registration mechanism proposed in v1 is completely replaced by
  cpuidle.c, with added fields such as 'priority'.

---

Trinabh Gupta (3):
      cpuidle: Default Idle Driver for x86
      cpuidle data structure changes, list based cpuidle driver registration
      cpuidle: Remove pm_idle pointer for x86


 arch/x86/kernel/process.c     |  340 -------------------------------
 arch/x86/kernel/process_32.c  |    4 
 arch/x86/kernel/process_64.c  |    4 
 arch/x86/xen/setup.c          |    1 
 drivers/acpi/processor_idle.c |    2 
 drivers/cpuidle/Kconfig       |    2 
 drivers/cpuidle/cpuidle.c     |   42 +---
 drivers/cpuidle/driver.c      |   68 +++++-
 drivers/cpuidle/governor.c    |   13 +
 drivers/idle/Makefile         |    2 
 drivers/idle/default_driver.c |  451 +++++++++++++++++++++++++++++++++++++++++
 include/linux/cpuidle.h       |    4 
 12 files changed, 543 insertions(+), 390 deletions(-)
 create mode 100644 drivers/idle/default_driver.c

-- 
-Trinabh


* [RFC V2 1/3] cpuidle: Remove pm_idle pointer for x86
  2011-01-13 12:51 [RFC V2 0/3] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta
@ 2011-01-13 12:52 ` Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 2/3] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 3/3] cpuidle: default idle driver for x86 Trinabh Gupta
  2 siblings, 0 replies; 6+ messages in thread
From: Trinabh Gupta @ 2011-01-13 12:52 UTC (permalink / raw)
  To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel

This patch removes the pm_idle function pointer and calls
cpuidle_idle_call() directly from the idle loop on x86. CPUIdle
therefore has to be built into the kernel.

Archs that still use pm_idle can continue to set
pm_idle=cpuidle_idle_call() and co-exist.  This will need #ifdefs
in cpuidle.c and has not been implemented yet in this RFC.

Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com>
---

 arch/x86/kernel/process_32.c |    4 +++-
 arch/x86/kernel/process_64.c |    4 +++-
 drivers/cpuidle/Kconfig      |    2 +-
 drivers/cpuidle/cpuidle.c    |    2 +-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 96586c3..d487ff8 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -76,6 +76,8 @@ static inline void play_dead(void)
 }
 #endif
 
+extern void cpuidle_idle_call(void);
+
 /*
  * The idle thread. There's no useful work to be
  * done, so just try to conserve power and have a
@@ -111,7 +113,7 @@ void cpu_idle(void)
 			local_irq_disable();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
-			pm_idle();
+			cpuidle_idle_call();
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b3d7a3a..768cd23 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -101,6 +101,8 @@ static inline void play_dead(void)
 }
 #endif
 
+extern void cpuidle_idle_call(void);
+
 /*
  * The idle thread. There's no useful work to be
  * done, so just try to conserve power and have a
@@ -138,7 +140,7 @@ void cpu_idle(void)
 			enter_idle();
 			/* Don't trace irqs off for idle */
 			stop_critical_timings();
-			pm_idle();
+			cpuidle_idle_call();
 			start_critical_timings();
 
 			trace_power_end(smp_processor_id());
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 7dbc4a8..57ad7bd 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -1,7 +1,7 @@
 
 config CPU_IDLE
 	bool "CPU idle PM support"
-	default ACPI
+	default y
 	help
 	  CPU idle is a generic framework for supporting software-controlled
 	  idle processor power management.  It includes modular cross-platform
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a507108..9bf4640 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -47,7 +47,7 @@ static int __cpuidle_register_device(struct cpuidle_device *dev);
  *
  * NOTE: no locks or semaphores should be used here
  */
-static void cpuidle_idle_call(void)
+void cpuidle_idle_call(void)
 {
 	struct cpuidle_device *dev = __get_cpu_var(cpuidle_devices);
 	struct cpuidle_state *target_state;



* [RFC V2 2/3] cpuidle: list based cpuidle driver registration and selection
  2011-01-13 12:51 [RFC V2 0/3] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 1/3] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta
@ 2011-01-13 12:52 ` Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 3/3] cpuidle: default idle driver for x86 Trinabh Gupta
  2 siblings, 0 replies; 6+ messages in thread
From: Trinabh Gupta @ 2011-01-13 12:52 UTC (permalink / raw)
  To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel

A cpuidle_driver structure represents a cpuidle driver, such as
acpi_idle or intel_idle, that provides low level idle routines.
A cpuidle_driver is global in nature as it provides routines
for all the CPUs. Each CPU registered with the cpuidle subsystem is
represented as a cpuidle_device. A cpuidle_device structure
points to the low level idle routines for that CPU provided by
a certain driver. In other words, a cpuidle driver creates a
cpuidle_device structure for each CPU that it registers with the
cpuidle subsystem. Whenever the cpuidle idle loop is called, the
cpuidle subsystem picks the cpuidle_device structure for that CPU and
calls one of the low level idle routines through that structure.

In the current design, only one cpuidle_driver may be registered
and registration of any subsequent driver fails. The same registered
driver provides low level idle routines for each cpuidle_device.

A list based registration for cpuidle_driver provides a clean
mechanism for multiple subsystems/modules to register their own idle
routines and thus avoids using pm_idle.

This patch implements list based registration for cpuidle
drivers. Different drivers can be registered, and the driver to
be used is selected based on a per-driver priority. On each
cpuidle driver registration or unregistration, the cpuidle_device
structure for each CPU is changed. Each cpuidle_device points
to its driver. Each driver also maintains a list of its own
cpuidle_devices (although this list could be made global). In effect
there is a two-way link between cpuidle_device and cpuidle_driver,
which eases management on subsequent registration and
unregistration.

        ---------                        ---------
        |cpuidle|                        |cpuidle|
        |driver |----------------------- |driver |
        |default|                        |acpi   |
        ---------                        ---------
            |                                |
            |                                |
----------------------------      ---------------------------
|                    |            |                |
|                    |            |                |
|                    |            |                |
---------    ---------            ---------    ----------
|cpuidle|    |cpuidle|            |cpuidle|    |cpuidle |
|device |    |device | ....       |device |    |device  | .....
|  CPU0 |    | CPU1  |            |CPU0   |    |CPU1    |
---------    ---------            ---------    ----------
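
The relationship in the diagram can be modelled in a few lines of
userspace C. This is only a sketch under simplifying assumptions: each
driver holds a fixed array of per-CPU devices instead of a list_head,
NR_CPUS is fixed at 2, and the names drv, dev, active_dev, drv_init and
set_current_drv are hypothetical stand-ins for cpuidle_driver,
cpuidle_device, the per-cpu cpuidle_devices pointer and
set_current_cpuidle_driver().

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2

struct drv;

struct dev {
	int cpu;
	struct drv *drv;		/* back-pointer: device -> driver */
};

struct drv {
	const char *name;
	struct dev devs[NR_CPUS];	/* driver -> its per-CPU devices */
};

/* Per-CPU pointer to the device the idle loop would use. */
static struct dev *active_dev[NR_CPUS];
static struct drv *current_drv;

/* Create one device per CPU and link it back to its driver. */
void drv_init(struct drv *d, const char *name)
{
	assert(d != NULL);
	d->name = name;
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		d->devs[cpu].cpu = cpu;
		d->devs[cpu].drv = d;
	}
}

/* Switch every CPU to the devices of the newly selected driver. */
void set_current_drv(struct drv *d)
{
	if (d == current_drv)
		return;
	current_drv = d;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		active_dev[cpu] = d ? &d->devs[cpu] : NULL;
}
```

Because each device carries a pointer to its driver, code holding only a
device can reach the owning driver (for example for module reference
counting) without consulting a global "current driver" variable.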

Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com>
---

 drivers/acpi/processor_idle.c |    2 +
 drivers/cpuidle/cpuidle.c     |   50 +++++++++++++-----------------
 drivers/cpuidle/driver.c      |   68 ++++++++++++++++++++++++++++++++++-------
 drivers/cpuidle/governor.c    |   13 +++++---
 include/linux/cpuidle.h       |    4 ++
 5 files changed, 93 insertions(+), 44 deletions(-)

diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index dcb38f8..b53c7fb 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -964,6 +964,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 struct cpuidle_driver acpi_idle_driver = {
 	.name =		"acpi_idle",
 	.owner =	THIS_MODULE,
+	.priority =	10,
 };
 
 /**
@@ -985,6 +986,7 @@ static int acpi_processor_setup_cpuidle(struct acpi_processor *pr)
 	}
 
 	dev->cpu = pr->id;
+	dev->drv = &acpi_idle_driver;
 	for (i = 0; i < CPUIDLE_STATE_MAX; i++) {
 		dev->states[i].name[0] = '\0';
 		dev->states[i].desc[0] = '\0';
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 9bf4640..f16b920 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -25,9 +25,6 @@ DEFINE_PER_CPU(struct cpuidle_device *, cpuidle_devices);
 
 DEFINE_MUTEX(cpuidle_lock);
 LIST_HEAD(cpuidle_detected_devices);
-static void (*pm_idle_old)(void);
-
-static int enabled_devices;
 
 #if defined(CONFIG_ARCH_HAS_CPU_IDLE_WAIT)
 static void cpuidle_kick_cpus(void)
@@ -55,13 +52,10 @@ void cpuidle_idle_call(void)
 
 	/* check if the device is ready */
 	if (!dev || !dev->enabled) {
-		if (pm_idle_old)
-			pm_idle_old();
-		else
 #if defined(CONFIG_ARCH_HAS_DEFAULT_IDLE)
-			default_idle();
+		default_idle();
 #else
-			local_irq_enable();
+		local_irq_enable();
 #endif
 		return;
 	}
@@ -114,11 +108,17 @@ void cpuidle_idle_call(void)
  */
 void cpuidle_install_idle_handler(void)
 {
-	if (enabled_devices && (pm_idle != cpuidle_idle_call)) {
-		/* Make sure all changes finished before we switch to new idle */
-		smp_wmb();
-		pm_idle = cpuidle_idle_call;
-	}
+	/* 
+	* To Do: These (un)install_idle_handler() routines will become
+	* arch specific to stage the cleanup on a per-architecture basis.
+	*
+	* For archs with pm_idle, the original code will remain. This can also
+	* move into arch/kernel where the pm_idle is defined so that
+	* per-architecture code is naturally split-up. From cpuidle's point of
+	* view, it will call the (un)install_idle_handler() as needed and this
+	* will be a no-op on non-pm_idle archs.
+	*/
+	smp_wmb();
 }
 
 /**
@@ -126,10 +126,7 @@ void cpuidle_install_idle_handler(void)
  */
 void cpuidle_uninstall_idle_handler(void)
 {
-	if (enabled_devices && pm_idle_old && (pm_idle != pm_idle_old)) {
-		pm_idle = pm_idle_old;
-		cpuidle_kick_cpus();
-	}
+	cpuidle_kick_cpus();
 }
 
 /**
@@ -196,7 +193,6 @@ int cpuidle_enable_device(struct cpuidle_device *dev)
 
 	dev->enabled = 1;
 
-	enabled_devices++;
 	return 0;
 
 fail_sysfs:
@@ -227,7 +223,6 @@ void cpuidle_disable_device(struct cpuidle_device *dev)
 		cpuidle_curr_governor->disable(dev);
 
 	cpuidle_remove_state_sysfs(dev);
-	enabled_devices--;
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_disable_device);
@@ -286,7 +281,7 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
 
 	if (!sys_dev)
 		return -EINVAL;
-	if (!try_module_get(cpuidle_driver->owner))
+	if (!try_module_get(dev->drv->owner))
 		return -EINVAL;
 
 	init_completion(&dev->kobj_unregister);
@@ -313,10 +308,11 @@ static int __cpuidle_register_device(struct cpuidle_device *dev)
 			dev->states[i].power_usage = -1 - i;
 	}
 
-	per_cpu(cpuidle_devices, dev->cpu) = dev;
-	list_add(&dev->device_list, &cpuidle_detected_devices);
+	if (cpuidle_driver == dev->drv)
+		per_cpu(cpuidle_devices, dev->cpu) = dev;
+	list_add(&dev->device_list, &dev->drv->device_list);
 	if ((ret = cpuidle_add_sysfs(sys_dev))) {
-		module_put(cpuidle_driver->owner);
+		module_put(dev->drv->owner);
 		return ret;
 	}
 
@@ -367,13 +363,13 @@ void cpuidle_unregister_device(struct cpuidle_device *dev)
 	cpuidle_disable_device(dev);
 
 	cpuidle_remove_sysfs(sys_dev);
-	list_del(&dev->device_list);
 	wait_for_completion(&dev->kobj_unregister);
-	per_cpu(cpuidle_devices, dev->cpu) = NULL;
+	if (cpuidle_driver == dev->drv)
+		per_cpu(cpuidle_devices, dev->cpu) = NULL;
 
 	cpuidle_resume_and_unlock();
 
-	module_put(cpuidle_driver->owner);
+	module_put(dev->drv->owner);
 }
 
 EXPORT_SYMBOL_GPL(cpuidle_unregister_device);
@@ -420,8 +416,6 @@ static int __init cpuidle_init(void)
 {
 	int ret;
 
-	pm_idle_old = pm_idle;
-
 	ret = cpuidle_add_class_sysfs(&cpu_sysdev_class);
 	if (ret)
 		return ret;
diff --git a/drivers/cpuidle/driver.c b/drivers/cpuidle/driver.c
index fd1601e..78ae527 100644
--- a/drivers/cpuidle/driver.c
+++ b/drivers/cpuidle/driver.c
@@ -14,8 +14,52 @@
 
 #include "cpuidle.h"
 
+#define MAX_PRIORITY 1000
+
 static struct cpuidle_driver *cpuidle_curr_driver;
 DEFINE_SPINLOCK(cpuidle_driver_lock);
+LIST_HEAD(registered_cpuidle_drivers);
+
+static struct cpuidle_driver *select_cpuidle_driver(void)
+{
+	struct cpuidle_driver *item = NULL, *next = NULL;
+	unsigned int min_priority = MAX_PRIORITY;
+
+	list_for_each_entry(item, &registered_cpuidle_drivers,
+		driver_list) {
+		if (item->priority <= min_priority) {
+			next = item;
+			min_priority = item->priority;
+		}
+	}
+	return next;
+}
+
+static void set_current_cpuidle_driver(struct cpuidle_driver *drv)
+{
+	struct cpuidle_device *item = NULL;
+	if (drv == cpuidle_curr_driver)
+		return;
+
+	/* Unregister the previous drivers devices */
+	if (cpuidle_curr_driver) {
+		list_for_each_entry(item, &cpuidle_curr_driver->device_list,
+			device_list) {
+			cpuidle_unregister_device(item);
+		}
+	}
+
+	cpuidle_curr_driver = drv;
+
+	if (drv == NULL)
+		return;
+	else {
+		/* Register the new driver devices */
+		list_for_each_entry(item, &drv->device_list, device_list) {
+			cpuidle_register_device(item);
+		}
+	}
+}
 
 /**
  * cpuidle_register_driver - registers a driver
@@ -23,15 +67,20 @@ DEFINE_SPINLOCK(cpuidle_driver_lock);
  */
 int cpuidle_register_driver(struct cpuidle_driver *drv)
 {
+	struct cpuidle_driver *item = NULL;
 	if (!drv)
 		return -EINVAL;
 
-	spin_lock(&cpuidle_driver_lock);
-	if (cpuidle_curr_driver) {
-		spin_unlock(&cpuidle_driver_lock);
-		return -EBUSY;
+	/* Check if driver already registered */
+	list_for_each_entry(item, &registered_cpuidle_drivers, driver_list) {
+		if (item == drv)
+			return -EINVAL;
 	}
-	cpuidle_curr_driver = drv;
+
+	spin_lock(&cpuidle_driver_lock);
+	INIT_LIST_HEAD(&drv->device_list);
+	list_add(&drv->driver_list, &registered_cpuidle_drivers);
+	set_current_cpuidle_driver(select_cpuidle_driver());
 	spin_unlock(&cpuidle_driver_lock);
 
 	return 0;
@@ -54,14 +103,9 @@ EXPORT_SYMBOL_GPL(cpuidle_get_driver);
  */
 void cpuidle_unregister_driver(struct cpuidle_driver *drv)
 {
-	if (drv != cpuidle_curr_driver) {
-		WARN(1, "invalid cpuidle_unregister_driver(%s)\n",
-			drv->name);
-		return;
-	}
-
 	spin_lock(&cpuidle_driver_lock);
-	cpuidle_curr_driver = NULL;
+	list_del(&drv->driver_list);
+	set_current_cpuidle_driver(select_cpuidle_driver());
 	spin_unlock(&cpuidle_driver_lock);
 }
 
diff --git a/drivers/cpuidle/governor.c b/drivers/cpuidle/governor.c
index 724c164..a8ed695 100644
--- a/drivers/cpuidle/governor.c
+++ b/drivers/cpuidle/governor.c
@@ -44,6 +44,7 @@ static struct cpuidle_governor * __cpuidle_find_governor(const char *str)
 int cpuidle_switch_governor(struct cpuidle_governor *gov)
 {
 	struct cpuidle_device *dev;
+	struct cpuidle_driver *drv = cpuidle_get_driver();
 
 	if (gov == cpuidle_curr_governor)
 		return 0;
@@ -51,8 +52,10 @@ int cpuidle_switch_governor(struct cpuidle_governor *gov)
 	cpuidle_uninstall_idle_handler();
 
 	if (cpuidle_curr_governor) {
-		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
-			cpuidle_disable_device(dev);
+		if (drv) {
+			list_for_each_entry(dev, &drv->device_list, device_list)
+				cpuidle_disable_device(dev);
+		}
 		module_put(cpuidle_curr_governor->owner);
 	}
 
@@ -61,8 +64,10 @@ int cpuidle_switch_governor(struct cpuidle_governor *gov)
 	if (gov) {
 		if (!try_module_get(cpuidle_curr_governor->owner))
 			return -EINVAL;
-		list_for_each_entry(dev, &cpuidle_detected_devices, device_list)
-			cpuidle_enable_device(dev);
+		if (drv) {
+			list_for_each_entry(dev, &drv->device_list, device_list)
+				cpuidle_enable_device(dev);
+		}
 		cpuidle_install_idle_handler();
 		printk(KERN_INFO "cpuidle: using governor %s\n", gov->name);
 	}
diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
index 1be416b..5fea1ff 100644
--- a/include/linux/cpuidle.h
+++ b/include/linux/cpuidle.h
@@ -96,6 +96,7 @@ struct cpuidle_device {
 	struct cpuidle_state	*last_state;
 
 	struct list_head 	device_list;
+	struct cpuidle_driver	*drv;
 	struct kobject		kobj;
 	struct completion	kobj_unregister;
 	void			*governor_data;
@@ -125,6 +126,9 @@ static inline int cpuidle_get_last_residency(struct cpuidle_device *dev)
 struct cpuidle_driver {
 	char			name[CPUIDLE_NAME_LEN];
 	struct module 		*owner;
+	unsigned int		priority;
+	struct list_head	driver_list;
+	struct list_head	device_list;
 };
 
 #ifdef CONFIG_CPU_IDLE



* [RFC V2 3/3] cpuidle: default idle driver for x86
  2011-01-13 12:51 [RFC V2 0/3] cpuidle: Cleanup pm_idle and include driver/cpuidle.c in-kernel Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 1/3] cpuidle: Remove pm_idle pointer for x86 Trinabh Gupta
  2011-01-13 12:52 ` [RFC V2 2/3] cpuidle: list based cpuidle driver registration and selection Trinabh Gupta
@ 2011-01-13 12:52 ` Trinabh Gupta
  2011-01-13 15:47   ` Arjan van de Ven
  2 siblings, 1 reply; 6+ messages in thread
From: Trinabh Gupta @ 2011-01-13 12:52 UTC (permalink / raw)
  To: arjan, peterz, lenb, suresh.b.siddha, benh, venki, ak; +Cc: linux-kernel

This default cpuidle_driver parses the idle= boot parameters, selects
the optimal idle routine for x86 during bootup and registers it with
cpuidle. The code for the idle routines and for selecting the optimal
routine is moved from arch/x86/kernel/process.c. At module_init this
default driver is registered with cpuidle, and on non-ACPI platforms
it continues to be used. On ACPI platforms, the acpi_idle driver
replaces this driver at a later point during bootup. Until this
driver is registered, the architecture supplied compile time
default idle routine is called from within cpuidle_idle_call().
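
As a rough illustration of the idle= handling mentioned above, the
following userspace sketch mirrors the option handling of the
idle_setup() function that this patch removes from process.c. The
idle_opts structure, the idle_choice enum and parse_idle_param() are
hypothetical names used only for this sketch, and it returns -1 for
every error where the kernel code returns -EINVAL for a missing string.
Whether default_driver.c keeps exactly this logic should be checked
against the diff below.

```c
#include <assert.h>
#include <string.h>

enum idle_choice { IDLE_AUTO, IDLE_POLL, IDLE_HALT };

struct idle_opts {
	enum idle_choice choice;	/* routine forced on the command line */
	int force_mwait;		/* idle=mwait */
	int nomwait;			/* idle=nomwait */
	int override;			/* models boot_option_idle_override */
};

/* Returns 0 on success and -1 for a missing or unknown option. */
int parse_idle_param(const char *str, struct idle_opts *o)
{
	assert(o != NULL);
	if (!str)
		return -1;

	if (!strcmp(str, "poll")) {
		o->choice = IDLE_POLL;
	} else if (!strcmp(str, "mwait")) {
		o->force_mwait = 1;
	} else if (!strcmp(str, "halt")) {
		/* halt is forced, but the override flag stays untouched */
		o->choice = IDLE_HALT;
		return 0;
	} else if (!strcmp(str, "nomwait")) {
		/* disables mwait for C2/C3 without setting the override */
		o->nomwait = 1;
		return 0;
	} else {
		return -1;
	}

	o->override = 1;
	return 0;
}
```

Note that "halt" and "nomwait" return early so that the override flag is
left alone, which matches the comments in the removed code about still
allowing the CPU idle driver to load.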

To Do:
1. Currently for Xen, pm_idle is changed within
arch/x86/xen/setup.c. A simple cpuidle_driver for Xen
will have to be implemented.
2. This cannot be compiled as a module yet, as
select_idle_routine() is called from
arch/x86/kernel/cpu/common.c:identify_cpu(). This can
be cleaned up subsequently to allow the default driver
to be a loadable module.

Signed-off-by: Trinabh Gupta <trinabh@linux.vnet.ibm.com>
---

 arch/x86/kernel/process.c     |  340 -------------------------------
 arch/x86/xen/setup.c          |    1 
 drivers/idle/Makefile         |    2 
 drivers/idle/default_driver.c |  454 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 455 insertions(+), 342 deletions(-)
 create mode 100644 drivers/idle/default_driver.c

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 57d1868..6267f89 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -7,7 +7,6 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/pm.h>
-#include <linux/clockchips.h>
 #include <linux/random.h>
 #include <linux/user-return-notifier.h>
 #include <linux/dmi.h>
@@ -22,11 +21,6 @@
 #include <asm/i387.h>
 #include <asm/debugreg.h>
 
-unsigned long idle_halt;
-EXPORT_SYMBOL(idle_halt);
-unsigned long idle_nomwait;
-EXPORT_SYMBOL(idle_nomwait);
-
 struct kmem_cache *task_xstate_cachep;
 EXPORT_SYMBOL_GPL(task_xstate_cachep);
 
@@ -325,340 +319,6 @@ long sys_execve(const char __user *name,
 	return error;
 }
 
-/*
- * Idle related variables and functions
- */
-unsigned long boot_option_idle_override = 0;
-EXPORT_SYMBOL(boot_option_idle_override);
-
-/*
- * Powermanagement idle function, if any..
- */
-void (*pm_idle)(void);
-EXPORT_SYMBOL(pm_idle);
-
-#ifdef CONFIG_X86_32
-/*
- * This halt magic was a workaround for ancient floppy DMA
- * wreckage. It should be safe to remove.
- */
-static int hlt_counter;
-void disable_hlt(void)
-{
-	hlt_counter++;
-}
-EXPORT_SYMBOL(disable_hlt);
-
-void enable_hlt(void)
-{
-	hlt_counter--;
-}
-EXPORT_SYMBOL(enable_hlt);
-
-static inline int hlt_use_halt(void)
-{
-	return (!hlt_counter && boot_cpu_data.hlt_works_ok);
-}
-#else
-static inline int hlt_use_halt(void)
-{
-	return 1;
-}
-#endif
-
-/*
- * We use this if we don't have any better
- * idle routine..
- */
-void default_idle(void)
-{
-	if (hlt_use_halt()) {
-		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
-		current_thread_info()->status &= ~TS_POLLING;
-		/*
-		 * TS_POLLING-cleared state must be visible before we
-		 * test NEED_RESCHED:
-		 */
-		smp_mb();
-
-		if (!need_resched())
-			safe_halt();	/* enables interrupts racelessly */
-		else
-			local_irq_enable();
-		current_thread_info()->status |= TS_POLLING;
-	} else {
-		local_irq_enable();
-		/* loop is done by the caller */
-		cpu_relax();
-	}
-}
-#ifdef CONFIG_APM_MODULE
-EXPORT_SYMBOL(default_idle);
-#endif
-
-void stop_this_cpu(void *dummy)
-{
-	local_irq_disable();
-	/*
-	 * Remove this CPU:
-	 */
-	set_cpu_online(smp_processor_id(), false);
-	disable_local_APIC();
-
-	for (;;) {
-		if (hlt_works(smp_processor_id()))
-			halt();
-	}
-}
-
-static void do_nothing(void *unused)
-{
-}
-
-/*
- * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
- * pm_idle and update to new pm_idle value. Required while changing pm_idle
- * handler on SMP systems.
- *
- * Caller must have changed pm_idle to the new value before the call. Old
- * pm_idle value will not be used by any CPU after the return of this function.
- */
-void cpu_idle_wait(void)
-{
-	smp_mb();
-	/* kick all the CPUs so that they exit out of pm_idle */
-	smp_call_function(do_nothing, NULL, 1);
-}
-EXPORT_SYMBOL_GPL(cpu_idle_wait);
-
-/*
- * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
- * which can obviate IPI to trigger checking of need_resched.
- * We execute MONITOR against need_resched and enter optimized wait state
- * through MWAIT. Whenever someone changes need_resched, we would be woken
- * up from MWAIT (without an IPI).
- *
- * New with Core Duo processors, MWAIT can take some hints based on CPU
- * capability.
- */
-void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
-{
-	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
-	if (!need_resched()) {
-		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__mwait(ax, cx);
-	}
-}
-
-/* Default MONITOR/MWAIT with no hints, used for default C1 state */
-static void mwait_idle(void)
-{
-	if (!need_resched()) {
-		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
-		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
-		if (!need_resched())
-			__sti_mwait(0, 0);
-		else
-			local_irq_enable();
-	} else
-		local_irq_enable();
-}
-
-/*
- * On SMP it's slightly faster (but much more power-consuming!)
- * to poll the ->work.need_resched flag instead of waiting for the
- * cross-CPU IPI to arrive. Use this option with caution.
- */
-static void poll_idle(void)
-{
-	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
-	local_irq_enable();
-	while (!need_resched())
-		cpu_relax();
-	trace_power_end(0);
-}
-
-/*
- * mwait selection logic:
- *
- * It depends on the CPU. For AMD CPUs that support MWAIT this is
- * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings
- * then depend on a clock divisor and current Pstate of the core. If
- * all cores of a processor are in halt state (C1) the processor can
- * enter the C1E (C1 enhanced) state. If mwait is used this will never
- * happen.
- *
- * idle=mwait overrides this decision and forces the usage of mwait.
- */
-static int __cpuinitdata force_mwait;
-
-#define MWAIT_INFO			0x05
-#define MWAIT_ECX_EXTENDED_INFO		0x01
-#define MWAIT_EDX_C1			0xf0
-
-static int __cpuinit mwait_usable(const struct cpuinfo_x86 *c)
-{
-	u32 eax, ebx, ecx, edx;
-
-	if (force_mwait)
-		return 1;
-
-	if (c->cpuid_level < MWAIT_INFO)
-		return 0;
-
-	cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
-	/* Check, whether EDX has extended info about MWAIT */
-	if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
-		return 1;
-
-	/*
-	 * edx enumeratios MONITOR/MWAIT extensions. Check, whether
-	 * C1  supports MWAIT
-	 */
-	return (edx & MWAIT_EDX_C1);
-}
-
-bool c1e_detected;
-EXPORT_SYMBOL(c1e_detected);
-
-static cpumask_var_t c1e_mask;
-
-void c1e_remove_cpu(int cpu)
-{
-	if (c1e_mask != NULL)
-		cpumask_clear_cpu(cpu, c1e_mask);
-}
-
-/*
- * C1E aware idle routine. We check for C1E active in the interrupt
- * pending message MSR. If we detect C1E, then we handle it the same
- * way as C3 power states (local apic timer and TSC stop)
- */
-static void c1e_idle(void)
-{
-	if (need_resched())
-		return;
-
-	if (!c1e_detected) {
-		u32 lo, hi;
-
-		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
-
-		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
-			c1e_detected = true;
-			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
-				mark_tsc_unstable("TSC halt in AMD C1E");
-			printk(KERN_INFO "System has AMD C1E enabled\n");
-		}
-	}
-
-	if (c1e_detected) {
-		int cpu = smp_processor_id();
-
-		if (!cpumask_test_cpu(cpu, c1e_mask)) {
-			cpumask_set_cpu(cpu, c1e_mask);
-			/*
-			 * Force broadcast so ACPI can not interfere.
-			 */
-			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
-					   &cpu);
-			printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
-			       cpu);
-		}
-		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
-
-		default_idle();
-
-		/*
-		 * The switch back from broadcast mode needs to be
-		 * called with interrupts disabled.
-		 */
-		 local_irq_disable();
-		 clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
-		 local_irq_enable();
-	} else
-		default_idle();
-}
-
-void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
-{
-#ifdef CONFIG_SMP
-	if (pm_idle == poll_idle && smp_num_siblings > 1) {
-		printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
-			" performance may degrade.\n");
-	}
-#endif
-	if (pm_idle)
-		return;
-
-	if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
-		/*
-		 * One CPU supports mwait => All CPUs supports mwait
-		 */
-		printk(KERN_INFO "using mwait in idle threads.\n");
-		pm_idle = mwait_idle;
-	} else if (cpu_has_amd_erratum(amd_erratum_400)) {
-		/* E400: APIC timer interrupt does not wake up CPU from C1e */
-		printk(KERN_INFO "using C1E aware idle routine\n");
-		pm_idle = c1e_idle;
-	} else
-		pm_idle = default_idle;
-}
-
-void __init init_c1e_mask(void)
-{
-	/* If we're using c1e_idle, we need to allocate c1e_mask. */
-	if (pm_idle == c1e_idle)
-		zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
-}
-
-static int __init idle_setup(char *str)
-{
-	if (!str)
-		return -EINVAL;
-
-	if (!strcmp(str, "poll")) {
-		printk("using polling idle threads.\n");
-		pm_idle = poll_idle;
-	} else if (!strcmp(str, "mwait"))
-		force_mwait = 1;
-	else if (!strcmp(str, "halt")) {
-		/*
-		 * When the boot option of idle=halt is added, halt is
-		 * forced to be used for CPU idle. In such case CPU C2/C3
-		 * won't be used again.
-		 * To continue to load the CPU idle driver, don't touch
-		 * the boot_option_idle_override.
-		 */
-		pm_idle = default_idle;
-		idle_halt = 1;
-		return 0;
-	} else if (!strcmp(str, "nomwait")) {
-		/*
-		 * If the boot option of "idle=nomwait" is added,
-		 * it means that mwait will be disabled for CPU C2/C3
-		 * states. In such case it won't touch the variable
-		 * of boot_option_idle_override.
-		 */
-		idle_nomwait = 1;
-		return 0;
-	} else
-		return -1;
-
-	boot_option_idle_override = 1;
-	return 0;
-}
-early_param("idle", idle_setup);
-
 unsigned long arch_align_stack(unsigned long sp)
 {
 	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
index b5a7f92..ee93c83 100644
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -349,7 +349,6 @@ void __init xen_arch_setup(void)
 #ifdef CONFIG_X86_32
 	boot_cpu_data.hlt_works_ok = 1;
 #endif
-	pm_idle = default_idle;
 
 	fiddle_vdso();
 }
diff --git a/drivers/idle/Makefile b/drivers/idle/Makefile
index 23d295c..0039d60 100644
--- a/drivers/idle/Makefile
+++ b/drivers/idle/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_I7300_IDLE)			+= i7300_idle.o
 obj-$(CONFIG_INTEL_IDLE)			+= intel_idle.o
-
+obj-y						+= default_driver.o
diff --git a/drivers/idle/default_driver.c b/drivers/idle/default_driver.c
new file mode 100644
index 0000000..ccf6a45
--- /dev/null
+++ b/drivers/idle/default_driver.c
@@ -0,0 +1,454 @@
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/sched.h>
+#include <linux/cpuidle.h>
+#include <linux/clockchips.h>
+#include <linux/slab.h>
+#include <trace/events/power.h>
+#include <asm/mwait.h>
+
+unsigned long boot_option_idle_override = 0;
+EXPORT_SYMBOL(boot_option_idle_override);
+
+unsigned long idle_halt;
+EXPORT_SYMBOL(idle_halt);
+unsigned long idle_nomwait;
+EXPORT_SYMBOL(idle_nomwait);
+
+static struct cpuidle_state *opt_state;
+
+#ifdef CONFIG_X86_32
+/*
+ * This halt magic was a workaround for ancient floppy DMA
+ * wreckage. It should be safe to remove.
+ */
+static int hlt_counter;
+void disable_hlt(void)
+{
+	hlt_counter++;
+}
+EXPORT_SYMBOL(disable_hlt);
+
+void enable_hlt(void)
+{
+	hlt_counter--;
+}
+EXPORT_SYMBOL(enable_hlt);
+
+static inline int hlt_use_halt(void)
+{
+	return (!hlt_counter && boot_cpu_data.hlt_works_ok);
+}
+#else
+static inline int hlt_use_halt(void)
+{
+	return 1;
+}
+#endif
+
+/*
+ * We use this if we don't have any better
+ * idle routine..
+ */
+void default_idle(void)
+{
+	if (hlt_use_halt()) {
+		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		current_thread_info()->status &= ~TS_POLLING;
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
+
+		if (!need_resched())
+			safe_halt();	/* enables interrupts racelessly */
+		else
+			local_irq_enable();
+		current_thread_info()->status |= TS_POLLING;
+	} else {
+		local_irq_enable();
+		/* loop is done by the caller */
+		cpu_relax();
+	}
+}
+#ifdef CONFIG_APM_MODULE
+EXPORT_SYMBOL(default_idle);
+#endif
+
+void stop_this_cpu(void *dummy)
+{
+	local_irq_disable();
+	/*
+	 * Remove this CPU:
+	 */
+	set_cpu_online(smp_processor_id(), false);
+	disable_local_APIC();
+
+	for (;;) {
+		if (hlt_works(smp_processor_id()))
+			halt();
+	}
+}
+
+static void do_nothing(void *unused)
+{
+}
+
+/*
+ * cpu_idle_wait - Used to ensure that all the CPUs discard old value of
+ * pm_idle and update to new pm_idle value. Required while changing pm_idle
+ * handler on SMP systems.
+ *
+ * Caller must have changed pm_idle to the new value before the call. Old
+ * pm_idle value will not be used by any CPU after the return of this function.
+ */
+void cpu_idle_wait(void)
+{
+	smp_mb();
+	/* kick all the CPUs so that they exit out of pm_idle */
+	smp_call_function(do_nothing, NULL, 1);
+}
+EXPORT_SYMBOL_GPL(cpu_idle_wait);
+
+/*
+ * This uses new MONITOR/MWAIT instructions on P4 processors with PNI,
+ * which can obviate IPI to trigger checking of need_resched.
+ * We execute MONITOR against need_resched and enter optimized wait state
+ * through MWAIT. Whenever someone changes need_resched, we would be woken
+ * up from MWAIT (without an IPI).
+ *
+ * New with Core Duo processors, MWAIT can take some hints based on CPU
+ * capability.
+ */
+void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
+{
+	trace_power_start(POWER_CSTATE, (ax>>4)+1, smp_processor_id());
+	if (!need_resched()) {
+		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		smp_mb();
+		if (!need_resched())
+			__mwait(ax, cx);
+	}
+}
+
+/* Default MONITOR/MWAIT with no hints, used for default C1 state */
+static void mwait_idle(void)
+{
+	if (!need_resched()) {
+		trace_power_start(POWER_CSTATE, 1, smp_processor_id());
+		if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		smp_mb();
+		if (!need_resched())
+			__sti_mwait(0, 0);
+		else
+			local_irq_enable();
+	} else
+		local_irq_enable();
+}
+
+/*
+ * On SMP it's slightly faster (but much more power-consuming!)
+ * to poll the ->work.need_resched flag instead of waiting for the
+ * cross-CPU IPI to arrive. Use this option with caution.
+ */
+static void poll_idle(void)
+{
+	trace_power_start(POWER_CSTATE, 0, smp_processor_id());
+	local_irq_enable();
+	while (!need_resched())
+		cpu_relax();
+	trace_power_end(0);
+}
+
+/*
+ * mwait selection logic:
+ *
+ * It depends on the CPU. For AMD CPUs that support MWAIT this is
+ * wrong. Family 0x10 and 0x11 CPUs will enter C1 on HLT. Powersavings
+ * then depend on a clock divisor and current Pstate of the core. If
+ * all cores of a processor are in halt state (C1) the processor can
+ * enter the C1E (C1 enhanced) state. If mwait is used this will never
+ * happen.
+ *
+ * idle=mwait overrides this decision and forces the usage of mwait.
+ */
+static int __cpuinitdata force_mwait;
+
+#define MWAIT_INFO			0x05
+#define MWAIT_ECX_EXTENDED_INFO		0x01
+#define MWAIT_EDX_C1			0xf0
+
+static int __cpuinit mwait_usable(const struct cpuinfo_x86 *c)
+{
+	u32 eax, ebx, ecx, edx;
+
+	if (force_mwait)
+		return 1;
+
+	if (c->cpuid_level < MWAIT_INFO)
+		return 0;
+
+	cpuid(MWAIT_INFO, &eax, &ebx, &ecx, &edx);
+	/* Check whether EDX has extended info about MWAIT */
+	if (!(ecx & MWAIT_ECX_EXTENDED_INFO))
+		return 1;
+
+	/*
+	 * EDX enumerates MONITOR/MWAIT extensions. Check whether
+	 * C1 supports MWAIT.
+	 */
+	return (edx & MWAIT_EDX_C1);
+}
+
+bool c1e_detected;
+EXPORT_SYMBOL(c1e_detected);
+
+static cpumask_var_t c1e_mask;
+
+void c1e_remove_cpu(int cpu)
+{
+	if (c1e_mask != NULL)
+		cpumask_clear_cpu(cpu, c1e_mask);
+}
+
+/*
+ * C1E aware idle routine. We check for C1E active in the interrupt
+ * pending message MSR. If we detect C1E, then we handle it the same
+ * way as C3 power states (local apic timer and TSC stop)
+ */
+static void c1e_idle(void)
+{
+	if (need_resched())
+		return;
+
+	if (!c1e_detected) {
+		u32 lo, hi;
+
+		rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
+
+		if (lo & K8_INTP_C1E_ACTIVE_MASK) {
+			c1e_detected = true;
+			if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
+				mark_tsc_unstable("TSC halt in AMD C1E");
+			printk(KERN_INFO "System has AMD C1E enabled\n");
+		}
+	}
+
+	if (c1e_detected) {
+		int cpu = smp_processor_id();
+
+		if (!cpumask_test_cpu(cpu, c1e_mask)) {
+			cpumask_set_cpu(cpu, c1e_mask);
+			/*
+			 * Force broadcast so ACPI can not interfere.
+			 */
+			clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_FORCE,
+					   &cpu);
+			printk(KERN_INFO "Switch to broadcast mode on CPU%d\n",
+			       cpu);
+		}
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
+
+		default_idle();
+
+		/*
+		 * The switch back from broadcast mode needs to be
+		 * called with interrupts disabled.
+		 */
+		local_irq_disable();
+		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
+		local_irq_enable();
+	} else
+		default_idle();
+}
+
+static int poll_idle_wrapper(struct cpuidle_device *dev,
+			struct cpuidle_state *state)
+{
+	poll_idle();
+	return 0;
+}
+
+static int mwait_idle_wrapper(struct cpuidle_device *dev,
+			struct cpuidle_state *state)
+{
+	mwait_idle();
+	return 0;
+}
+
+static int c1e_idle_wrapper(struct cpuidle_device *dev,
+			struct cpuidle_state *state)
+{
+	c1e_idle();
+	return 0;
+}
+
+static int default_idle_wrapper(struct cpuidle_device *dev,
+			struct cpuidle_state *state)
+{
+	default_idle();
+	return 0;
+}
+
+static struct cpuidle_state state_poll = {
+		.name = "POLL",
+		.desc = "POLL",
+		.driver_data = (void *) 0x00,
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 1,
+		.target_residency = 1,
+		.enter = &poll_idle_wrapper,
+};
+
+static struct cpuidle_state state_mwait = {
+		.name = "C1",
+		.desc = "MWAIT No Hints",
+		.driver_data = (void *) 0x01,
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 1,
+		.target_residency = 1,
+		.enter = &mwait_idle_wrapper,
+};
+
+static struct cpuidle_state state_c1e = {
+		.name = "C1E",
+		.desc = "C1E",
+		.driver_data = (void *) 0x02,
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 1,
+		.target_residency = 1,
+		.enter = &c1e_idle_wrapper,
+};
+
+static struct cpuidle_state state_default_idle = {
+		.name = "DEFAULT-IDLE",
+		.desc = "Default idle routine",
+		.driver_data = (void *) 0x03,
+		.flags = CPUIDLE_FLAG_TIME_VALID,
+		.exit_latency = 1,
+		.target_residency = 1,
+		.enter = &default_idle_wrapper,
+};
+
+void __cpuinit select_idle_routine(const struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+	if (opt_state == &state_poll && smp_num_siblings > 1) {
+		printk_once(KERN_WARNING "WARNING: polling idle and HT enabled,"
+			" performance may degrade.\n");
+	}
+#endif
+	if (opt_state)
+		return;
+
+	if (cpu_has(c, X86_FEATURE_MWAIT) && mwait_usable(c)) {
+		/*
+		 * One CPU supports mwait => All CPUs supports mwait
+		 */
+		printk(KERN_INFO "using mwait in idle threads.\n");
+		opt_state = &state_mwait;
+	} else if (cpu_has_amd_erratum(amd_erratum_400)) {
+		/* E400: APIC timer interrupt does not wake up CPU from C1e */
+		printk(KERN_INFO "using C1E aware idle routine\n");
+		opt_state = &state_c1e;
+	} else
+		opt_state = &state_default_idle;
+}
+
+void __init init_c1e_mask(void)
+{
+	/* If we're using c1e_idle, we need to allocate c1e_mask. */
+	if (opt_state == &state_c1e)
+		zalloc_cpumask_var(&c1e_mask, GFP_KERNEL);
+}
+
+static int __init idle_setup(char *str)
+{
+	if (!str)
+		return -EINVAL;
+
+	if (!strcmp(str, "poll")) {
+		printk(KERN_INFO "using polling idle threads.\n");
+		opt_state = &state_poll;
+	} else if (!strcmp(str, "mwait"))
+		force_mwait = 1;
+	else if (!strcmp(str, "halt")) {
+		/*
+		 * When the boot option of idle=halt is added, halt is
+		 * forced to be used for CPU idle. In such case CPU C2/C3
+		 * won't be used again.
+		 * To continue to load the CPU idle driver, don't touch
+		 * the boot_option_idle_override.
+		 */
+		opt_state = &state_default_idle;
+		idle_halt = 1;
+		return 0;
+	} else if (!strcmp(str, "nomwait")) {
+		/*
+		 * If the boot option of "idle=nomwait" is added,
+		 * it means that mwait will be disabled for CPU C2/C3
+		 * states. In such case it won't touch the variable
+		 * of boot_option_idle_override.
+		 */
+		idle_nomwait = 1;
+		return 0;
+	} else
+		return -1;
+
+	boot_option_idle_override = 1;
+	return 0;
+}
+early_param("idle", idle_setup);
+
+static struct cpuidle_driver default_idle_driver = {
+	.name = "default_idle",
+	.owner = THIS_MODULE,
+	.priority = 100,
+};
+
+static int setup_cpuidle(int cpu)
+{
+	struct cpuidle_device *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	int count = CPUIDLE_DRIVER_STATE_START;
+
+	/* kzalloc() can fail; do not dereference a NULL device. */
+	if (!dev)
+		return -ENOMEM;
+	dev->cpu = cpu;
+	dev->drv = &default_idle_driver;
+	dev->states[count++] = *opt_state;
+	dev->state_count = count;
+
+	if (cpuidle_register_device(dev))
+		return -EIO;
+	return 0;
+}
+
+static int __init default_idle_init(void)
+{
+	int retval, i;
+
+	retval = cpuidle_register_driver(&default_idle_driver);
+	if (retval)
+		return retval;
+	for_each_online_cpu(i)
+		setup_cpuidle(i);
+	return 0;
+}
+
+
+static void  __exit default_idle_exit(void)
+{
+	cpuidle_unregister_driver(&default_idle_driver);
+	return;
+}
+
+module_init(default_idle_init);
+module_exit(default_idle_exit);
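
As a reading aid for the patch above, here is a minimal userspace sketch of
how idle_setup() and select_idle_routine() interact. It is not kernel code.
The names idle_choice, idle_opts, cpu_caps, parse_idle_option() and
select_idle() are hypothetical stand-ins, and the CPUID probing done by
mwait_usable() is collapsed into one assumed boolean (mwait_c1_ok). Unlike
the patch, a NULL string yields -1 here rather than -EINVAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the idle states the patch registers. */
enum idle_choice { IDLE_UNSET, IDLE_POLL, IDLE_MWAIT, IDLE_C1E, IDLE_DEFAULT };

/* Models opt_state plus the flags that idle_setup() writes. */
struct idle_opts {
	enum idle_choice forced;	/* set by idle=poll or idle=halt */
	bool force_mwait;		/* set by idle=mwait */
	bool idle_halt;			/* set by idle=halt */
	bool idle_nomwait;		/* set by idle=nomwait */
	bool override;			/* models boot_option_idle_override */
};

/* Assumed CPU description; the real code queries CPUID and errata tables. */
struct cpu_caps {
	bool has_mwait;		/* X86_FEATURE_MWAIT present */
	bool mwait_c1_ok;	/* stands in for the CPUID leaf 5 checks */
	bool amd_e400;		/* AMD erratum 400 applies */
};

/* Models idle_setup(): 0 on success, -1 on a missing or unknown value. */
int parse_idle_option(struct idle_opts *o, const char *str)
{
	assert(o != NULL);
	if (!str)
		return -1;

	if (!strcmp(str, "poll")) {
		o->forced = IDLE_POLL;
	} else if (!strcmp(str, "mwait")) {
		o->force_mwait = true;
	} else if (!strcmp(str, "halt")) {
		/* halt and nomwait return early and leave override untouched */
		o->forced = IDLE_DEFAULT;
		o->idle_halt = true;
		return 0;
	} else if (!strcmp(str, "nomwait")) {
		o->idle_nomwait = true;
		return 0;
	} else {
		return -1;
	}

	o->override = true;
	return 0;
}

/* Models select_idle_routine(): a boot-time choice wins over detection. */
enum idle_choice select_idle(const struct idle_opts *o,
			     const struct cpu_caps *c)
{
	if (o->forced != IDLE_UNSET)
		return o->forced;
	if (c->has_mwait && (o->force_mwait || c->mwait_c1_ok))
		return IDLE_MWAIT;
	if (c->amd_e400)
		return IDLE_C1E;
	return IDLE_DEFAULT;
}
```

In this model, parsing "halt" on an MWAIT-capable CPU makes select_idle()
return IDLE_DEFAULT while override stays false, which is the behaviour the
comment in idle_setup() describes for boot_option_idle_override.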



* Re: [RFC V2 3/3] cpuidle: default idle driver for x86
  2011-01-13 12:52 ` [RFC V2 3/3] cpuidle: default idle driver for x86 Trinabh Gupta
@ 2011-01-13 15:47   ` Arjan van de Ven
  2011-01-13 16:22     ` Vaidyanathan Srinivasan
  0 siblings, 1 reply; 6+ messages in thread
From: Arjan van de Ven @ 2011-01-13 15:47 UTC (permalink / raw)
  To: Trinabh Gupta
  Cc: peterz, lenb, suresh.b.siddha, benh, venki, ak, linux-kernel

On 1/13/2011 4:52 AM, Trinabh Gupta wrote:
> This default cpuidle_driver parses idle= boot parameters, selects
> the optimal idle routine for x86 during bootup and registers with
> cpuidle. The code for idle routines and the selection of optimal
> routine is moved from arch/x86/kernel/process.c . At module_init this
> default driver is registered with cpuidle and for non ACPI platforms
> it continues to be used. For ACPI platforms, acpi_idle driver would
> replace this driver at a later point in time during bootup. Until
> this driver's registration, architecture supplied compile time
> default idle routine is called from within cpuidle_idle_call().
>


I like the general approach, but I'd think making idle drivers modular
is going one step too far....
that looks like waaay overkill to me (also since most of the actual idle
handlers are so small that the overhead of the exported symbols alone is
bigger than the idle handlers)



* Re: [RFC V2 3/3] cpuidle: default idle driver for x86
  2011-01-13 15:47   ` Arjan van de Ven
@ 2011-01-13 16:22     ` Vaidyanathan Srinivasan
  0 siblings, 0 replies; 6+ messages in thread
From: Vaidyanathan Srinivasan @ 2011-01-13 16:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Trinabh Gupta, peterz, lenb, suresh.b.siddha, benh, venki, ak,
	linux-kernel

* Arjan van de Ven <arjan@linux.intel.com> [2011-01-13 07:47:16]:

> On 1/13/2011 4:52 AM, Trinabh Gupta wrote:
> >This default cpuidle_driver parses idle= boot parameters, selects
> >the optimal idle routine for x86 during bootup and registers with
> >cpuidle. The code for idle routines and the selection of optimal
> >routine is moved from arch/x86/kernel/process.c . At module_init this
> >default driver is registered with cpuidle and for non ACPI platforms
> >it continues to be used. For ACPI platforms, acpi_idle driver would
> >replace this driver at a later point in time during bootup. Until
> >this driver's registration, architecture supplied compile time
> >default idle routine is called from within cpuidle_idle_call().
> >
> 
> 
> I like the general approach, but I'd think making idle drivers
> modular is going one step too far....
> that looks like waaay overkill to me (also since most of the actual
> idle handlers are so small
> that the overhead of the exported symbols alone is bigger than the
> idle handlers)

Agreed.  The idea of keeping them in a module is to keep the code
around for legacy systems and not have to use that code at all on
modern systems where the ACPI driver can take over.

The current RFC has not yet managed to pull all of the code out into
the module, but it suggests that doing so is possible.

We can keep them in-kernel and move to the next iteration, where we
should ensure that other architectures can also co-exist with pm_idle()
and slowly move to this approach.

--Vaidy
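
The "ACPI driver can take over" behaviour relies on the priority-based
selection described in the cover letter. The following is a minimal
userspace sketch of that idea, not the code from patch 2/3. The names
idle_driver, register_driver(), unregister_driver() and best_driver() are
hypothetical, the priority of 200 used for an ACPI driver in the usage note
is invented for illustration, and the sketch assumes that a larger priority
value wins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpuidle_driver on a doubly linked list. */
struct idle_driver {
	const char *name;
	int priority;			/* assumption: larger value wins */
	struct idle_driver *prev;
	struct idle_driver *next;
};

static struct idle_driver *drivers;	/* head of the registered list */

/* Insert a driver at the head of the list. */
void register_driver(struct idle_driver *d)
{
	assert(d != NULL);
	d->prev = NULL;
	d->next = drivers;
	if (drivers)
		drivers->prev = d;
	drivers = d;
}

/* Unlink a previously registered driver from the list. */
void unregister_driver(struct idle_driver *d)
{
	assert(d != NULL);
	if (d->prev)
		d->prev->next = d->next;
	else
		drivers = d->next;
	if (d->next)
		d->next->prev = d->prev;
	d->prev = NULL;
	d->next = NULL;
}

/* Return the highest-priority registered driver, or NULL if none. */
struct idle_driver *best_driver(void)
{
	struct idle_driver *best = NULL;
	struct idle_driver *d;

	for (d = drivers; d; d = d->next)
		if (!best || d->priority > best->priority)
			best = d;
	return best;
}
```

With a "default_idle" driver at priority 100 and an "acpi_idle" driver at
the assumed priority 200, best_driver() returns the ACPI driver while it is
registered and falls back to the default driver once it is unregistered.
Because the fallback is just another list entry, no caller has to save and
later restore a function pointer.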



