* Re: [PATCH 6/6] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
From: Tejun Heo @ 2011-08-31 7:03 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: containers, lizf, linux-kernel, linux-pm, paul, kamezawa.hiroyu
In-Reply-To: <20110830201030.GC15953@somewhere.redhat.com>
Hello, Frederic.
On Tue, Aug 30, 2011 at 10:10:32PM +0200, Frederic Weisbecker wrote:
> In order to keep the fix queued in -mm (https://lkml.org/lkml/2011/8/26/262)
> the tasks that have failed to migrate should be removed from the iterator
> so that they are not included in the batch in ->attach().
I don't think that's a good approach. It breaks the symmetry when
calling different callbacks. What if ->can_attach() allocates
per-task resources and the task exits in the middle? I think we
better lock down fork/exit/exec. I'll send patches but I'm currently
moving / traveling w/ limited access to my toys so it might take some
time.
Thanks.
--
tejun
* [PATCH v9 0/4] Devfreq, DVFS Framework for Non-CPU Devices
From: MyungJoo Ham @ 2011-08-31 7:29 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
The main update from the patchset v8:
- Add per-devfreq-device locking mechanism (devfreq->lock)
- Provide the per-devfreq-device locking mechanism to governors
- Merged 4/5 patch to 2/5 (devfreq internal interface for governors)
Patch 1/4 has no changes.
Patch 2/4 has a major update on synchronization and merges another patch; thus, the "Reviewed-by" tag has been dropped.
Patch 3/4 has a minor update (affected by the update on 2/4: mutex added)
Patch 4/4 has a minor update (affected by the update on 2/4: mutex added) + removed unused variable.
For a usage example, please look at
http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads/devfreq
In the above git tree, DVFS (dynamic voltage and frequency scaling) mechanism
is applied to the memory bus of Exynos4210 for Exynos4210-NURI boards.
In the example, the LPDDR2 DRAM frequency changes between 133, 266, and 400MHz
and other related clocks simply follow the determined DDR RAM clock.
The devfreq driver for Exynos4210 memory bus is at
/drivers/devfreq/exynos4210_memorybus.c in the git tree.
In the dd test (writing and reading 360MiB) on the NURI board, the memory
throughput was unchanged (no performance deterioration) while the SoC power
consumption was reduced by 1%. When the memory access is not that intense
while the CPU is heavily used, the SoC power consumption was reduced by 6%.
The power consumption was compared with the case using the conventional
Exynos4210 cpufreq driver, which sets the memory bus frequency according to
the CPU core frequency. In addition, when the CPU core is running slow and
the memory access is intense, the performance (memory throughput) increased
by 11% (with 5% higher SoC power consumption). The tested governor is
"simple-ondemand".
MyungJoo Ham (4):
PM / OPP: Add OPP availability change notifier.
PM: Introduce devfreq: generic DVFS framework with device-specific
OPPs
PM / devfreq: add common sysfs interfaces
PM / devfreq: add basic governors
Documentation/ABI/testing/sysfs-devices-power | 46 ++
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/base/power/opp.c | 29 ++
drivers/devfreq/Kconfig | 75 ++++
drivers/devfreq/Makefile | 5 +
drivers/devfreq/devfreq.c | 567 +++++++++++++++++++++++++
drivers/devfreq/governor.h | 22 +
drivers/devfreq/governor_performance.c | 24 +
drivers/devfreq/governor_powersave.c | 24 +
drivers/devfreq/governor_simpleondemand.c | 88 ++++
drivers/devfreq/governor_userspace.c | 126 ++++++
include/linux/devfreq.h | 160 +++++++
include/linux/opp.h | 12 +
14 files changed, 1182 insertions(+), 0 deletions(-)
create mode 100644 drivers/devfreq/Kconfig
create mode 100644 drivers/devfreq/Makefile
create mode 100644 drivers/devfreq/devfreq.c
create mode 100644 drivers/devfreq/governor.h
create mode 100644 drivers/devfreq/governor_performance.c
create mode 100644 drivers/devfreq/governor_powersave.c
create mode 100644 drivers/devfreq/governor_simpleondemand.c
create mode 100644 drivers/devfreq/governor_userspace.c
create mode 100644 include/linux/devfreq.h
--
1.7.4.1
* [PATCH v9 1/4] PM / OPP: Add OPP availability change notifier.
From: MyungJoo Ham @ 2011-08-31 7:29 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1314775779-21399-1-git-send-email-myungjoo.ham@samsung.com>
This patch allows a notifier_block to be registered for an OPP device in
order to get notified of any changes in the availability of the device's
OPPs. For example, if a new OPP is inserted or the enable/disable status
of an OPP is changed, the notifier is executed.
This allows opp_add, opp_enable, and opp_disable to take effect directly
on any connected entities such as cpufreq or devfreq.
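To illustrate the notification flow described above, here is a minimal
userspace sketch. It is not kernel code: the model_* types and functions are
hypothetical stand-ins for the SRCU notifier chain, and only the opp_event
values are taken from this patch. The subscriber plays the role that devfreq
takes in patch 2/4, re-evaluating its state on every OPP availability change.

```c
#include <assert.h>
#include <stddef.h>

/* Event values as introduced by this patch in include/linux/opp.h. */
enum opp_event {
	OPP_EVENT_ADD, OPP_EVENT_ENABLE, OPP_EVENT_DISABLE,
};

/* Hypothetical, simplified stand-in for struct notifier_block. */
struct model_notifier {
	int (*call)(struct model_notifier *nb, unsigned long event, void *data);
	struct model_notifier *next;
};

/* Hypothetical stand-in for the per-device notifier head. */
struct model_chain {
	struct model_notifier *head;
};

/* Add a subscriber to the front of the chain. */
static void model_chain_register(struct model_chain *c,
				 struct model_notifier *nb)
{
	nb->next = c->head;
	c->head = nb;
}

/* Invoke every subscriber; returns how many callbacks were run. */
static int model_chain_call(struct model_chain *c, unsigned long event,
			    void *data)
{
	struct model_notifier *nb;
	int calls = 0;

	for (nb = c->head; nb; nb = nb->next) {
		nb->call(nb, event, data);
		calls++;
	}
	return calls;
}

/* Hypothetical subscriber that embeds its notifier, as devfreq does. */
struct model_devfreq {
	struct model_notifier nb;
	int reevaluations;
	unsigned long last_event;
};

/* Recover the enclosing object from the embedded notifier and react. */
static int model_devfreq_update(struct model_notifier *nb,
				unsigned long event, void *data)
{
	struct model_devfreq *df = (struct model_devfreq *)
		((char *)nb - offsetof(struct model_devfreq, nb));

	(void)data;
	df->reevaluations++;
	df->last_event = event;
	return 0;
}
```

In this model, calling model_chain_call() with OPP_EVENT_ADD or
OPP_EVENT_DISABLE corresponds to what opp_add() and opp_disable() trigger
through srcu_notifier_call_chain() in the patch.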
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Mike Turquette <mturquette@ti.com>
---
No changes since v7
Added at devfreq patch set v6 replacing devfreq_update calls at OPP.
---
drivers/base/power/opp.c | 29 +++++++++++++++++++++++++++++
include/linux/opp.h | 12 ++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index b23de18..e6b4c89 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -73,6 +73,7 @@ struct opp {
* RCU usage: nodes are not modified in the list of device_opp,
* however addition is possible and is secured by dev_opp_list_lock
* @dev: device pointer
+ * @head: notifier head to notify the OPP availability changes.
* @opp_list: list of opps
*
* This is an internal data structure maintaining the link to opps attached to
@@ -83,6 +84,7 @@ struct device_opp {
struct list_head node;
struct device *dev;
+ struct srcu_notifier_head head;
struct list_head opp_list;
};
@@ -404,6 +406,7 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
}
dev_opp->dev = dev;
+ srcu_init_notifier_head(&dev_opp->head);
INIT_LIST_HEAD(&dev_opp->opp_list);
/* Secure the device list modification */
@@ -428,6 +431,11 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
list_add_rcu(&new_opp->node, head);
mutex_unlock(&dev_opp_list_lock);
+ /*
+ * Notify the changes in the availability of the operable
+ * frequency/voltage list.
+ */
+ srcu_notifier_call_chain(&dev_opp->head, OPP_EVENT_ADD, new_opp);
return 0;
}
@@ -504,6 +512,14 @@ static int opp_set_availability(struct device *dev, unsigned long freq,
mutex_unlock(&dev_opp_list_lock);
synchronize_rcu();
+ /* Notify the change of the OPP availability */
+ if (availability_req)
+ srcu_notifier_call_chain(&dev_opp->head, OPP_EVENT_ENABLE,
+ new_opp);
+ else
+ srcu_notifier_call_chain(&dev_opp->head, OPP_EVENT_DISABLE,
+ new_opp);
+
/* clean up old opp */
new_opp = opp;
goto out;
@@ -643,3 +659,16 @@ void opp_free_cpufreq_table(struct device *dev,
*table = NULL;
}
#endif /* CONFIG_CPU_FREQ */
+
+/** opp_get_notifier() - find notifier_head of the device with opp
+ * @dev: device pointer used to lookup device OPPs.
+ */
+struct srcu_notifier_head *opp_get_notifier(struct device *dev)
+{
+ struct device_opp *dev_opp = find_device_opp(dev);
+
+ if (IS_ERR(dev_opp))
+ return ERR_PTR(PTR_ERR(dev_opp)); /* matching type */
+
+ return &dev_opp->head;
+}
diff --git a/include/linux/opp.h b/include/linux/opp.h
index 7020e97..87a9208 100644
--- a/include/linux/opp.h
+++ b/include/linux/opp.h
@@ -16,9 +16,14 @@
#include <linux/err.h>
#include <linux/cpufreq.h>
+#include <linux/notifier.h>
struct opp;
+enum opp_event {
+ OPP_EVENT_ADD, OPP_EVENT_ENABLE, OPP_EVENT_DISABLE,
+};
+
#if defined(CONFIG_PM_OPP)
unsigned long opp_get_voltage(struct opp *opp);
@@ -40,6 +45,8 @@ int opp_enable(struct device *dev, unsigned long freq);
int opp_disable(struct device *dev, unsigned long freq);
+struct srcu_notifier_head *opp_get_notifier(struct device *dev);
+
#else
static inline unsigned long opp_get_voltage(struct opp *opp)
{
@@ -89,6 +96,11 @@ static inline int opp_disable(struct device *dev, unsigned long freq)
{
return 0;
}
+
+static inline struct srcu_notifier_head *opp_get_notifier(struct device *dev)
+{
+ return ERR_PTR(-EINVAL);
+}
#endif /* CONFIG_PM */
#if defined(CONFIG_CPU_FREQ) && defined(CONFIG_PM_OPP)
--
1.7.4.1
* [PATCH v9 2/4] PM: Introduce devfreq: generic DVFS framework with device-specific OPPs
From: MyungJoo Ham @ 2011-08-31 7:29 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1314775779-21399-1-git-send-email-myungjoo.ham@samsung.com>
With OPPs, a device may have multiple operable frequency and voltage
sets. However, there can be multiple possible operable sets and a system
will need to choose one from them. In order to reduce the power
consumption (by reducing frequency and voltage) without affecting the
performance too much, a Dynamic Voltage and Frequency Scaling (DVFS)
scheme may be used.
This patch introduces the DVFS capability to non-CPU devices with OPPs.
DVFS is a technique whereby the frequency and supplied voltage of a
device are adjusted on-the-fly. DVFS usually sets the frequency as low
as possible with given conditions (such as QoS assurance) and adjusts
voltage according to the chosen frequency in order to reduce power
consumption and heat dissipation.
The generic DVFS for devices, devfreq, may appear quite similar to
/drivers/cpufreq. However, cpufreq does not allow multiple devices to be
registered and is not suitable for multiple heterogeneous devices with
different (but simple) governors.
Normally, a DVFS mechanism controls the frequency based on the demand for
the device, and then chooses the voltage based on the chosen frequency.
devfreq likewise controls the frequency based on the governor's frequency
recommendation and lets OPP pick the frequency/voltage pair based on the
recommended frequency. Then, the chosen OPP is passed to the device
driver's "target" callback.
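The selection step described above can be sketched in userspace as follows.
This is only a model under stated assumptions: struct model_opp and the
model_* helpers are hypothetical and are not the kernel's struct opp or the
opp_find_freq_ceil()/opp_find_freq_floor() implementations. The frequencies
follow the 133/266/400MHz example from the cover letter; the voltages are
made up for illustration. The order (try the ceiling first, fall back to the
floor) mirrors what devfreq_do() does in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an OPP entry. */
struct model_opp {
	unsigned long freq_khz;
	unsigned long u_volt;	/* illustrative values only */
	int available;
};

static struct model_opp model_table[] = {
	{ 133000,  950000, 1 },
	{ 266000, 1000000, 1 },
	{ 400000, 1100000, 1 },
};
#define MODEL_TABLE_LEN (sizeof(model_table) / sizeof(model_table[0]))

/* Lowest available OPP with freq >= want, or NULL if there is none. */
static const struct model_opp *model_find_ceil(const struct model_opp *t,
					       size_t n, unsigned long want)
{
	const struct model_opp *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!t[i].available || t[i].freq_khz < want)
			continue;
		if (!best || t[i].freq_khz < best->freq_khz)
			best = &t[i];
	}
	return best;
}

/* Highest available OPP with freq <= want, or NULL if there is none. */
static const struct model_opp *model_find_floor(const struct model_opp *t,
						size_t n, unsigned long want)
{
	const struct model_opp *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!t[i].available || t[i].freq_khz > want)
			continue;
		if (!best || t[i].freq_khz > best->freq_khz)
			best = &t[i];
	}
	return best;
}

/* Governor recommends a frequency; pick the ceiling, else the floor. */
static const struct model_opp *model_pick_opp(const struct model_opp *t,
					      size_t n,
					      unsigned long recommended)
{
	const struct model_opp *opp = model_find_ceil(t, n, recommended);

	if (!opp)
		opp = model_find_floor(t, n, recommended);
	return opp;
}
```

With the table above, a recommendation of 200000 selects the 266000 entry,
and a recommendation above every entry falls back to the highest available
one. The selected entry is what a driver's "target" callback would receive.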
When PM QoS is going to be used with the devfreq device, the device
driver should enable the OPPs that are appropriate for the current PM QoS
requests. In order to do so, the device driver may call opp_enable and
opp_disable at the notifier callback of PM QoS so that PM QoS's
update_target() call enables the appropriate OPPs. Note that at least
one OPP should be enabled at any time; be careful when there is a
transition.
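One possible way to respect the "at least one OPP enabled" rule during a
transition is to enable the newly allowed OPPs before disabling the ones that
are no longer allowed. The sketch below models that ordering in userspace; it
is an assumption about how a driver might do it, not something this patch
prescribes. struct qos_opp and the qos_* helpers are hypothetical; in a real
driver the two passes would call opp_enable() and opp_disable().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for an OPP entry. */
struct qos_opp {
	unsigned long freq_khz;
	int available;
};

/* Count the OPPs that are currently enabled. */
static size_t qos_count_available(const struct qos_opp *t, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		count += t[i].available ? 1 : 0;
	return count;
}

/*
 * Apply a minimum-frequency request. Returns the smallest number of
 * enabled OPPs observed at any point during the transition.
 */
static size_t qos_apply_min_freq(struct qos_opp *t, size_t n,
				 unsigned long min_khz)
{
	size_t i, cur, lowest = qos_count_available(t, n);
	int any_ok = 0;

	/* Pass 1: enable everything that satisfies the request. */
	for (i = 0; i < n; i++) {
		if (t[i].freq_khz >= min_khz) {
			t[i].available = 1;
			any_ok = 1;
		}
	}

	/* No OPP can satisfy the request: leave the table untouched. */
	if (!any_ok)
		return lowest;

	/* Pass 2: only now disable the OPPs below the request. */
	for (i = 0; i < n; i++) {
		if (t[i].freq_khz < min_khz && t[i].available) {
			t[i].available = 0;
			cur = qos_count_available(t, n);
			if (cur < lowest)
				lowest = cur;
		}
	}
	return lowest;
}
```

Doing the passes in the opposite order could leave the table momentarily
empty when the old and new sets of allowed OPPs do not overlap, which is the
situation the note above warns about.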
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
The test code with board support for Exynos4-NURI is at
http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads/devfreq
---
Thank you for your valuable comments, Rafael, Greg, Pavel, Colin, Mike,
and Kevin.
Changed from v8
- Merged patch 4/5 of v8 (internal interfaces for governors)
- Added lock (mutex) to struct devfreq
- Uses devfreq->lock to access elements in devfreq.
- Added kerneldoc entries for init/exit callbacks of governors.
- The caller of update_devfreq() in governor.h should lock
devfreq->lock before calling it.
- Added comment on the usage of lock in struct devfreq.
- Revised devfreq_add_device error handling
At v8, there are no changes since v7
Changes from v6
- Type revised for timing variables
- Removed unnecessary code and variable
Changes at v6-resubmit from v6
- Use jiffy directly instead of ktime
- Be prepared for profile->polling_ms changes (not supported fully at
this stage)
Changes from v5
- Uses OPP availability change notifier
- Removed devfreq_interval. Uses one jiffy instead. DEVFREQ adjusts
polling interval based on the interval requirement of DEVFREQ
devices.
- Moved devfreq to /drivers/devfreq to accommodate devfreq-related files
including governors and devfreq drivers.
- Coding style revised.
- Updated devfreq_add_device interface to get tunable values.
Changed from v4
- Removed tickle, which is a duplicated feature; PM QoS can do the same.
- Allow to extend polling interval if devices have longer polling intervals.
- Relocated private data of governors.
- Removed system-wide sysfs
Changed from v3
- In kerneldoc comments, DEVFREQ has been replaced by devfreq
- Revised removing devfreq entries with error mechanism
- Added and revised comments
- Removed unnecessary codes
- Allow to give a name to a governor
- Bugfix: a tickle call may cancel an older tickle call that is still in
effect.
Changed from v2
- Code style revised and cleaned up.
- Remove DEVFREQ entries that incur errors except for EAGAIN
- Bug fixed: tickle for devices without polling governors
Changes from v1(RFC)
- Rename: DVFS --> DEVFREQ
- Revised governor design
. Governor receives the whole struct devfreq
. Governor should gather usage information (thru get_dev_status) itself
- Periodic monitoring runs only when needed.
- DEVFREQ no more deals with voltage information directly
- Removed some printks.
- Some cosmetics update
- Use freezable_wq.
---
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/devfreq/Kconfig | 39 +++++
drivers/devfreq/Makefile | 1 +
drivers/devfreq/devfreq.c | 364 ++++++++++++++++++++++++++++++++++++++++++++
drivers/devfreq/governor.h | 22 +++
include/linux/devfreq.h | 123 +++++++++++++++
7 files changed, 553 insertions(+), 0 deletions(-)
create mode 100644 drivers/devfreq/Kconfig
create mode 100644 drivers/devfreq/Makefile
create mode 100644 drivers/devfreq/devfreq.c
create mode 100644 drivers/devfreq/governor.h
create mode 100644 include/linux/devfreq.h
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 95b9e7e..a1efd75 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -130,4 +130,6 @@ source "drivers/iommu/Kconfig"
source "drivers/virt/Kconfig"
+source "drivers/devfreq/Kconfig"
+
endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 7fa433a..97c957b 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -127,3 +127,5 @@ obj-$(CONFIG_IOMMU_SUPPORT) += iommu/
# Virtualization drivers
obj-$(CONFIG_VIRT_DRIVERS) += virt/
+
+obj-$(CONFIG_PM_DEVFREQ) += devfreq/
diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
new file mode 100644
index 0000000..1fb42de
--- /dev/null
+++ b/drivers/devfreq/Kconfig
@@ -0,0 +1,39 @@
+config ARCH_HAS_DEVFREQ
+ bool
+ depends on ARCH_HAS_OPP
+ help
+ Denotes that the architecture supports DEVFREQ. If the architecture
+ supports multiple OPP entries per device and the frequency of the
+ devices with OPPs may be altered dynamically, the architecture
+ supports DEVFREQ.
+
+menuconfig PM_DEVFREQ
+ bool "Generic Dynamic Voltage and Frequency Scaling (DVFS) support"
+ depends on PM_OPP && ARCH_HAS_DEVFREQ
+ help
+ With OPP support, a device may have a list of frequencies and
+ voltages available. DEVFREQ, a generic DVFS framework can be
+ registered for a device with OPP support in order to let the
+ governor provided to DEVFREQ choose an operating frequency
+ based on the OPP's list and the policy given with DEVFREQ.
+
+ Each device may have its own governor and policy. DEVFREQ can
+ reevaluate the device state periodically and/or based on the
+ OPP list changes (each frequency/voltage pair in OPP may be
+ disabled or enabled).
+
+ Like some CPUs with CPUFREQ, a device may have multiple clocks.
+ However, because the clock frequencies of a single device are
+ determined by the single device's state, an instance of DEVFREQ
+ is attached to a single device and returns a "representative"
+ clock frequency from the OPP of the device, which is also attached
+ to a device by 1-to-1. The device registering DEVFREQ takes the
+ responsibility to "interpret" the frequency listed in OPP and
+ to set its every clock accordingly with the "target" callback
+ given to DEVFREQ.
+
+if PM_DEVFREQ
+
+comment "DEVFREQ Drivers"
+
+endif # PM_DEVFREQ
diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
new file mode 100644
index 0000000..168934a
--- /dev/null
+++ b/drivers/devfreq/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_PM_DEVFREQ) += devfreq.o
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
new file mode 100644
index 0000000..621b863
--- /dev/null
+++ b/drivers/devfreq/devfreq.c
@@ -0,0 +1,364 @@
+/*
+ * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
+ * for Non-CPU Devices Based on OPP.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/opp.h>
+#include <linux/devfreq.h>
+#include <linux/workqueue.h>
+#include <linux/platform_device.h>
+#include <linux/list.h>
+#include <linux/printk.h>
+#include <linux/hrtimer.h>
+
+/*
+ * devfreq_work periodically monitors every registered device.
+ * The minimum polling interval is one jiffy. The polling interval is
+ * determined by the minimum polling period among all polling devfreq
+ * devices. The resolution of polling interval is one jiffy.
+ */
+static bool polling;
+static struct workqueue_struct *devfreq_wq;
+static struct delayed_work devfreq_work;
+
+/* The list of all device-devfreq */
+static LIST_HEAD(devfreq_list);
+static DEFINE_MUTEX(devfreq_list_lock);
+
+/**
+ * find_device_devfreq() - find devfreq struct using device pointer
+ * @dev: device pointer used to lookup device devfreq.
+ *
+ * Search the list of device devfreqs and return the matched device's
+ * devfreq info. devfreq_list_lock should be held by the caller.
+ */
+static struct devfreq *find_device_devfreq(struct device *dev)
+{
+ struct devfreq *tmp_devfreq;
+
+ if (unlikely(IS_ERR_OR_NULL(dev))) {
+ pr_err("DEVFREQ: %s: Invalid parameters\n", __func__);
+ return ERR_PTR(-EINVAL);
+ }
+
+ list_for_each_entry(tmp_devfreq, &devfreq_list, node) {
+ if (tmp_devfreq->dev == dev)
+ return tmp_devfreq;
+ }
+
+ return ERR_PTR(-ENODEV);
+}
+
+/**
+ * get_devfreq() - find devfreq struct. a wrapped find_device_devfreq()
+ * with mutex protection. exported for governors
+ * @dev: device pointer used to lookup device devfreq.
+ */
+struct devfreq *get_devfreq(struct device *dev)
+{
+ struct devfreq *ret;
+
+ mutex_lock(&devfreq_list_lock);
+ ret = find_device_devfreq(dev);
+ mutex_unlock(&devfreq_list_lock);
+
+ return ret;
+}
+
+/**
+ * devfreq_do() - Check the usage profile of a given device and configure
+ * frequency and voltage accordingly
+ * @devfreq: devfreq info of the given device
+ */
+static int devfreq_do(struct devfreq *devfreq)
+{
+ struct opp *opp;
+ unsigned long freq;
+ int err;
+
+ err = devfreq->governor->get_target_freq(devfreq, &freq);
+ if (err)
+ return err;
+
+ opp = opp_find_freq_ceil(devfreq->dev, &freq);
+ if (opp == ERR_PTR(-ENODEV))
+ opp = opp_find_freq_floor(devfreq->dev, &freq);
+
+ if (IS_ERR(opp))
+ return PTR_ERR(opp);
+
+ if (devfreq->previous_freq == freq)
+ return 0;
+
+ err = devfreq->profile->target(devfreq->dev, opp);
+ if (err)
+ return err;
+
+ devfreq->previous_freq = freq;
+ return 0;
+}
+
+/**
+ * update_devfreq() - Notify that the device OPP or frequency requirement
+ * has been changed. This function is exported for governors.
+ * @devfreq: the devfreq instance.
+ *
+ * Note: lock devfreq->lock before calling update_devfreq
+ */
+int update_devfreq(struct devfreq *devfreq)
+{
+ int err = 0;
+
+ if (!mutex_is_locked(&devfreq->lock)) {
+ WARN(true, "devfreq->lock must be locked by the caller.\n");
+ return -EINVAL;
+ }
+
+ /* Reevaluate the proper frequency */
+ err = devfreq_do(devfreq);
+ return err;
+}
+
+/**
+ * devfreq_update() - Notify that the device OPP has been changed.
+ * @dev: the device whose OPP has been changed.
+ *
+ * Called by OPP notifier.
+ */
+static int devfreq_update(struct notifier_block *nb, unsigned long type,
+ void *devp)
+{
+ struct devfreq *devfreq = container_of(nb, struct devfreq, nb);
+ int ret;
+
+ mutex_lock(&devfreq->lock);
+ ret = update_devfreq(devfreq);
+ mutex_unlock(&devfreq->lock);
+
+ return ret;
+}
+
+/**
+ * devfreq_monitor() - Periodically run devfreq_do()
+ * @work: the work struct used to run devfreq_monitor periodically.
+ *
+ */
+static void devfreq_monitor(struct work_struct *work)
+{
+ static unsigned long last_polled_at;
+ struct devfreq *devfreq, *tmp;
+ int error;
+ unsigned long jiffies_passed;
+ unsigned long next_jiffies = ULONG_MAX, now = jiffies;
+
+ /* Initially last_polled_at = 0, polling every device at bootup */
+ jiffies_passed = now - last_polled_at;
+ last_polled_at = now;
+ if (jiffies_passed == 0)
+ jiffies_passed = 1;
+
+ mutex_lock(&devfreq_list_lock);
+
+ list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
+ mutex_lock(&devfreq->lock);
+
+ if (devfreq->next_polling == 0) {
+ mutex_unlock(&devfreq->lock);
+ continue;
+ }
+
+ /*
+ * Reduce more next_polling if devfreq_wq took an extra
+ * delay. (i.e., CPU has been idled.)
+ */
+ if (devfreq->next_polling <= jiffies_passed) {
+ error = devfreq_do(devfreq);
+
+ /* Remove a devfreq with an error. */
+ if (error && error != -EAGAIN) {
+ dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
+ error, devfreq->governor->name);
+
+ list_del(&devfreq->node);
+ mutex_unlock(&devfreq->lock);
+ kfree(devfreq);
+ continue;
+ }
+ devfreq->next_polling = devfreq->polling_jiffies;
+
+ /* No more polling required (polling_ms changed) */
+ if (devfreq->next_polling == 0) {
+ mutex_unlock(&devfreq->lock);
+ continue;
+ }
+ } else {
+ devfreq->next_polling -= jiffies_passed;
+ }
+
+ next_jiffies = (next_jiffies > devfreq->next_polling) ?
+ devfreq->next_polling : next_jiffies;
+
+ mutex_unlock(&devfreq->lock);
+ }
+
+ if (next_jiffies > 0 && next_jiffies < ULONG_MAX) {
+ polling = true;
+ queue_delayed_work(devfreq_wq, &devfreq_work, next_jiffies);
+ } else {
+ polling = false;
+ }
+
+ mutex_unlock(&devfreq_list_lock);
+}
+
+/**
+ * devfreq_add_device() - Add devfreq feature to the device
+ * @dev: the device to add devfreq feature.
+ * @profile: device-specific profile to run devfreq.
+ * @governor: the policy to choose frequency.
+ * @data: private data for the governor. The devfreq framework does not
+ * touch this value.
+ */
+int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
+ struct devfreq_governor *governor, void *data)
+{
+ struct devfreq *devfreq;
+ struct srcu_notifier_head *nh;
+ int err = 0;
+
+ if (!dev || !profile || !governor) {
+ dev_err(dev, "%s: Invalid parameters.\n", __func__);
+ return -EINVAL;
+ }
+
+ mutex_lock(&devfreq_list_lock);
+
+ devfreq = find_device_devfreq(dev);
+ if (!IS_ERR(devfreq)) {
+ dev_err(dev, "%s: Unable to create devfreq for the device. It already has one.\n", __func__);
+ err = -EINVAL;
+ goto out;
+ }
+
+ devfreq = kzalloc(sizeof(struct devfreq), GFP_KERNEL);
+ if (!devfreq) {
+ dev_err(dev, "%s: Unable to create devfreq for the device\n",
+ __func__);
+ err = -ENOMEM;
+ goto out;
+ }
+
+ mutex_init(&devfreq->lock);
+ mutex_lock(&devfreq->lock);
+ devfreq->dev = dev;
+ devfreq->profile = profile;
+ devfreq->governor = governor;
+ devfreq->next_polling = devfreq->polling_jiffies
+ = msecs_to_jiffies(devfreq->profile->polling_ms);
+ devfreq->previous_freq = profile->initial_freq;
+ devfreq->data = data;
+
+ devfreq->nb.notifier_call = devfreq_update;
+
+ nh = opp_get_notifier(dev);
+ if (IS_ERR(nh)) {
+ err = PTR_ERR(nh);
+ goto err_opp;
+ }
+ err = srcu_notifier_chain_register(nh, &devfreq->nb);
+ if (err)
+ goto err_opp;
+
+ if (governor->init)
+ err = governor->init(devfreq);
+ if (err)
+ goto err_init;
+
+ list_add(&devfreq->node, &devfreq_list);
+
+ if (devfreq_wq && devfreq->next_polling && !polling) {
+ polling = true;
+ queue_delayed_work(devfreq_wq, &devfreq_work,
+ devfreq->next_polling);
+ }
+ mutex_unlock(&devfreq->lock);
+ goto out;
+err_init:
+ srcu_notifier_chain_unregister(nh, &devfreq->nb);
+err_opp:
+ mutex_unlock(&devfreq->lock);
+ kfree(devfreq);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return err;
+}
+
+/**
+ * devfreq_remove_device() - Remove devfreq feature from a device.
+ * @dev: the device to remove devfreq feature.
+ */
+int devfreq_remove_device(struct device *dev)
+{
+ struct devfreq *devfreq;
+ struct srcu_notifier_head *nh;
+ int err = 0;
+
+ if (!dev)
+ return -EINVAL;
+
+ mutex_lock(&devfreq_list_lock);
+ devfreq = find_device_devfreq(dev);
+ if (IS_ERR(devfreq)) {
+ err = PTR_ERR(devfreq);
+ goto out;
+ }
+
+ mutex_lock(&devfreq->lock);
+ nh = opp_get_notifier(dev);
+ if (IS_ERR(nh)) {
+ err = PTR_ERR(nh);
+ mutex_unlock(&devfreq->lock);
+ goto out;
+ }
+
+ list_del(&devfreq->node);
+
+ if (devfreq->governor->exit)
+ devfreq->governor->exit(devfreq);
+
+ srcu_notifier_chain_unregister(nh, &devfreq->nb);
+ mutex_unlock(&devfreq->lock);
+ kfree(devfreq);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return err;
+}
+
+/**
+ * devfreq_init() - Initialize data structure for devfreq framework and
+ * start polling registered devfreq devices.
+ */
+static int __init devfreq_init(void)
+{
+ mutex_lock(&devfreq_list_lock);
+ polling = false;
+ devfreq_wq = create_freezable_workqueue("devfreq_wq");
+ INIT_DELAYED_WORK_DEFERRABLE(&devfreq_work, devfreq_monitor);
+ mutex_unlock(&devfreq_list_lock);
+
+ devfreq_monitor(&devfreq_work.work);
+ return 0;
+}
+late_initcall(devfreq_init);
diff --git a/drivers/devfreq/governor.h b/drivers/devfreq/governor.h
new file mode 100644
index 0000000..9122090
--- /dev/null
+++ b/drivers/devfreq/governor.h
@@ -0,0 +1,22 @@
+/*
+ * governor.h - internal header for governors.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This header is for devfreq governors in drivers/devfreq/
+ */
+
+#ifndef _GOVERNOR_H
+#define _GOVERNOR_H
+
+extern struct devfreq *get_devfreq(struct device *dev);
+
+/* (Mandatory) Lock devfreq->lock before calling update_devfreq */
+extern int update_devfreq(struct devfreq *devfreq);
+
+#endif /* _GOVERNOR_H */
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
new file mode 100644
index 0000000..f14b57d
--- /dev/null
+++ b/include/linux/devfreq.h
@@ -0,0 +1,123 @@
+/*
+ * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
+ * for Non-CPU Devices Based on OPP.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __LINUX_DEVFREQ_H__
+#define __LINUX_DEVFREQ_H__
+
+#include <linux/device.h>
+#include <linux/notifier.h>
+#include <linux/opp.h>
+
+#define DEVFREQ_NAME_LEN 16
+
+struct devfreq;
+struct devfreq_dev_status {
+ /* both since the last measure */
+ unsigned long total_time;
+ unsigned long busy_time;
+ unsigned long current_frequency;
+};
+
+struct devfreq_dev_profile {
+ unsigned long max_freq; /* may be larger than the actual value */
+ unsigned long initial_freq;
+ unsigned int polling_ms; /* 0 for at opp change only */
+
+ int (*target)(struct device *dev, struct opp *opp);
+ int (*get_dev_status)(struct device *dev,
+ struct devfreq_dev_status *stat);
+};
+
+/**
+ * struct devfreq_governor - Devfreq policy governor
+ * @name Governor's name
+ * @get_target_freq Returns desired operating frequency for the device.
+ * Basically, get_target_freq will run
+ * devfreq_dev_profile.get_dev_status() to get the
+ * status of the device (load = busy_time / total_time).
+ * @init Called when the devfreq is being attached to a device
+ * @exit Called when the devfreq is being removed from a device
+ *
+ * Note that the callbacks are called with devfreq->lock locked by devfreq.
+ */
+struct devfreq_governor {
+ char name[DEVFREQ_NAME_LEN];
+ int (*get_target_freq)(struct devfreq *this, unsigned long *freq);
+ int (*init)(struct devfreq *this);
+ void (*exit)(struct devfreq *this);
+};
+
+/**
+ * struct devfreq - Device devfreq structure
+ * @node list node - contains the devices with devfreq that have been
+ * registered.
+ * @lock a mutex to protect accessing devfreq.
+ * @dev device pointer
+ * @profile device-specific devfreq profile
+ * @governor method how to choose frequency based on the usage.
+ * @nb notifier block registered to the corresponding OPP to get
+ * notified for frequency availability updates.
+ * @polling_jiffies interval in jiffies.
+ * @previous_freq previously configured frequency value.
+ * @next_polling the number of remaining jiffies to poll with
+ * "devfreq_monitor" executions to reevaluate
+ * frequency/voltage of the device. Set by
+ * profile's polling_ms interval.
+ * @data Private data of the governor. The devfreq framework does not
+ * touch this.
+ *
+ * This structure stores the devfreq information for a given device.
+ *
+ * Note that when a governor accesses entries in struct devfreq in its
+ * functions except for the context of callbacks defined in struct
+ * devfreq_governor, the governor should protect its access with the
+ * struct mutex lock in struct devfreq. A governor may use this mutex
+ * to protect its own private data in void *data as well.
+ */
+struct devfreq {
+ struct list_head node;
+
+ struct mutex lock;
+ struct device *dev;
+ struct devfreq_dev_profile *profile;
+ struct devfreq_governor *governor;
+ struct notifier_block nb;
+
+ unsigned long polling_jiffies;
+ unsigned long previous_freq;
+ unsigned int next_polling;
+
+ void *data; /* private data for governors */
+};
+
+#if defined(CONFIG_PM_DEVFREQ)
+extern int devfreq_add_device(struct device *dev,
+ struct devfreq_dev_profile *profile,
+ struct devfreq_governor *governor,
+ void *data);
+extern int devfreq_remove_device(struct device *dev);
+#else /* !CONFIG_PM_DEVFREQ */
+static inline int devfreq_add_device(struct device *dev,
+ struct devfreq_dev_profile *profile,
+ struct devfreq_governor *governor,
+ void *data)
+{
+ return 0;
+}
+
+static inline int devfreq_remove_device(struct device *dev)
+{
+ return 0;
+}
+#endif /* CONFIG_PM_DEVFREQ */
+
+#endif /* __LINUX_DEVFREQ_H__ */
--
1.7.4.1
* [PATCH v9 3/4] PM / devfreq: add common sysfs interfaces
From: MyungJoo Ham @ 2011-08-31 7:29 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1314775779-21399-1-git-send-email-myungjoo.ham@samsung.com>
Device specific sysfs interface /sys/devices/.../power/devfreq_*
- governor R: name of governor
- cur_freq R: current frequency
- max_freq R: maximum operable frequency
- min_freq R: minimum operable frequency
- polling_interval R: polling interval in ms given with devfreq profile
W: update polling interval.
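As a usage illustration, a userspace program could read one of these
attributes roughly as sketched below. The helper name read_ulong_attr is
hypothetical, and the sketch assumes the attribute prints a single decimal
number, which is not shown in this part of the patch. The caller supplies the
full path, since the device-specific part of /sys/devices/.../power/ depends
on the platform.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success or a negative errno-style value on failure.
 */
static int read_ulong_attr(const char *path, unsigned long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -errno;

	/* Assumption: the file holds one decimal number. */
	ok = (fscanf(f, "%lu", out) == 1);
	fclose(f);

	return ok ? 0 : -EINVAL;
}
```

A caller would pass the path of, for example, the devfreq_cur_freq attribute
of its device and check the return value before using the result.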
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
--
Changes from v8
- applied per-devfreq locking mechanism
Changes from v7
- removed set_freq from the common devfreq interface
- added get_devfreq, a mutex-protected wrapper for find_device_devfreq
(for sysfs interfaces and later with governor-support)
- corrected ABI documentation.
Changes from v6
- polling_interval is writable.
Changes from v5
- updated devfreq_update usage.
Changes from v4
- removed system-wide sysfs interface
- removed tickling sysfs interface
- added set_freq for userspace governor (and any other governors that
require user input)
Changes from v3
- corrected sysfs API usage
- corrected error messages
- moved sysfs entry location
- added sysfs entries
Changes from v2
- add ABI entries for devfreq sysfs interface
---
Documentation/ABI/testing/sysfs-devices-power | 37 +++++
drivers/devfreq/devfreq.c | 203 +++++++++++++++++++++++++
2 files changed, 240 insertions(+), 0 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
index 8ffbc25..57f4591 100644
--- a/Documentation/ABI/testing/sysfs-devices-power
+++ b/Documentation/ABI/testing/sysfs-devices-power
@@ -165,3 +165,40 @@ Description:
Not all drivers support this attribute. If it isn't supported,
attempts to read or write it will yield I/O errors.
+
+What: /sys/devices/.../power/devfreq_governor
+Date: July 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_governor shows the name
+ of the governor used by the corresponding device.
+
+What: /sys/devices/.../power/devfreq_cur_freq
+Date: July 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_cur_freq shows the current
+ frequency of the corresponding device.
+
+What: /sys/devices/.../power/devfreq_max_freq
+Date: July 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_max_freq shows the
+ maximum operable frequency of the corresponding device.
+
+What: /sys/devices/.../power/devfreq_min_freq
+Date: July 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_min_freq shows the
+ minimum operable frequency of the corresponding device.
+
+What: /sys/devices/.../power/devfreq_polling_interval
+Date: July 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_polling_interval sets and
+ shows the requested polling interval of the corresponding
+ device. The values are represented in ms. If the value is less
+ than 1 jiffy, it is considered to be 0, which means no polling.
diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
index 621b863..1c46052 100644
--- a/drivers/devfreq/devfreq.c
+++ b/drivers/devfreq/devfreq.c
@@ -37,6 +37,8 @@ static struct delayed_work devfreq_work;
static LIST_HEAD(devfreq_list);
static DEFINE_MUTEX(devfreq_list_lock);
+static struct attribute_group dev_attr_group;
+
/**
* find_device_devfreq() - find devfreq struct using device pointer
* @dev: device pointer used to lookup device devfreq.
@@ -191,6 +193,8 @@ static void devfreq_monitor(struct work_struct *work)
dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
error, devfreq->governor->name);
+ sysfs_unmerge_group(&devfreq->dev->kobj,
+ &dev_attr_group);
list_del(&devfreq->node);
mutex_unlock(&devfreq->lock);
kfree(devfreq);
@@ -293,6 +297,8 @@ int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
queue_delayed_work(devfreq_wq, &devfreq_work,
devfreq->next_polling);
}
+
+ sysfs_merge_group(&dev->kobj, &dev_attr_group);
mutex_unlock(&devfreq->lock);
goto out;
err_init:
@@ -333,6 +339,8 @@ int devfreq_remove_device(struct device *dev)
goto out;
}
+ sysfs_unmerge_group(&dev->kobj, &dev_attr_group);
+
list_del(&devfreq->node);
if (devfreq->governor->exit)
@@ -346,6 +354,201 @@ out:
return 0;
}
+static ssize_t show_governor(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct devfreq *df;
+ ssize_t ret;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ ret = PTR_ERR(df);
+ goto out;
+ }
+
+ mutex_lock(&df->lock);
+ if (!df->governor) {
+ ret = -EINVAL;
+ goto out_l;
+ }
+
+ ret = sprintf(buf, "%s\n", df->governor->name);
+out_l:
+ mutex_unlock(&df->lock);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return ret;
+}
+
+static ssize_t show_freq(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct devfreq *df;
+ ssize_t ret;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ ret = PTR_ERR(df);
+ goto out;
+ }
+
+ ret = sprintf(buf, "%lu\n", df->previous_freq);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return ret;
+}
+
+static ssize_t show_max_freq(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct devfreq *df;
+ ssize_t ret;
+ unsigned long freq = ULONG_MAX;
+ struct opp *opp;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ ret = PTR_ERR(df);
+ goto out;
+ }
+
+ mutex_lock(&df->lock);
+ opp = opp_find_freq_floor(df->dev, &freq);
+ if (IS_ERR(opp)) {
+ ret = PTR_ERR(opp);
+ goto out_l;
+ }
+
+ ret = sprintf(buf, "%lu\n", freq);
+out_l:
+ mutex_unlock(&df->lock);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return ret;
+}
+
+static ssize_t show_min_freq(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct devfreq *df;
+ ssize_t ret;
+ unsigned long freq = 0;
+ struct opp *opp;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ ret = PTR_ERR(df);
+ goto out;
+ }
+
+ mutex_lock(&df->lock);
+ opp = opp_find_freq_ceil(df->dev, &freq);
+ if (IS_ERR(opp)) {
+ ret = PTR_ERR(opp);
+ goto out_l;
+ }
+
+ ret = sprintf(buf, "%lu\n", freq);
+out_l:
+ mutex_unlock(&df->lock);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return ret;
+}
+
+static ssize_t show_polling_interval(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct devfreq *df;
+ ssize_t ret;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ ret = PTR_ERR(df);
+ goto out;
+ }
+
+ mutex_lock(&df->lock);
+ if (!df->profile) {
+ ret = -EINVAL;
+ goto out_l;
+ }
+
+ ret = sprintf(buf, "%d\n", df->profile->polling_ms);
+out_l:
+ mutex_unlock(&df->lock);
+out:
+ mutex_unlock(&devfreq_list_lock);
+ return ret;
+}
+
+static ssize_t store_polling_interval(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct devfreq *df;
+ unsigned int value;
+ int ret;
+
+ mutex_lock(&devfreq_list_lock);
+ df = find_device_devfreq(dev);
+ if (IS_ERR(df)) {
+ count = PTR_ERR(df);
+ goto out;
+ }
+ mutex_lock(&df->lock);
+ if (!df->profile) {
+ count = -EINVAL;
+ goto out_l;
+ }
+
+ ret = sscanf(buf, "%u", &value);
+ if (ret != 1) {
+ count = -EINVAL;
+ goto out_l;
+ }
+
+ df->profile->polling_ms = value;
+ df->next_polling = df->polling_jiffies
+ = msecs_to_jiffies(value);
+
+ if (df->next_polling > 0 && !polling) {
+ polling = true;
+ queue_delayed_work(devfreq_wq, &devfreq_work,
+ df->next_polling);
+ }
+out_l:
+ mutex_unlock(&df->lock);
+out:
+ mutex_unlock(&devfreq_list_lock);
+
+ return count;
+}
+
+static DEVICE_ATTR(devfreq_governor, 0444, show_governor, NULL);
+static DEVICE_ATTR(devfreq_cur_freq, 0444, show_freq, NULL);
+static DEVICE_ATTR(devfreq_max_freq, 0444, show_max_freq, NULL);
+static DEVICE_ATTR(devfreq_min_freq, 0444, show_min_freq, NULL);
+static DEVICE_ATTR(devfreq_polling_interval, 0644, show_polling_interval,
+ store_polling_interval);
+static struct attribute *dev_entries[] = {
+ &dev_attr_devfreq_governor.attr,
+ &dev_attr_devfreq_cur_freq.attr,
+ &dev_attr_devfreq_max_freq.attr,
+ &dev_attr_devfreq_min_freq.attr,
+ &dev_attr_devfreq_polling_interval.attr,
+ NULL,
+};
+static struct attribute_group dev_attr_group = {
+ .name = power_group_name,
+ .attrs = dev_entries,
+};
+
/**
* devfreq_init() - Initialize data structure for devfreq framework and
* start polling registered devfreq devices.
--
1.7.4.1
^ permalink raw reply related
* [PATCH v9 4/4] PM / devfreq: add basic governors
From: MyungJoo Ham @ 2011-08-31 7:29 UTC (permalink / raw)
To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1314775779-21399-1-git-send-email-myungjoo.ham@samsung.com>
Four cpufreq-like governors are provided as examples.
powersave: use the lowest frequency possible. The user (device) should
set the polling_ms as 0 because polling is useless for this governor.
performance: use the highest frequency possible. The user (device)
should set the polling_ms as 0 because polling is useless for this
governor.
userspace: use the user specified frequency stored at
devfreq.user_set_freq. With sysfs support in the following patch, a user
may set the value with the sysfs interface.
simple_ondemand: simplified version of cpufreq's ondemand governor.
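
The decision made by simple_ondemand can be sketched as a userspace model. This is an illustration only, under stated assumptions: the function name ondemand_target is hypothetical, the thresholds are fixed at the patch's defaults (90 and 5) rather than read from per-device data, and the patch's overflow guard (shifting large counters right by 7) is left out because the model uses 64-bit arithmetic throughout.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define UPTHRESHOLD      90  /* default up-threshold from the patch */
#define DOWNDIFFERENTIAL 5   /* default down-differential from the patch */

/*
 * Userspace model of the simple_ondemand target computation.
 * busy/total are the device's busy and total time counters,
 * cur_freq is the current frequency (0 means "unknown").
 * UINT_MAX is returned to request the highest available frequency.
 */
unsigned long ondemand_target(uint64_t busy, uint64_t total,
                              unsigned long cur_freq)
{
    uint64_t a, b;

    assert(busy <= total);

    if (total == 0)
        return UINT_MAX;  /* avoid dividing by zero: assume MAX */
    if (busy * 100 > total * UPTHRESHOLD)
        return UINT_MAX;  /* busy enough: jump to MAX */
    if (cur_freq == 0)
        return UINT_MAX;  /* initial frequency unknown: MAX */
    if (busy * 100 > total * (UPTHRESHOLD - DOWNDIFFERENTIAL))
        return cur_freq;  /* inside the hysteresis band: keep */

    /* Scale the current frequency by the load. */
    a = busy * cur_freq;
    b = a / total;
    b *= 100;
    /* Integer division: 90 - 5 / 2 is 90 - 2 = 88, as in the patch. */
    b /= (UPTHRESHOLD - DOWNDIFFERENTIAL / 2);
    return (unsigned long)b;
}
```

For example, a device that was 40% busy at a current frequency of 400000 gets a target of 181818; the framework is then expected to map that value onto an available OPP.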
When a user updates OPP entries (enable/disable/add), OPP framework
automatically notifies devfreq to update operating frequency
accordingly. Thus, devfreq users (device drivers) do not need to update
devfreq manually with OPP entry updates or set polling_ms for powersave,
performance, userspace, or any other "static" governors.
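
Why the static governors follow OPP changes without polling can be shown with a small model. This is a sketch, not the OPP API: freq_floor and freq_ceil are hypothetical stand-ins for the floor and ceiling lookups that the patch comments say devfreq_do performs, and the table is a plain array of enabled, nonzero frequencies.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Largest table entry <= want, or 0 if there is none ("floor"). */
unsigned long freq_floor(const unsigned long *tbl, size_t n,
                         unsigned long want)
{
    unsigned long best = 0;

    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (tbl[i] <= want && tbl[i] > best)
            best = tbl[i];
    return best;
}

/* Smallest table entry >= want, or 0 if there is none ("ceiling"). */
unsigned long freq_ceil(const unsigned long *tbl, size_t n,
                        unsigned long want)
{
    unsigned long best = 0;

    assert(tbl != NULL);
    for (size_t i = 0; i < n; i++)
        if (tbl[i] >= want && (best == 0 || tbl[i] < best))
            best = tbl[i];
    return best;
}
```

The performance governor always asks for UINT_MAX and powersave always asks for 0, so the floor of the former and the ceiling of the latter track whatever entries are currently enabled. If the highest entry is disabled, the same UINT_MAX request simply resolves to the next one down.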
Note that these are given only as basic examples for governors and any
devices with devfreq may implement their own governors with the drivers
and use them.
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
Changed from v8
- Removed unnecessary header entries
Changed from v7
- Userspace uses its own sysfs interface.
Changed from v5
- Separated governor files from devfreq.c
- Allow simple ondemand to be tuned for each device
---
Documentation/ABI/testing/sysfs-devices-power | 9 ++
drivers/devfreq/Kconfig | 36 +++++++
drivers/devfreq/Makefile | 4 +
drivers/devfreq/governor_performance.c | 24 +++++
drivers/devfreq/governor_powersave.c | 24 +++++
drivers/devfreq/governor_simpleondemand.c | 88 +++++++++++++++++
drivers/devfreq/governor_userspace.c | 126 +++++++++++++++++++++++++
include/linux/devfreq.h | 37 +++++++
8 files changed, 348 insertions(+), 0 deletions(-)
create mode 100644 drivers/devfreq/governor_performance.c
create mode 100644 drivers/devfreq/governor_powersave.c
create mode 100644 drivers/devfreq/governor_simpleondemand.c
create mode 100644 drivers/devfreq/governor_userspace.c
diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
index 57f4591..c7f6977 100644
--- a/Documentation/ABI/testing/sysfs-devices-power
+++ b/Documentation/ABI/testing/sysfs-devices-power
@@ -202,3 +202,12 @@ Description:
shows the requested polling interval of the corresponding
device. The values are represented in ms. If the value is less
than 1 jiffy, it is considered to be 0, which means no polling.
+
+What: /sys/devices/.../power/devfreq_userspace_set_freq
+Date: August 2011
+Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+ The /sys/devices/.../power/devfreq_userspace_set_freq sets
+ and shows the user specified frequency in kHz. This sysfs
+ entry is created and managed by userspace DEVFREQ governor.
+ If other governors are used, it won't be supported.
diff --git a/drivers/devfreq/Kconfig b/drivers/devfreq/Kconfig
index 1fb42de..643b055 100644
--- a/drivers/devfreq/Kconfig
+++ b/drivers/devfreq/Kconfig
@@ -34,6 +34,42 @@ menuconfig PM_DEVFREQ
if PM_DEVFREQ
+comment "DEVFREQ Governors"
+
+config DEVFREQ_GOV_SIMPLE_ONDEMAND
+ bool "Simple Ondemand"
+ help
+ Chooses frequency based on the recent load on the device. Works
+ similarly to the ONDEMAND governor of CPUFREQ. A device with
+ Simple-Ondemand should be able to provide busy/total counter
+ values that imply the usage rate. A device may provide tuned
+ values to the governor with data field at devfreq_add_device().
+
+config DEVFREQ_GOV_PERFORMANCE
+ bool "Performance"
+ help
+ Sets the frequency at the maximum available frequency.
+ This governor always returns UINT_MAX as frequency so that
+ the DEVFREQ framework returns the highest frequency available
+ at any time.
+
+config DEVFREQ_GOV_POWERSAVE
+ bool "Powersave"
+ help
+ Sets the frequency at the minimum available frequency.
+ This governor always returns 0 as frequency so that
+ the DEVFREQ framework returns the lowest frequency available
+ at any time.
+
+config DEVFREQ_GOV_USERSPACE
+ bool "Userspace"
+ help
+ Sets the frequency at the user specified one.
+ This governor returns the user configured frequency if there
+ has been an input to /sys/devices/.../power/devfreq_userspace_set_freq.
+ Otherwise, the governor does not change the frequency
+ given at the initialization.
+
comment "DEVFREQ Drivers"
endif # PM_DEVFREQ
diff --git a/drivers/devfreq/Makefile b/drivers/devfreq/Makefile
index 168934a..4564a89 100644
--- a/drivers/devfreq/Makefile
+++ b/drivers/devfreq/Makefile
@@ -1 +1,5 @@
obj-$(CONFIG_PM_DEVFREQ) += devfreq.o
+obj-$(CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND) += governor_simpleondemand.o
+obj-$(CONFIG_DEVFREQ_GOV_PERFORMANCE) += governor_performance.o
+obj-$(CONFIG_DEVFREQ_GOV_POWERSAVE) += governor_powersave.o
+obj-$(CONFIG_DEVFREQ_GOV_USERSPACE) += governor_userspace.o
diff --git a/drivers/devfreq/governor_performance.c b/drivers/devfreq/governor_performance.c
new file mode 100644
index 0000000..c47eff8
--- /dev/null
+++ b/drivers/devfreq/governor_performance.c
@@ -0,0 +1,24 @@
+/*
+ * linux/drivers/devfreq/governor_performance.c
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/devfreq.h>
+
+static int devfreq_performance_func(struct devfreq *df,
+ unsigned long *freq)
+{
+ *freq = UINT_MAX; /* devfreq_do will run "floor" */
+ return 0;
+}
+
+struct devfreq_governor devfreq_performance = {
+ .name = "performance",
+ .get_target_freq = devfreq_performance_func,
+};
diff --git a/drivers/devfreq/governor_powersave.c b/drivers/devfreq/governor_powersave.c
new file mode 100644
index 0000000..4f128d8
--- /dev/null
+++ b/drivers/devfreq/governor_powersave.c
@@ -0,0 +1,24 @@
+/*
+ * linux/drivers/devfreq/governor_powersave.c
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/devfreq.h>
+
+static int devfreq_powersave_func(struct devfreq *df,
+ unsigned long *freq)
+{
+ *freq = 0; /* devfreq_do will run "ceiling" to 0 */
+ return 0;
+}
+
+struct devfreq_governor devfreq_powersave = {
+ .name = "powersave",
+ .get_target_freq = devfreq_powersave_func,
+};
diff --git a/drivers/devfreq/governor_simpleondemand.c b/drivers/devfreq/governor_simpleondemand.c
new file mode 100644
index 0000000..18fe8be
--- /dev/null
+++ b/drivers/devfreq/governor_simpleondemand.c
@@ -0,0 +1,88 @@
+/*
+ * linux/drivers/devfreq/governor_simpleondemand.c
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/errno.h>
+#include <linux/devfreq.h>
+#include <linux/math64.h>
+
+/* Default constants for DevFreq-Simple-Ondemand (DFSO) */
+#define DFSO_UPTHRESHOLD (90)
+#define DFSO_DOWNDIFFERENCTIAL (5)
+static int devfreq_simple_ondemand_func(struct devfreq *df,
+ unsigned long *freq)
+{
+ struct devfreq_dev_status stat;
+ int err = df->profile->get_dev_status(df->dev, &stat);
+ unsigned long long a, b;
+ unsigned int dfso_upthreshold = DFSO_UPTHRESHOLD;
+ unsigned int dfso_downdifferential = DFSO_DOWNDIFFERENCTIAL;
+ struct devfreq_simple_ondemand_data *data = df->data;
+
+ if (err)
+ return err;
+
+ if (data) {
+ if (data->upthreshold)
+ dfso_upthreshold = data->upthreshold;
+ if (data->downdifferential)
+ dfso_downdifferential = data->downdifferential;
+ }
+ if (dfso_upthreshold > 100 ||
+ dfso_upthreshold < dfso_downdifferential)
+ return -EINVAL;
+
+ /* Assume MAX if it is going to be divided by zero */
+ if (stat.total_time == 0) {
+ *freq = UINT_MAX;
+ return 0;
+ }
+
+ /* Prevent overflow */
+ if (stat.busy_time >= (1 << 24) || stat.total_time >= (1 << 24)) {
+ stat.busy_time >>= 7;
+ stat.total_time >>= 7;
+ }
+
+ /* Set MAX if it's busy enough */
+ if (stat.busy_time * 100 >
+ stat.total_time * dfso_upthreshold) {
+ *freq = UINT_MAX;
+ return 0;
+ }
+
+ /* Set MAX if we do not know the initial frequency */
+ if (stat.current_frequency == 0) {
+ *freq = UINT_MAX;
+ return 0;
+ }
+
+ /* Keep the current frequency */
+ if (stat.busy_time * 100 >
+ stat.total_time * (dfso_upthreshold - dfso_downdifferential)) {
+ *freq = stat.current_frequency;
+ return 0;
+ }
+
+ /* Set the desired frequency based on the load */
+ a = stat.busy_time;
+ a *= stat.current_frequency;
+ b = div_u64(a, stat.total_time);
+ b *= 100;
+ b = div_u64(b, (dfso_upthreshold - dfso_downdifferential / 2));
+ *freq = (unsigned long) b;
+
+ return 0;
+}
+
+struct devfreq_governor devfreq_simple_ondemand = {
+ .name = "simple_ondemand",
+ .get_target_freq = devfreq_simple_ondemand_func,
+};
diff --git a/drivers/devfreq/governor_userspace.c b/drivers/devfreq/governor_userspace.c
new file mode 100644
index 0000000..490167f
--- /dev/null
+++ b/drivers/devfreq/governor_userspace.c
@@ -0,0 +1,126 @@
+/*
+ * linux/drivers/devfreq/governor_userspace.c
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ * MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/slab.h>
+#include <linux/device.h>
+#include <linux/devfreq.h>
+#include <linux/pm.h>
+#include <linux/mutex.h>
+#include "governor.h"
+
+struct userspace_data {
+ unsigned long user_frequency;
+ bool valid;
+};
+
+static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq)
+{
+ struct userspace_data *data = df->data;
+
+ if (!data->valid)
+ *freq = df->previous_freq; /* No user freq specified yet */
+ else
+ *freq = data->user_frequency;
+ return 0;
+}
+
+static ssize_t store_freq(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct devfreq *devfreq = get_devfreq(dev);
+ struct userspace_data *data;
+ unsigned long wanted;
+ int err = 0;
+
+ if (IS_ERR(devfreq)) {
+ err = PTR_ERR(devfreq);
+ goto out;
+ }
+
+ mutex_lock(&devfreq->lock);
+ data = devfreq->data;
+
+ sscanf(buf, "%lu", &wanted);
+ data->user_frequency = wanted;
+ data->valid = true;
+ err = update_devfreq(devfreq);
+ if (err == 0)
+ err = count;
+ mutex_unlock(&devfreq->lock);
+out:
+ return err;
+}
+
+static ssize_t show_freq(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ struct devfreq *devfreq = get_devfreq(dev);
+ struct userspace_data *data;
+ int err = 0;
+
+ if (IS_ERR(devfreq)) {
+ err = PTR_ERR(devfreq);
+ goto out;
+ }
+
+ mutex_lock(&devfreq->lock);
+ data = devfreq->data;
+
+ if (data->valid)
+ err = sprintf(buf, "%lu\n", data->user_frequency);
+ else
+ err = sprintf(buf, "undefined\n");
+ mutex_unlock(&devfreq->lock);
+out:
+ return err;
+}
+
+static DEVICE_ATTR(devfreq_userspace_set_freq, 0644, show_freq, store_freq);
+static struct attribute *dev_entries[] = {
+ &dev_attr_devfreq_userspace_set_freq.attr,
+ NULL,
+};
+static struct attribute_group dev_attr_group = {
+ .name = power_group_name,
+ .attrs = dev_entries,
+};
+
+static int userspace_init(struct devfreq *devfreq)
+{
+ int err = 0;
+ struct userspace_data *data = kzalloc(sizeof(struct userspace_data),
+ GFP_KERNEL);
+
+ if (!data) {
+ err = -ENOMEM;
+ goto out;
+ }
+ data->valid = false;
+ devfreq->data = data;
+
+ sysfs_merge_group(&devfreq->dev->kobj, &dev_attr_group);
+out:
+ return err;
+}
+
+static void userspace_exit(struct devfreq *devfreq)
+{
+ sysfs_unmerge_group(&devfreq->dev->kobj, &dev_attr_group);
+ kfree(devfreq->data);
+ devfreq->data = NULL;
+}
+
+struct devfreq_governor devfreq_userspace = {
+ .name = "userspace",
+ .get_target_freq = devfreq_userspace_func,
+ .init = userspace_init,
+ .exit = userspace_exit,
+};
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index f14b57d..5b802a6 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -105,6 +105,37 @@ extern int devfreq_add_device(struct device *dev,
struct devfreq_governor *governor,
void *data);
extern int devfreq_remove_device(struct device *dev);
+
+#ifdef CONFIG_DEVFREQ_GOV_POWERSAVE
+extern struct devfreq_governor devfreq_powersave;
+#endif
+#ifdef CONFIG_DEVFREQ_GOV_PERFORMANCE
+extern struct devfreq_governor devfreq_performance;
+#endif
+#ifdef CONFIG_DEVFREQ_GOV_USERSPACE
+extern struct devfreq_governor devfreq_userspace;
+#endif
+#ifdef CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND
+extern struct devfreq_governor devfreq_simple_ondemand;
+/**
+ * struct devfreq_simple_ondemand_data - void *data fed to struct devfreq
+ * and devfreq_add_device
+ * @ upthreshold If the load is over this value, the frequency jumps.
+ * Specify 0 to use the default. Valid value = 0 to 100.
+ * @ downdifferential If the load is under upthreshold - downdifferential,
+ * the governor may consider slowing the frequency down.
+ * Specify 0 to use the default. Valid value = 0 to 100.
+ * downdifferential < upthreshold must hold.
+ *
+ * If the fed devfreq_simple_ondemand_data pointer is NULL to the governor,
+ * the governor uses the default values.
+ */
+struct devfreq_simple_ondemand_data {
+ unsigned int upthreshold;
+ unsigned int downdifferential;
+};
+#endif
+
#else /* !CONFIG_PM_DEVFREQ */
static int devfreq_add_device(struct device *dev,
struct devfreq_dev_profile *profile,
@@ -118,6 +149,12 @@ static int devfreq_remove_device(struct device *dev)
{
return 0;
}
+
+#define devfreq_powersave NULL
+#define devfreq_performance NULL
+#define devfreq_userspace NULL
+#define devfreq_simple_ondemand NULL
+
#endif /* CONFIG_PM_DEVFREQ */
#endif /* __LINUX_DEVFREQ_H__ */
--
1.7.4.1
^ permalink raw reply related
* [PATCH pm-freezer 1/4] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Tejun Heo @ 2011-08-31 10:21 UTC (permalink / raw)
To: Rafael J. Wysocki, Oleg Nesterov, Paul Menage
Cc: containers, linux-pm, linux-kernel
d02f52811d0e "cgroup_freezer: prepare for removal of TIF_FREEZE" moved
setting of freezer->state into freezer_change_state(); unfortunately,
while doing so, when it's beginning to freeze tasks, it sets the state
to CGROUP_FROZEN instead of CGROUP_FREEZING ending up skipping the
whole freezing state. Fix it.
-v2: Oleg pointed out that re-freezing FROZEN cgroup could increment
system_freezing_cnt. Fixed.
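
The intended transitions can be checked with a small userspace model. This is only a sketch: struct model, change_state and the freezing_cnt field are hypothetical stand-ins for the cgroup freezer state and system_freezing_cnt, and the promotion from FREEZING to FROZEN (done by update_if_frozen() in the kernel) is left to the caller.

```c
#include <assert.h>

enum fstate { THAWED, FREEZING, FROZEN };

struct model {
    enum fstate state;
    int freezing_cnt;  /* stands in for system_freezing_cnt */
};

/* Apply a goal of THAWED or FROZEN, mirroring the fixed switch. */
void change_state(struct model *m, enum fstate goal)
{
    switch (goal) {
    case THAWED:
        /* Only drop the count when actually leaving a frozen state. */
        if (m->state != THAWED) {
            m->state = THAWED;
            m->freezing_cnt--;
        }
        break;
    case FROZEN:
        /* Start freezing only from THAWED; enter FREEZING, not FROZEN. */
        if (m->state == THAWED) {
            m->state = FREEZING;
            m->freezing_cnt++;
        }
        break;
    default:
        assert(0 && "invalid goal state");  /* the kernel uses BUG() */
    }
}
```

Requesting FROZEN on a group that is already FREEZING or FROZEN leaves both the state and the counter untouched, which is the v2 fix described above.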
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
---
I'm in the process of moving and can only use a quite old laptop. I
tested compile but couldn't really do much else, so please proceed
with caution. Oleg, can you please ack the patches if you agree with
the updated versions?
Thanks.
kernel/cgroup_freezer.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
Index: work/kernel/cgroup_freezer.c
===================================================================
--- work.orig/kernel/cgroup_freezer.c
+++ work/kernel/cgroup_freezer.c
@@ -308,24 +308,26 @@ static int freezer_change_state(struct c
spin_lock_irq(&freezer->lock);
update_if_frozen(cgroup, freezer);
- if (goal_state == freezer->state)
- goto out;
-
- freezer->state = goal_state;
switch (goal_state) {
case CGROUP_THAWED:
- atomic_dec(&system_freezing_cnt);
- unfreeze_cgroup(cgroup, freezer);
+ if (freezer->state != CGROUP_THAWED) {
+ freezer->state = CGROUP_THAWED;
+ atomic_dec(&system_freezing_cnt);
+ unfreeze_cgroup(cgroup, freezer);
+ }
break;
case CGROUP_FROZEN:
- atomic_inc(&system_freezing_cnt);
- retval = try_to_freeze_cgroup(cgroup, freezer);
+ if (freezer->state == CGROUP_THAWED) {
+ freezer->state = CGROUP_FREEZING;
+ atomic_inc(&system_freezing_cnt);
+ retval = try_to_freeze_cgroup(cgroup, freezer);
+ }
break;
default:
BUG();
}
-out:
+
spin_unlock_irq(&freezer->lock);
return retval;
^ permalink raw reply
* [PATCH pm-freezer 2/4] freezer: set PF_NOFREEZE on a dying task right before TASK_DEAD
From: Tejun Heo @ 2011-08-31 10:21 UTC (permalink / raw)
To: Rafael J. Wysocki, Oleg Nesterov, Paul Menage
Cc: containers, linux-pm, linux-kernel
In-Reply-To: <20110831102100.GA2828@mtj.dyndns.org>
3fb45733df "freezer: make exiting tasks properly unfreezable" removed
clear_freeze_flag() from exit path and set PF_NOFREEZE right after
PTRACE_EVENT_EXIT; however, Oleg pointed out that following exit paths
may cause interaction with device drivers after the PM freezer considers
the system frozen.
There's no try_to_freeze() call in the exit path and the only
necessary guarantee is that freezer doesn't hang waiting for zombies.
Set PF_NOFREEZE right before setting tsk->state to TASK_DEAD instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
---
kernel/exit.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
Index: work/kernel/exit.c
===================================================================
--- work.orig/kernel/exit.c
+++ work/kernel/exit.c
@@ -913,12 +913,6 @@ NORET_TYPE void do_exit(long code)
ptrace_event(PTRACE_EVENT_EXIT, code);
- /*
- * With ptrace notification done, there's no point in freezing from
- * here on. Disallow freezing.
- */
- current->flags |= PF_NOFREEZE;
-
validate_creds_for_do_exit(tsk);
/*
@@ -1044,6 +1038,10 @@ NORET_TYPE void do_exit(long code)
preempt_disable();
exit_rcu();
+
+ /* this task is now dead and freezer should ignore it */
+ current->flags |= PF_NOFREEZE;
+
/* causes final put_task_struct in finish_task_switch(). */
tsk->state = TASK_DEAD;
schedule();
^ permalink raw reply
* [PATCH pm-freezer 3/4] freezer: restructure __refrigerator()
From: Tejun Heo @ 2011-08-31 10:22 UTC (permalink / raw)
To: Rafael J. Wysocki, Oleg Nesterov, Paul Menage
Cc: containers, linux-pm, linux-kernel
In-Reply-To: <20110831102143.GB2828@mtj.dyndns.org>
If another freeze happens before all tasks leave FROZEN state after
being thawed, the freezer can see the existing FROZEN and consider the
tasks to be frozen but they can clear FROZEN without checking the new
freezing().
Oleg suggested restructuring __refrigerator() such that there's a single
condition check section inside freezer_lock and sigpending is cleared
afterwards, which fixes the problem and simplifies the code.
Restructure accordingly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
---
kernel/freezer.c | 33 ++++++++++++++-------------------
1 file changed, 14 insertions(+), 19 deletions(-)
Index: work/kernel/freezer.c
===================================================================
--- work.orig/kernel/freezer.c
+++ work/kernel/freezer.c
@@ -52,36 +52,31 @@ bool __refrigerator(bool check_kthr_stop
/* Hmm, should we be allowed to suspend when there are realtime
processes around? */
bool was_frozen = false;
- long save;
+ long save = current->state;
- /*
- * No point in checking freezing() again - the caller already did.
- * Proceed to enter FROZEN.
- */
- spin_lock_irq(&freezer_lock);
- current->flags |= PF_FROZEN;
- spin_unlock_irq(&freezer_lock);
-
- save = current->state;
pr_debug("%s entered refrigerator\n", current->comm);
- spin_lock_irq(&current->sighand->siglock);
- recalc_sigpending(); /* We sent fake signal, clean it up */
- spin_unlock_irq(&current->sighand->siglock);
-
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
+
+ spin_lock_irq(&freezer_lock);
+ current->flags |= PF_FROZEN;
if (!freezing(current) ||
- (check_kthr_stop && kthread_should_stop()))
+ (check_kthr_stop && kthread_should_stop())) {
+ current->flags &= ~PF_FROZEN;
+ break;
+ }
+ spin_unlock_irq(&freezer_lock);
+
+ if (!(current->flags & PF_FROZEN))
break;
was_frozen = true;
schedule();
}
- /* leave FROZEN */
- spin_lock_irq(&freezer_lock);
- current->flags &= ~PF_FROZEN;
- spin_unlock_irq(&freezer_lock);
+ spin_lock_irq(&current->sighand->siglock);
+ recalc_sigpending(); /* We sent fake signal, clean it up */
+ spin_unlock_irq(&current->sighand->siglock);
pr_debug("%s left refrigerator\n", current->comm);
^ permalink raw reply
* [PATCH pm-freezer 4/4] freezer: use lock_task_sighand() in fake_signal_wake_up()
From: Tejun Heo @ 2011-08-31 10:22 UTC (permalink / raw)
To: Rafael J. Wysocki, Oleg Nesterov, Paul Menage
Cc: containers, linux-pm, linux-kernel
In-Reply-To: <20110831102210.GC2828@mtj.dyndns.org>
cgroup_freezer calls freeze_task() without holding tasklist_lock and,
if the task is exiting, its ->sighand may be gone by the time
fake_signal_wake_up() is called. Use lock_task_sighand() instead of
accessing ->sighand directly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Paul Menage <paul@paulmenage.org>
---
kernel/freezer.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
Index: work/kernel/freezer.c
===================================================================
--- work.orig/kernel/freezer.c
+++ work/kernel/freezer.c
@@ -95,9 +95,10 @@ static void fake_signal_wake_up(struct t
{
unsigned long flags;
- spin_lock_irqsave(&p->sighand->siglock, flags);
- signal_wake_up(p, 0);
- spin_unlock_irqrestore(&p->sighand->siglock, flags);
+ if (lock_task_sighand(p, &flags)) {
+ signal_wake_up(p, 0);
+ unlock_task_sighand(p, &flags);
+ }
}
/**
^ permalink raw reply
* wait_event_freezable variant for TASK_KILLABLE?
From: Jeff Layton @ 2011-08-31 12:44 UTC (permalink / raw)
To: linux-pm; +Cc: linux-cifs, linux-kernel
I had a bug reported a while back that cifs mounts were preventing
machines from suspending. I can reproduce this pretty readily by simply
making a cifs mount, leaving it idle for a bit (so that the root dentry
will need to be revalidated) and then attempting to suspend the
machine. When I do that I get the following backtrace:
[ 5323.278130] PM: Syncing filesystems ... done.
[ 5323.313956] PM: Preparing system for mem sleep
[ 5323.435457] Freezing user space processes ...
[ 5343.444237] Freezing of tasks failed after 20.00 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 5343.444335] umount D ffff88011075dc00 0 7400 7383 0x00800084
[ 5343.444342] ffff8800c95e1b08 0000000000000086 ffff8800c95e1b40 ffff880000000001
[ 5343.444348] ffff880117965cc0 ffff8800c95e1fd8 ffff8800c95e1fd8 0000000000012540
[ 5343.444354] ffff8800d5d5c590 ffff880117965cc0 ffff8800c95e1b18 00000001c95e1ad8
[ 5343.444359] Call Trace:
[ 5343.444378] [<ffffffffa044f0ca>] wait_for_response+0x199/0x19e [cifs]
[ 5343.444384] [<ffffffff81070566>] ? remove_wait_queue+0x3a/0x3a
[ 5343.444392] [<ffffffffa044fe93>] SendReceive+0x184/0x285 [cifs]
[ 5343.444399] [<ffffffffa043a51e>] CIFSSMBUnixQPathInfo+0x167/0x212 [cifs]
[ 5343.444407] [<ffffffffa044ae90>] cifs_get_inode_info_unix+0x8e/0x165 [cifs]
[ 5343.444414] [<ffffffffa0444223>] ? build_path_from_dentry+0xe2/0x20d [cifs]
[ 5343.444418] [<ffffffff8111673e>] ? __kmalloc+0x103/0x115
[ 5343.444425] [<ffffffffa0444223>] ? build_path_from_dentry+0xe2/0x20d [cifs]
[ 5343.444431] [<ffffffffa0444223>] ? build_path_from_dentry+0xe2/0x20d [cifs]
[ 5343.444439] [<ffffffffa044c11d>] cifs_revalidate_dentry_attr+0x10b/0x172 [cifs]
[ 5343.444447] [<ffffffffa044c259>] cifs_getattr+0x7a/0xfc [cifs]
[ 5343.444451] [<ffffffff8112a9f7>] vfs_getattr+0x45/0x63
[ 5343.444454] [<ffffffff8112aa6d>] vfs_fstatat+0x58/0x6e
[ 5343.444457] [<ffffffff8112aabe>] vfs_stat+0x1b/0x1d
[ 5343.444460] [<ffffffff8112abbd>] sys_newstat+0x1a/0x33
[ 5343.444463] [<ffffffff8112f9e8>] ? path_put+0x20/0x24
[ 5343.444466] [<ffffffff810a0e84>] ? audit_syscall_entry+0x145/0x171
[ 5343.444469] [<ffffffff811302d1>] ? putname+0x34/0x36
[ 5343.444473] [<ffffffff8148e842>] system_call_fastpath+0x16/0x1b
[ 5343.444476]
[ 5343.444477] Restarting tasks ... done.
wait_for_response basically does this to put a task to sleep while it's
waiting for the server to respond:
error = wait_event_killable(server->response_q,
midQ->midState != MID_REQUEST_SUBMITTED);
NFS does similar sorts of things, and I think it has similar problems
with the freezer.
The problem there is pretty clear. That won't wake up unless you send
it a fatal signal, and we need it to wake up and freeze in that
situation. So, I made a stab at rolling a wait_event_freezekillable()
macro, based on wait_event_freezable.
-----------------------[snip]-----------------------------
#define wait_event_freezekillable(wq, condition)                        \
({                                                                      \
        int __retval;                                                   \
        do {                                                            \
                __retval = wait_event_killable(wq,                      \
                                (condition) || freezing(current));      \
                if (__retval && !freezing(current))                     \
                        break;                                          \
                else if (!(condition))                                  \
                        __retval = -ERESTARTSYS;                        \
        } while (try_to_freeze());                                      \
        __retval;                                                       \
})
-----------------------[snip]-----------------------------
However, I still got the same problem when trying to put the task to
sleep. I could dig in and try to figure out why this isn't working like
I expect, but I figured I'd ask here first to see if I can determine
whether linux-pm has advice on how best to approach this. :)
Basically, what we'd like is something akin to wait_event_freezable,
but that only returns -ERESTARTSYS on fatal signals (SIGKILL).
Thoughts?
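To pin down the semantics being asked for, here is a minimal user-space
model. It is not kernel code and does not touch wait queues or the freezer;
it only replays a scripted list of wakeups and reports what the wait should
return. The names wake_reason, model_wait_freezekillable and
MODEL_STILL_WAITING are hypothetical, and MODEL_ERESTARTSYS is a local
stand-in for the kernel's ERESTARTSYS.

```c
#include <stddef.h>

/* Local stand-in for the kernel's ERESTARTSYS (512); defined here only
 * so that the sketch compiles in user space. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical marker: the scripted wakeups ran out while the task was
 * still waiting. */
#define MODEL_STILL_WAITING 1

/* Hypothetical reasons a sleeping task can be woken in this model. */
enum wake_reason {
	WAKE_CONDITION,		/* the awaited condition became true */
	WAKE_FREEZE,		/* the freezer asked the task to freeze */
	WAKE_FATAL_SIGNAL	/* SIGKILL or another fatal signal arrived */
};

/*
 * Replay the wakeups in order and return what the wait should return:
 *   0                   - the condition became true
 *   -MODEL_ERESTARTSYS  - a fatal signal interrupted the wait
 *   MODEL_STILL_WAITING - the script ended with the task still asleep
 * A freeze request is absorbed: the task freezes, is thawed, and goes
 * back to waiting. *freezes counts how often that happened.
 */
int model_wait_freezekillable(const enum wake_reason *wakeups, size_t n,
			      int *freezes)
{
	size_t i;

	*freezes = 0;
	for (i = 0; i < n; i++) {
		switch (wakeups[i]) {
		case WAKE_CONDITION:
			return 0;
		case WAKE_FATAL_SIGNAL:
			return -MODEL_ERESTARTSYS;
		case WAKE_FREEZE:
			(*freezes)++;	/* freeze, thaw, then wait again */
			break;
		}
	}
	return MODEL_STILL_WAITING;
}
```

The point of the model is only the return-value contract: a freeze request
never surfaces to the caller, while a fatal signal always does.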
--
Jeff Layton <jlayton@redhat.com>
^ permalink raw reply
* Re: [PATCH 6/6] cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
From: Frederic Weisbecker @ 2011-08-31 13:42 UTC (permalink / raw)
To: Tejun Heo; +Cc: containers, lizf, linux-kernel, linux-pm, paul, kamezawa.hiroyu
In-Reply-To: <20110831070313.GA29179@mtj.dyndns.org>
On Wed, Aug 31, 2011 at 09:03:13AM +0200, Tejun Heo wrote:
> Hello, Frederic.
>
> On Tue, Aug 30, 2011 at 10:10:32PM +0200, Frederic Weisbecker wrote:
> > In order to keep the fix queued in -mm (https://lkml.org/lkml/2011/8/26/262)
> > the tasks that have failed to migrate should be removed from the iterator
> > so that they are not included in the batch in ->attach().
>
> I don't think that's a good approach. It breaks the symmetry when
> calling different callbacks. What if ->can_attach() allocates
> per-task resources and the task exits in the middle? I think we
> better lock down fork/exit/exec. I'll send patches but I'm currently
> moving / traveling w/ limited access to my toys so it might take some
> time.
My task counter subsystem patchset adds a cancel_attach_task() callback
that undoes the effects of can_attach_task().
I thought that, once rebased on top of your patch, it would be merged into
cancel_attach(), but OTOH we can't cancel the effect of a failed migration
on a thread that way.
Maybe we need to keep a cancel_attach_task() just for that purpose?
^ permalink raw reply
* Re: [PATCH pm-freezer 1/4] cgroup_freezer: fix freezer->state setting bug in freezer_change_state()
From: Oleg Nesterov @ 2011-08-31 18:08 UTC (permalink / raw)
To: Tejun Heo; +Cc: linux-kernel, Paul Menage, containers, linux-pm
In-Reply-To: <20110831102100.GA2828@mtj.dyndns.org>
On 08/31, Tejun Heo wrote:
>
> I'm in the process of moving and can only use a quite old laptop. I
> tested compile but couldn't really do much else, so please proceed
> with caution. Oleg, can you please ack the patches if you agree with
> the updated versions?
Everything looks fine. But I am already sleeping now ;)
Rafael, Tejun, I'll try to re-read 1-4 tomorrow. I do not expect to
find anything interesting, I am just paranoid.
It looks like 1/4 could have an additional note in the changelog: with
this patch we avoid the unnecessary try_to_freeze_cgroup(), and this
looks like a win to me...
Oleg.
^ permalink raw reply
* [RFC PATCH v1] ACPI S3 to work under Xen.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel
Attached is an RFC set of patches to enable S3 to work with the Xen hypervisor.
The relationship that Xen has with the Linux kernel is symbiotic. The Linux
kernel does the ACPI "stuff" and tells the hypervisor to do the low-level
stuff (such as programming the IOAPIC, setting up vectors, etc). The realm of
ACPI S3 is more complex as we need to save the CPU state (and Intel TXT
values - which the hypervisor has to do) and then restore them.
The major difficulty we hit was with 'acpi_suspend_lowlevel' - which tweaks
a lot of low-level values, some of which are not properly handled by Xen.
Liang Tang has figured out which of them we trip over (read below) - and he
suggested that perhaps we can provide a registration mechanism to abstract
this away.
So the attached patches do exactly that - there are two entry points
in the ACPI.
1). For S3: acpi_suspend_lowlevel -> .. lots of code -> acpi_enter_sleep_state
2). For S1/S4/S5: acpi_enter_sleep_state
The first naive idea was to abstract away the tboot_sleep code in the
'acpi_enter_sleep_state' function so that we can use it too. And lo and
behold - it worked splendidly for powering off (S5 I believe).
For S3 that did not work - during suspend the hypervisor tripped over
saving cr8. During resume it tripped over restoring the cr3, cr8, idt,
and gdt values.
What do you guys think? One thought is to use the paravirt interface to
deal with cr3, cr8, idt, gdt for the suspend/resume case. But that is a lot
of extra 'if's in the paravirt code - and the callback registration would
effectively do the same thing as paravirt, except at a higher level.
Thoughts?
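To make the callback-registration idea concrete, here is a minimal
user-space sketch of the hook in the sleep path. It only mirrors the shape of
the __acpi_override_sleep hook from this series; all names below
(sleep_override, enter_sleep_state, write_pm1_control and the three example
overrides) are hypothetical stand-ins, and plain int status codes are assumed
in place of acpi_status.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the override hook type. */
typedef int (*sleep_override_fn)(uint8_t sleep_state, uint32_t pm1a_ctrl,
				 uint32_t pm1b_ctrl, bool *skip_rest);

/* NULL means nobody registered an override (the bare-metal case). */
sleep_override_fn sleep_override = NULL;

/* Counts simulated writes to the PM1 control registers. */
int pm1_writes = 0;

int write_pm1_control(uint32_t pm1a_ctrl, uint32_t pm1b_ctrl)
{
	(void)pm1a_ctrl;
	(void)pm1b_ctrl;
	pm1_writes++;
	return 0;
}

/* Generic sleep path: consult the override first, then fall through. */
int enter_sleep_state(uint8_t sleep_state, uint32_t pm1a_ctrl,
		      uint32_t pm1b_ctrl)
{
	if (sleep_override) {
		bool skip_rest = false;
		int status = sleep_override(sleep_state, pm1a_ctrl,
					    pm1b_ctrl, &skip_rest);

		if (status != 0)	/* override failed: abort */
			return status;
		if (skip_rest)		/* override did all the work */
			return 0;
	}
	return write_pm1_control(pm1a_ctrl, pm1b_ctrl);
}

/* Override that handles the whole request itself (hypervisor-like). */
int override_handles_all(uint8_t s, uint32_t a, uint32_t b, bool *skip_rest)
{
	(void)s; (void)a; (void)b;
	*skip_rest = true;
	return 0;
}

/* Override that only prepares and lets the generic path continue. */
int override_prepare_only(uint8_t s, uint32_t a, uint32_t b, bool *skip_rest)
{
	(void)s; (void)a; (void)b;
	*skip_rest = false;
	return 0;
}

/* Override that reports an error, e.g. an unsupported sleep state. */
int override_fails(uint8_t s, uint32_t a, uint32_t b, bool *skip_rest)
{
	(void)s; (void)a; (void)b; (void)skip_rest;
	return -1;
}
```

With no override registered the generic path writes the PM1 registers as
before; an override can either let that write happen, suppress it through
skip_rest, or abort the transition by returning an error.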
Konrad Rzeszutek Wilk (5):
x86: Expand the x86_msi_ops to have a restore MSIs.
x86, acpi, tboot: Have a ACPI sleep override instead of calling tboot_sleep.
xen: Utilize the restore_msi_irqs hook.
xen/acpi/sleep: Enable ACPI sleep via the __acpi_override_sleep
xen/acpi/sleep: Register to the acpi_suspend_lowlevel a callback.
Liang Tang (1):
x86/acpi/sleep: Provide registration for acpi_suspend_lowlevel.
Yu Ke (1):
xen/acpi: Domain0 acpi parser related platform hypercall
arch/ia64/include/asm/xen/interface.h | 1 +
arch/x86/include/asm/acpi.h | 5 +-
arch/x86/include/asm/pci.h | 9 +
arch/x86/include/asm/x86_init.h | 1 +
arch/x86/include/asm/xen/hypercall.h | 8 +
arch/x86/include/asm/xen/interface.h | 1 +
arch/x86/kernel/acpi/boot.c | 5 +
arch/x86/kernel/acpi/sleep.c | 4 +-
arch/x86/kernel/acpi/sleep.h | 2 +
arch/x86/kernel/tboot.c | 13 +-
arch/x86/kernel/x86_init.c | 1 +
arch/x86/pci/xen.c | 12 ++
arch/x86/xen/enlighten.c | 3 +
drivers/acpi/acpica/hwsleep.c | 12 +-
drivers/acpi/sleep.c | 2 +
drivers/pci/msi.c | 29 +++-
drivers/xen/Makefile | 2 +-
drivers/xen/acpi.c | 25 +++
include/linux/tboot.h | 3 +-
include/xen/acpi.h | 38 ++++
include/xen/interface/physdev.h | 7 +
include/xen/interface/platform.h | 320 +++++++++++++++++++++++++++++++++
include/xen/interface/xen.h | 1 +
23 files changed, 491 insertions(+), 13 deletions(-)
^ permalink raw reply
* [PATCH 1/7] x86: Expand the x86_msi_ops to have a restore MSIs.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
The MSI restore function will become a function pointer in an
x86_msi_ops struct. It defaults to the implementation in the
io_apic.c and msi.c. We piggyback on the indirection mechanism
introduced by "x86: Introduce x86_msi_ops".
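As a rough illustration of the indirection described above, the following
user-space sketch models an ops table whose restore hook defaults to a
generic implementation and can be replaced by a platform. It is not the
kernel code: msi_ops_model, pci_dev_model, default_restore,
platform_restore and restore_msi_state are hypothetical names, and the
counters merely stand in for the real register writes or hypercalls.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct pci_dev. */
struct pci_dev_model {
	int bus;
	int devfn;
};

/* Hypothetical stand-in for the ops table holding the restore hook. */
struct msi_ops_model {
	void (*restore_msi_irqs)(struct pci_dev_model *dev, int irq);
};

int default_restores = 0;	/* calls that took the generic path */
int platform_restores = 0;	/* calls that took the platform path */

/* Generic path: would rewrite the saved MSI message for this irq. */
void default_restore(struct pci_dev_model *dev, int irq)
{
	(void)dev;
	(void)irq;
	default_restores++;
}

/* Platform path: would ask the platform to restore the vectors. */
void platform_restore(struct pci_dev_model *dev, int irq)
{
	(void)dev;
	(void)irq;
	platform_restores++;
}

/* The table starts out pointing at the generic implementation. */
struct msi_ops_model msi_ops = {
	.restore_msi_irqs = default_restore,
};

/* Resume path: every irq goes through the hook, never a direct call. */
void restore_msi_state(struct pci_dev_model *dev, const int *irqs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		msi_ops.restore_msi_irqs(dev, irqs[i]);
}
```

A platform would switch implementations by assigning its own function to
msi_ops.restore_msi_irqs once during init; the resume path itself does not
change.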
Cc: x86@kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
arch/x86/include/asm/pci.h | 9 +++++++++
arch/x86/include/asm/x86_init.h | 1 +
arch/x86/kernel/x86_init.c | 1 +
drivers/pci/msi.c | 29 +++++++++++++++++++++++++++--
4 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index d498943..df75d07 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -112,19 +112,28 @@ static inline void x86_teardown_msi_irq(unsigned int irq)
{
x86_msi.teardown_msi_irq(irq);
}
+static inline void x86_restore_msi_irqs(struct pci_dev *dev, int irq)
+{
+ x86_msi.restore_msi_irqs(dev, irq);
+}
#define arch_setup_msi_irqs x86_setup_msi_irqs
#define arch_teardown_msi_irqs x86_teardown_msi_irqs
#define arch_teardown_msi_irq x86_teardown_msi_irq
+#define arch_restore_msi_irqs x86_restore_msi_irqs
/* implemented in arch/x86/kernel/apic/io_apic. */
int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
void native_teardown_msi_irq(unsigned int irq);
+void native_restore_msi_irqs(struct pci_dev *dev, int irq);
/* default to the implementation in drivers/lib/msi.c */
#define HAVE_DEFAULT_MSI_TEARDOWN_IRQS
+#define HAVE_DEFAULT_MSI_RESTORE_IRQS
void default_teardown_msi_irqs(struct pci_dev *dev);
+void default_restore_msi_irqs(struct pci_dev *dev, int irq);
#else
#define native_setup_msi_irqs NULL
#define native_teardown_msi_irq NULL
#define default_teardown_msi_irqs NULL
+#define default_restore_msi_irqs NULL
#endif
#define PCI_DMA_BUS_IS_PHYS (dma_ops->is_phys)
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index d3d8590..7af18be 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -174,6 +174,7 @@ struct x86_msi_ops {
int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type);
void (*teardown_msi_irq)(unsigned int irq);
void (*teardown_msi_irqs)(struct pci_dev *dev);
+ void (*restore_msi_irqs)(struct pci_dev *dev, int irq);
};
extern struct x86_init_ops x86_init;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 6f164bd..bd1fe10 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -110,4 +110,5 @@ struct x86_msi_ops x86_msi = {
.setup_msi_irqs = native_setup_msi_irqs,
.teardown_msi_irq = native_teardown_msi_irq,
.teardown_msi_irqs = default_teardown_msi_irqs,
+ .restore_msi_irqs = default_restore_msi_irqs,
};
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 2f10328..f1fd801 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -85,6 +85,31 @@ void default_teardown_msi_irqs(struct pci_dev *dev)
}
#endif
+#ifndef arch_restore_msi_irqs
+# define arch_restore_msi_irqs default_restore_msi_irqs
+# define HAVE_DEFAULT_MSI_RESTORE_IRQS
+#endif
+
+#ifdef HAVE_DEFAULT_MSI_RESTORE_IRQS
+void default_restore_msi_irqs(struct pci_dev *dev, int irq)
+{
+ struct msi_desc *entry;
+
+ entry = NULL;
+ if (dev->msix_enabled) {
+ list_for_each_entry(entry, &dev->msi_list, list) {
+ if (irq == entry->irq)
+ break;
+ }
+ } else if (dev->msi_enabled) {
+ entry = irq_get_msi_desc(irq);
+ }
+
+ if (entry)
+ write_msi_msg(irq, &entry->msg);
+}
+#endif
+
static void msi_set_enable(struct pci_dev *dev, int pos, int enable)
{
u16 control;
@@ -359,7 +384,7 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
pci_intx_for_msi(dev, 0);
msi_set_enable(dev, pos, 0);
- write_msi_msg(dev->irq, &entry->msg);
+ arch_restore_msi_irqs(dev, dev->irq);
pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &control);
msi_mask_irq(entry, msi_capable_mask(control), entry->masked);
@@ -387,7 +412,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
pci_write_config_word(dev, pos + PCI_MSIX_FLAGS, control);
list_for_each_entry(entry, &dev->msi_list, list) {
- write_msi_msg(entry->irq, &entry->msg);
+ arch_restore_msi_irqs(dev, entry->irq);
msix_mask_irq(entry, entry->masked);
}
--
1.7.4.1
^ permalink raw reply related
* [PATCH 2/7] x86, acpi, tboot: Have a ACPI sleep override instead of calling tboot_sleep.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
The ACPI suspend path makes a call to tboot_sleep right before
it writes the PM1A, PM1B values. We replace the direct call to
tboot with a registration callback similar to __acpi_register_gsi.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: Len Brown <len.brown@intel.com>
CC: Joseph Cihula <joseph.cihula@intel.com>
CC: Shane Wang <shane.wang@intel.com>
CC: xen-devel@lists.xensource.com
CC: linux-pm@lists.linux-foundation.org
CC: tboot-devel@lists.sourceforge.net
CC: linux-acpi@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
arch/x86/include/asm/acpi.h | 3 +++
arch/x86/kernel/acpi/boot.c | 3 +++
arch/x86/kernel/tboot.c | 13 +++++++++----
drivers/acpi/acpica/hwsleep.c | 12 ++++++++++--
include/linux/tboot.h | 3 ++-
5 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 610001d..49864a1 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -98,6 +98,9 @@ void acpi_pic_sci_set_trigger(unsigned int, u16);
extern int (*__acpi_register_gsi)(struct device *dev, u32 gsi,
int trigger, int polarity);
+extern int (*__acpi_override_sleep)(u8 sleep_state, u32 pm1a_ctrl,
+ u32 pm1b_ctrl, bool *skip_rest);
+
static inline void disable_acpi(void)
{
acpi_disabled = 1;
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 4558f0d..d191b4c 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -552,6 +552,9 @@ static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
int (*__acpi_register_gsi)(struct device *dev, u32 gsi,
int trigger, int polarity) = acpi_register_gsi_pic;
+int (*__acpi_override_sleep)(u8 sleep_state, u32 pm1a_ctrl,
+ u32 pm1b_ctrl, bool *skip_rest) = NULL;
+
/*
* success: return IRQ number (>=0)
* failure: return < 0
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 30ac65d..a18070c 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -41,7 +41,7 @@
#include <asm/setup.h>
#include <asm/e820.h>
#include <asm/io.h>
-
+#include <linux/acpi.h>
#include "acpi/realmode/wakeup.h"
/* Global pointer to shared data; NULL means no measured launch. */
@@ -270,7 +270,8 @@ static void tboot_copy_fadt(const struct acpi_table_fadt *fadt)
offsetof(struct acpi_table_facs, firmware_waking_vector);
}
-void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
+int tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control,
+ bool *skip_rest)
{
static u32 acpi_shutdown_map[ACPI_S_STATE_COUNT] = {
/* S0,1,2: */ -1, -1, -1,
@@ -279,7 +280,7 @@ void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
/* S5: */ TB_SHUTDOWN_S5 };
if (!tboot_enabled())
- return;
+ return AE_OK;
tboot_copy_fadt(&acpi_gbl_FADT);
tboot->acpi_sinfo.pm1a_cnt_val = pm1a_control;
@@ -290,10 +291,12 @@ void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control)
if (sleep_state >= ACPI_S_STATE_COUNT ||
acpi_shutdown_map[sleep_state] == -1) {
pr_warning("unsupported sleep state 0x%x\n", sleep_state);
- return;
+ return AE_ERROR;
}
tboot_shutdown(acpi_shutdown_map[sleep_state]);
+
+ return AE_OK;
}
static atomic_t ap_wfs_count;
@@ -343,6 +346,8 @@ static __init int tboot_late_init(void)
atomic_set(&ap_wfs_count, 0);
register_hotcpu_notifier(&tboot_cpu_notifier);
+
+ __acpi_override_sleep = tboot_sleep;
return 0;
}
diff --git a/drivers/acpi/acpica/hwsleep.c b/drivers/acpi/acpica/hwsleep.c
index 2ac28bb..31d1198 100644
--- a/drivers/acpi/acpica/hwsleep.c
+++ b/drivers/acpi/acpica/hwsleep.c
@@ -45,7 +45,6 @@
#include <acpi/acpi.h>
#include "accommon.h"
#include "actables.h"
-#include <linux/tboot.h>
#define _COMPONENT ACPI_HARDWARE
ACPI_MODULE_NAME("hwsleep")
@@ -343,8 +342,17 @@ acpi_status asmlinkage acpi_enter_sleep_state(u8 sleep_state)
ACPI_FLUSH_CPU_CACHE();
- tboot_sleep(sleep_state, pm1a_control, pm1b_control);
+ if (__acpi_override_sleep) {
+ bool skip_rest = false;
+ status = __acpi_override_sleep(sleep_state, pm1a_control,
+ pm1b_control, &skip_rest);
+
+ if (ACPI_FAILURE(status))
+ return_ACPI_STATUS(status);
+ if (skip_rest)
+ return_ACPI_STATUS(AE_OK);
+ }
/* Write #2: Write both SLP_TYP + SLP_EN */
status = acpi_hw_write_pm1_control(pm1a_control, pm1b_control);
diff --git a/include/linux/tboot.h b/include/linux/tboot.h
index 1dba6ee..19badbd 100644
--- a/include/linux/tboot.h
+++ b/include/linux/tboot.h
@@ -143,7 +143,8 @@ static inline int tboot_enabled(void)
extern void tboot_probe(void);
extern void tboot_shutdown(u32 shutdown_type);
-extern void tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control);
+extern int tboot_sleep(u8 sleep_state, u32 pm1a_control, u32 pm1b_control,
+ bool *skip);
extern struct acpi_table_header *tboot_get_dmar_table(
struct acpi_table_header *dmar_tbl);
extern int tboot_force_iommu(void);
--
1.7.4.1
^ permalink raw reply related
* [PATCH 3/7] x86/acpi/sleep: Provide registration for acpi_suspend_lowlevel.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
From: Liang Tang <liang.tang@oracle.com>
Which by default will be x86_acpi_suspend_lowlevel.
This registration allows us to register another callback
if there is a need to use another platform specific callback.
CC: Thomas Gleixner <tglx@linutronix.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: Len Brown <len.brown@intel.com>
CC: Joseph Cihula <joseph.cihula@intel.com>
CC: Shane Wang <shane.wang@intel.com>
CC: linux-pm@lists.linux-foundation.org
CC: linux-acpi@vger.kernel.org
CC: Len Brown <len.brown@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liang Tang <liang.tang@oracle.com>
---
arch/x86/include/asm/acpi.h | 2 +-
arch/x86/kernel/acpi/boot.c | 2 ++
arch/x86/kernel/acpi/sleep.c | 4 ++--
arch/x86/kernel/acpi/sleep.h | 2 ++
drivers/acpi/sleep.c | 2 ++
5 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 49864a1..a5f0b73 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -118,7 +118,7 @@ static inline void acpi_disable_pci(void)
}
/* Low-level suspend routine. */
-extern int acpi_suspend_lowlevel(void);
+extern int (*acpi_suspend_lowlevel)(void);
extern const unsigned char acpi_wakeup_code[];
#define acpi_wakeup_address (__pa(TRAMPOLINE_SYM(acpi_wakeup_code)))
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index d191b4c..92f4f38 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -44,6 +44,7 @@
#include <asm/mpspec.h>
#include <asm/smp.h>
+#include "sleep.h" /* To include x86_acpi_suspend_lowlevel */
static int __initdata acpi_force = 0;
u32 acpi_rsdt_forced;
int acpi_disabled;
@@ -555,6 +556,7 @@ int (*__acpi_register_gsi)(struct device *dev, u32 gsi,
int (*__acpi_override_sleep)(u8 sleep_state, u32 pm1a_ctrl,
u32 pm1b_ctrl, bool *skip_rest) = NULL;
+int (*acpi_suspend_lowlevel)(void) = x86_acpi_suspend_lowlevel;
/*
* success: return IRQ number (>=0)
* failure: return < 0
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 103b6ab..4d2d0b1 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -25,12 +25,12 @@ static char temp_stack[4096];
#endif
/**
- * acpi_suspend_lowlevel - save kernel state
+ * x86_acpi_suspend_lowlevel - save kernel state
*
* Create an identity mapped page table and copy the wakeup routine to
* low memory.
*/
-int acpi_suspend_lowlevel(void)
+int x86_acpi_suspend_lowlevel(void)
{
struct wakeup_header *header;
/* address in low memory of the wakeup routine. */
diff --git a/arch/x86/kernel/acpi/sleep.h b/arch/x86/kernel/acpi/sleep.h
index 416d4be..4d3feb5 100644
--- a/arch/x86/kernel/acpi/sleep.h
+++ b/arch/x86/kernel/acpi/sleep.h
@@ -13,3 +13,5 @@ extern unsigned long acpi_copy_wakeup_routine(unsigned long);
extern void wakeup_long64(void);
extern void do_suspend_lowlevel(void);
+
+extern int x86_acpi_suspend_lowlevel(void);
diff --git a/drivers/acpi/sleep.c b/drivers/acpi/sleep.c
index 6c94960..a6da454 100644
--- a/drivers/acpi/sleep.c
+++ b/drivers/acpi/sleep.c
@@ -254,6 +254,8 @@ static int acpi_suspend_enter(suspend_state_t pm_state)
break;
case ACPI_STATE_S3:
+ if (!acpi_suspend_lowlevel)
+ return -ENODEV;
error = acpi_suspend_lowlevel();
if (error)
return error;
--
1.7.4.1
^ permalink raw reply related
* [PATCH 4/7] xen: Utilize the restore_msi_irqs hook.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
to make a hypercall to restore the vectors in the MSI/MSI-X
configuration space.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
arch/x86/pci/xen.c | 12 ++++++++++++
include/xen/interface/physdev.h | 7 +++++++
2 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index f567965..f140999 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -241,6 +241,17 @@ static int xen_initdom_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
out:
return ret;
}
+
+static void xen_initdom_restore_msi_irqs(struct pci_dev *dev, int irq)
+{
+ int ret = 0;
+ struct physdev_restore_msi restore;
+
+ restore.bus = dev->bus->number;
+ restore.devfn = dev->devfn;
+ ret = HYPERVISOR_physdev_op(PHYSDEVOP_restore_msi, &restore);
+ WARN(ret && ret != -ENOSYS, "restore_msi -> %d\n", ret);
+}
#endif
#endif
@@ -458,6 +469,7 @@ static int __init pci_xen_initial_domain(void)
#ifdef CONFIG_PCI_MSI
x86_msi.setup_msi_irqs = xen_initdom_setup_msi_irqs;
x86_msi.teardown_msi_irq = xen_teardown_msi_irq;
+ x86_msi.restore_msi_irqs = xen_initdom_restore_msi_irqs;
#endif
xen_setup_acpi_sci();
__acpi_register_gsi = acpi_register_gsi_xen;
diff --git a/include/xen/interface/physdev.h b/include/xen/interface/physdev.h
index 534cac8..44aefa9 100644
--- a/include/xen/interface/physdev.h
+++ b/include/xen/interface/physdev.h
@@ -144,6 +144,13 @@ struct physdev_manage_pci {
uint8_t devfn;
};
+#define PHYSDEVOP_restore_msi 19
+struct physdev_restore_msi {
+ /* IN */
+ uint8_t bus;
+ uint8_t devfn;
+};
+
#define PHYSDEVOP_manage_pci_add_ext 20
struct physdev_manage_pci_ext {
/* IN */
--
1.7.4.1
^ permalink raw reply related
* [PATCH 5/7] xen/acpi: Domain0 acpi parser related platform hypercall
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Jeremy Fitzhardinge, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
From: Yu Ke <ke.yu@intel.com>
This patch implements the xen_platform_op hypercall, to pass the parsed
ACPI info to the hypervisor.
Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropriate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
arch/ia64/include/asm/xen/interface.h | 1 +
arch/x86/include/asm/xen/interface.h | 1 +
include/xen/interface/platform.h | 320 +++++++++++++++++++++++++++++++++
include/xen/interface/xen.h | 1 +
4 files changed, 323 insertions(+), 0 deletions(-)
create mode 100644 include/xen/interface/platform.h
diff --git a/arch/ia64/include/asm/xen/interface.h b/arch/ia64/include/asm/xen/interface.h
index e951e74..1d2427d 100644
--- a/arch/ia64/include/asm/xen/interface.h
+++ b/arch/ia64/include/asm/xen/interface.h
@@ -76,6 +76,7 @@ DEFINE_GUEST_HANDLE(char);
DEFINE_GUEST_HANDLE(int);
DEFINE_GUEST_HANDLE(long);
DEFINE_GUEST_HANDLE(void);
+DEFINE_GUEST_HANDLE(uint64_t);
typedef unsigned long xen_pfn_t;
DEFINE_GUEST_HANDLE(xen_pfn_t);
diff --git a/arch/x86/include/asm/xen/interface.h b/arch/x86/include/asm/xen/interface.h
index 5d4922a..a1f2db5 100644
--- a/arch/x86/include/asm/xen/interface.h
+++ b/arch/x86/include/asm/xen/interface.h
@@ -55,6 +55,7 @@ DEFINE_GUEST_HANDLE(char);
DEFINE_GUEST_HANDLE(int);
DEFINE_GUEST_HANDLE(long);
DEFINE_GUEST_HANDLE(void);
+DEFINE_GUEST_HANDLE(uint64_t);
#endif
#ifndef HYPERVISOR_VIRT_START
diff --git a/include/xen/interface/platform.h b/include/xen/interface/platform.h
new file mode 100644
index 0000000..c168468
--- /dev/null
+++ b/include/xen/interface/platform.h
@@ -0,0 +1,320 @@
+/******************************************************************************
+ * platform.h
+ *
+ * Hardware platform operations. Intended for use by domain-0 kernel.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2002-2006, K Fraser
+ */
+
+#ifndef __XEN_PUBLIC_PLATFORM_H__
+#define __XEN_PUBLIC_PLATFORM_H__
+
+#include "xen.h"
+
+#define XENPF_INTERFACE_VERSION 0x03000001
+
+/*
+ * Set clock such that it would read <secs,nsecs> after 00:00:00 UTC,
+ * 1 January, 1970 if the current system time was <system_time>.
+ */
+#define XENPF_settime 17
+struct xenpf_settime {
+ /* IN variables. */
+ uint32_t secs;
+ uint32_t nsecs;
+ uint64_t system_time;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_settime_t);
+
+/*
+ * Request memory range (@mfn, @mfn+@nr_mfns-1) to have type @type.
+ * On x86, @type is an architecture-defined MTRR memory type.
+ * On success, returns the MTRR that was used (@reg) and a handle that can
+ * be passed to XENPF_DEL_MEMTYPE to accurately tear down the new setting.
+ * (x86-specific).
+ */
+#define XENPF_add_memtype 31
+struct xenpf_add_memtype {
+ /* IN variables. */
+ unsigned long mfn;
+ uint64_t nr_mfns;
+ uint32_t type;
+ /* OUT variables. */
+ uint32_t handle;
+ uint32_t reg;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_add_memtype_t);
+
+/*
+ * Tear down an existing memory-range type. If @handle is remembered then it
+ * should be passed in to accurately tear down the correct setting (in case
+ * of overlapping memory regions with differing types). If it is not known
+ * then @handle should be set to zero. In all cases @reg must be set.
+ * (x86-specific).
+ */
+#define XENPF_del_memtype 32
+struct xenpf_del_memtype {
+ /* IN variables. */
+ uint32_t handle;
+ uint32_t reg;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_del_memtype_t);
+
+/* Read current type of an MTRR (x86-specific). */
+#define XENPF_read_memtype 33
+struct xenpf_read_memtype {
+ /* IN variables. */
+ uint32_t reg;
+ /* OUT variables. */
+ unsigned long mfn;
+ uint64_t nr_mfns;
+ uint32_t type;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_read_memtype_t);
+
+#define XENPF_microcode_update 35
+struct xenpf_microcode_update {
+ /* IN variables. */
+ GUEST_HANDLE(void) data; /* Pointer to microcode data */
+ uint32_t length; /* Length of microcode data. */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_microcode_update_t);
+
+#define XENPF_platform_quirk 39
+#define QUIRK_NOIRQBALANCING 1 /* Do not restrict IO-APIC RTE targets */
+#define QUIRK_IOAPIC_BAD_REGSEL 2 /* IO-APIC REGSEL forgets its value */
+#define QUIRK_IOAPIC_GOOD_REGSEL 3 /* IO-APIC REGSEL behaves properly */
+struct xenpf_platform_quirk {
+ /* IN variables. */
+ uint32_t quirk_id;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_platform_quirk_t);
+
+#define XENPF_firmware_info 50
+#define XEN_FW_DISK_INFO 1 /* from int 13 AH=08/41/48 */
+#define XEN_FW_DISK_MBR_SIGNATURE 2 /* from MBR offset 0x1b8 */
+#define XEN_FW_VBEDDC_INFO 3 /* from int 10 AX=4f15 */
+struct xenpf_firmware_info {
+ /* IN variables. */
+ uint32_t type;
+ uint32_t index;
+ /* OUT variables. */
+ union {
+ struct {
+ /* Int13, Fn48: Check Extensions Present. */
+ uint8_t device; /* %dl: bios device number */
+ uint8_t version; /* %ah: major version */
+ uint16_t interface_support; /* %cx: support bitmap */
+ /* Int13, Fn08: Legacy Get Device Parameters. */
+ uint16_t legacy_max_cylinder; /* %cl[7:6]:%ch: max cyl # */
+ uint8_t legacy_max_head; /* %dh: max head # */
+ uint8_t legacy_sectors_per_track; /* %cl[5:0]: max sector # */
+ /* Int13, Fn41: Get Device Parameters (as filled into %ds:%esi). */
+ /* NB. First uint16_t of buffer must be set to buffer size. */
+ GUEST_HANDLE(void) edd_params;
+ } disk_info; /* XEN_FW_DISK_INFO */
+ struct {
+ uint8_t device; /* bios device number */
+ uint32_t mbr_signature; /* offset 0x1b8 in mbr */
+ } disk_mbr_signature; /* XEN_FW_DISK_MBR_SIGNATURE */
+ struct {
+ /* Int10, AX=4F15: Get EDID info. */
+ uint8_t capabilities;
+ uint8_t edid_transfer_time;
+ /* must refer to 128-byte buffer */
+ GUEST_HANDLE(uchar) edid;
+ } vbeddc_info; /* XEN_FW_VBEDDC_INFO */
+ } u;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_firmware_info_t);
+
+#define XENPF_enter_acpi_sleep 51
+struct xenpf_enter_acpi_sleep {
+ /* IN variables */
+ uint16_t pm1a_cnt_val; /* PM1a control value. */
+ uint16_t pm1b_cnt_val; /* PM1b control value. */
+ uint32_t sleep_state; /* Which state to enter (Sn). */
+ uint32_t flags; /* Must be zero. */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_enter_acpi_sleep_t);
+
+#define XENPF_change_freq 52
+struct xenpf_change_freq {
+ /* IN variables */
+ uint32_t flags; /* Must be zero. */
+ uint32_t cpu; /* Physical cpu. */
+ uint64_t freq; /* New frequency (Hz). */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_change_freq_t);
+
+/*
+ * Get idle times (nanoseconds since boot) for physical CPUs specified in the
+ * @cpumap_bitmap with range [0..@cpumap_nr_cpus-1]. The @idletime array is
+ * indexed by CPU number; only entries with the corresponding @cpumap_bitmap
+ * bit set are written to. On return, @cpumap_bitmap is modified so that any
+ * non-existent CPUs are cleared. Such CPUs have their @idletime array entry
+ * cleared.
+ */
+#define XENPF_getidletime 53
+struct xenpf_getidletime {
+ /* IN/OUT variables */
+ /* IN: CPUs to interrogate; OUT: subset of IN which are present */
+ GUEST_HANDLE(uchar) cpumap_bitmap;
+ /* IN variables */
+ /* Size of cpumap bitmap. */
+ uint32_t cpumap_nr_cpus;
+ /* Must be indexable for every cpu in cpumap_bitmap. */
+ GUEST_HANDLE(uint64_t) idletime;
+ /* OUT variables */
+ /* System time when the idletime snapshots were taken. */
+ uint64_t now;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_getidletime_t);
+
+#define XENPF_set_processor_pminfo 54
+
+/* ability bits */
+#define XEN_PROCESSOR_PM_CX 1
+#define XEN_PROCESSOR_PM_PX 2
+#define XEN_PROCESSOR_PM_TX 4
+
+/* cmd type */
+#define XEN_PM_CX 0
+#define XEN_PM_PX 1
+#define XEN_PM_TX 2
+
+/* Px sub info type */
+#define XEN_PX_PCT 1
+#define XEN_PX_PSS 2
+#define XEN_PX_PPC 4
+#define XEN_PX_PSD 8
+
+struct xen_power_register {
+ uint32_t space_id;
+ uint32_t bit_width;
+ uint32_t bit_offset;
+ uint32_t access_size;
+ uint64_t address;
+};
+
+struct xen_processor_csd {
+ uint32_t domain; /* domain number of one dependent group */
+ uint32_t coord_type; /* coordination type */
+ uint32_t num; /* number of processors in same domain */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_processor_csd);
+
+struct xen_processor_cx {
+ struct xen_power_register reg; /* GAS for Cx trigger register */
+ uint8_t type; /* cstate value, c0: 0, c1: 1, ... */
+ uint32_t latency; /* worst latency (ms) to enter/exit this cstate */
+ uint32_t power; /* average power consumption(mW) */
+ uint32_t dpcnt; /* number of dependency entries */
+ GUEST_HANDLE(xen_processor_csd) dp; /* NULL if no dependency */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_processor_cx);
+
+struct xen_processor_flags {
+ uint32_t bm_control:1;
+ uint32_t bm_check:1;
+ uint32_t has_cst:1;
+ uint32_t power_setup_done:1;
+ uint32_t bm_rld_set:1;
+};
+
+struct xen_processor_power {
+ uint32_t count; /* number of C state entries in array below */
+ struct xen_processor_flags flags; /* global flags of this processor */
+ GUEST_HANDLE(xen_processor_cx) states; /* supported c states */
+};
+
+struct xen_pct_register {
+ uint8_t descriptor;
+ uint16_t length;
+ uint8_t space_id;
+ uint8_t bit_width;
+ uint8_t bit_offset;
+ uint8_t reserved;
+ uint64_t address;
+};
+
+struct xen_processor_px {
+ uint64_t core_frequency; /* megahertz */
+ uint64_t power; /* milliWatts */
+ uint64_t transition_latency; /* microseconds */
+ uint64_t bus_master_latency; /* microseconds */
+ uint64_t control; /* control value */
+ uint64_t status; /* success indicator */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_processor_px);
+
+struct xen_psd_package {
+ uint64_t num_entries;
+ uint64_t revision;
+ uint64_t domain;
+ uint64_t coord_type;
+ uint64_t num_processors;
+};
+
+struct xen_processor_performance {
+ uint32_t flags; /* flag for Px sub info type */
+ uint32_t platform_limit; /* Platform limitation on freq usage */
+ struct xen_pct_register control_register;
+ struct xen_pct_register status_register;
+ uint32_t state_count; /* total available performance states */
+ GUEST_HANDLE(xen_processor_px) states;
+ struct xen_psd_package domain_info;
+ uint32_t shared_type; /* coordination type of this processor */
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_processor_performance);
+
+struct xenpf_set_processor_pminfo {
+ /* IN variables */
+ uint32_t id; /* ACPI CPU ID */
+ uint32_t type; /* {XEN_PM_CX, XEN_PM_PX} */
+ union {
+ struct xen_processor_power power;/* Cx: _CST/_CSD */
+ struct xen_processor_performance perf; /* Px: _PPC/_PCT/_PSS/_PSD */
+ };
+};
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_set_processor_pminfo);
+
+struct xen_platform_op {
+ uint32_t cmd;
+ uint32_t interface_version; /* XENPF_INTERFACE_VERSION */
+ union {
+ struct xenpf_settime settime;
+ struct xenpf_add_memtype add_memtype;
+ struct xenpf_del_memtype del_memtype;
+ struct xenpf_read_memtype read_memtype;
+ struct xenpf_microcode_update microcode;
+ struct xenpf_platform_quirk platform_quirk;
+ struct xenpf_firmware_info firmware_info;
+ struct xenpf_enter_acpi_sleep enter_acpi_sleep;
+ struct xenpf_change_freq change_freq;
+ struct xenpf_getidletime getidletime;
+ struct xenpf_set_processor_pminfo set_pminfo;
+ uint8_t pad[128];
+ } u;
+};
+DEFINE_GUEST_HANDLE_STRUCT(xen_platform_op_t);
+
+#endif /* __XEN_PUBLIC_PLATFORM_H__ */
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
index 70213b4..d83cc08 100644
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -453,6 +453,7 @@ struct start_info {
/* These flags are passed in the 'flags' field of start_info_t. */
#define SIF_PRIVILEGED (1<<0) /* Is the domain privileged? */
#define SIF_INITDOMAIN (1<<1) /* Is this the initial control domain? */
+#define SIF_PM_MASK (0xFF<<8) /* reserve 1 byte for xen-pm options */
typedef uint64_t cpumap_t;
--
1.7.4.1
^ permalink raw reply related
* [PATCH 6/7] xen/acpi/sleep: Enable ACPI sleep via the __acpi_override_sleep
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
Register a callback that calls into Xen's ACPI sleep
functionality. This means that during S3/S5 we make the
XENPF_enter_acpi_sleep hypercall with the proper
PM1A/PM1B register values.
Based on Ke Yu's <ke.yu@intel.com> initial idea.
[ From http://xenbits.xensource.com/linux-2.6.18-xen.hg
change c68699484a65 ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
arch/x86/include/asm/xen/hypercall.h | 8 ++++++++
arch/x86/xen/enlighten.c | 3 +++
drivers/xen/Makefile | 2 +-
drivers/xen/acpi.c | 25 +++++++++++++++++++++++++
include/xen/acpi.h | 26 ++++++++++++++++++++++++++
5 files changed, 63 insertions(+), 1 deletions(-)
create mode 100644 drivers/xen/acpi.c
create mode 100644 include/xen/acpi.h
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index d240ea9..0c9894e 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -45,6 +45,7 @@
#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
#include <xen/interface/physdev.h>
+#include <xen/interface/platform.h>
/*
* The hypercall asms have to meet several constraints:
@@ -299,6 +300,13 @@ HYPERVISOR_set_timer_op(u64 timeout)
}
static inline int
+HYPERVISOR_dom0_op(struct xen_platform_op *platform_op)
+{
+ platform_op->interface_version = XENPF_INTERFACE_VERSION;
+ return _hypercall1(int, dom0_op, platform_op);
+}
+
+static inline int
HYPERVISOR_set_debugreg(int reg, unsigned long value)
{
return _hypercall2(int, set_debugreg, reg, value);
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 5525163..6962653 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -42,6 +42,7 @@
#include <xen/page.h>
#include <xen/hvm.h>
#include <xen/hvc-console.h>
+#include <xen/acpi.h>
#include <asm/paravirt.h>
#include <asm/apic.h>
@@ -1250,6 +1251,8 @@ asmlinkage void __init xen_start_kernel(void)
} else {
/* Make sure ACS will be enabled */
pci_request_acs();
+
+ xen_acpi_sleep_register();
}
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index bbc1825..370552d 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -16,7 +16,7 @@ obj-$(CONFIG_XENFS) += xenfs/
obj-$(CONFIG_XEN_SYS_HYPERVISOR) += sys-hypervisor.o
obj-$(CONFIG_XEN_PLATFORM_PCI) += xen-platform-pci.o
obj-$(CONFIG_SWIOTLB_XEN) += swiotlb-xen.o
-obj-$(CONFIG_XEN_DOM0) += pci.o
+obj-$(CONFIG_XEN_DOM0) += pci.o acpi.o
xen-evtchn-y := evtchn.o
xen-gntdev-y := gntdev.o
diff --git a/drivers/xen/acpi.c b/drivers/xen/acpi.c
new file mode 100644
index 0000000..c0f829f
--- /dev/null
+++ b/drivers/xen/acpi.c
@@ -0,0 +1,25 @@
+#include <xen/acpi.h>
+#include <xen/interface/platform.h>
+#include <asm/xen/hypercall.h>
+#include <asm/xen/hypervisor.h>
+
+int xen_acpi_notify_hypervisor_state(u8 sleep_state,
+ u32 pm1a_cnt, u32 pm1b_cnt,
+ bool *skip_rest)
+{
+ struct xen_platform_op op = {
+ .cmd = XENPF_enter_acpi_sleep,
+ .interface_version = XENPF_INTERFACE_VERSION,
+ .u = {
+ .enter_acpi_sleep = {
+ .pm1a_cnt_val = (u16)pm1a_cnt,
+ .pm1b_cnt_val = (u16)pm1b_cnt,
+ .sleep_state = sleep_state,
+ },
+ },
+ };
+ if (skip_rest)
+ *skip_rest = true;
+
+ return HYPERVISOR_dom0_op(&op);
+}
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
new file mode 100644
index 0000000..e414f14
--- /dev/null
+++ b/include/xen/acpi.h
@@ -0,0 +1,26 @@
+#ifndef _XEN_ACPI_H
+#define _XEN_ACPI_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_XEN_DOM0
+#include <asm/xen/hypervisor.h>
+#include <xen/xen.h>
+#include <linux/acpi.h>
+
+int xen_acpi_notify_hypervisor_state(u8 sleep_state,
+ u32 pm1a_cnt, u32 pm1b_cnd,
+ bool *skip_rest);
+
+static inline void xen_acpi_sleep_register(void)
+{
+ if (xen_initial_domain())
+ __acpi_override_sleep = xen_acpi_notify_hypervisor_state;
+}
+#else
+static inline void xen_acpi_sleep_register(void)
+{
+}
+#endif
+
+#endif /* _XEN_ACPI_H */
--
1.7.4.1
^ permalink raw reply related
* [PATCH 7/7] xen/acpi/sleep: Register to the acpi_suspend_lowlevel a callback.
From: Konrad Rzeszutek Wilk @ 2011-08-31 18:31 UTC (permalink / raw)
To: x86, tglx, tboot-devel, shane.wang, linux-pm, linux-acpi,
len.brown
Cc: xen-devel, Konrad Rzeszutek Wilk
In-Reply-To: <1314815484-4668-1-git-send-email-konrad.wilk@oracle.com>
We piggyback on "x86/acpi: Provide registration for acpi_suspend_lowlevel."
to register a Xen version of the callback. The callback does nothing
special, except that it skips x86_acpi_suspend_lowlevel. It has to
skip it because x86_acpi_suspend_lowlevel tries to save cr8 during
suspend (which the hypervisor does not support), and on the resume
path restores cr3, cr8, idt, and gdt, which clashes with what
the hypervisor has set up for the guest.
Signed-off-by: Liang Tang <liang.tang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
include/xen/acpi.h | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletions(-)
diff --git a/include/xen/acpi.h b/include/xen/acpi.h
index e414f14..0409919 100644
--- a/include/xen/acpi.h
+++ b/include/xen/acpi.h
@@ -12,10 +12,22 @@ int xen_acpi_notify_hypervisor_state(u8 sleep_state,
u32 pm1a_cnt, u32 pm1b_cnd,
bool *skip_rest);
+static inline int xen_acpi_suspend_lowlevel(void)
+{
+ /*
+ * Xen will save and restore CPU context, so
+ * we can skip that and just go straight to
+ * the suspend.
+ */
+ acpi_enter_sleep_state(ACPI_STATE_S3);
+ return 0;
+}
static inline void xen_acpi_sleep_register(void)
{
- if (xen_initial_domain())
+ if (xen_initial_domain()) {
+ acpi_suspend_lowlevel = xen_acpi_suspend_lowlevel;
__acpi_override_sleep = xen_acpi_notify_hypervisor_state;
+ }
}
#else
static inline void xen_acpi_sleep_register(void)
--
1.7.4.1
^ permalink raw reply related
* Re: [BUG] Soft-lockup during cpu-hotplug in VFS callpaths
From: Maciej Rutecki @ 2011-08-31 18:40 UTC (permalink / raw)
To: Srivatsa S. Bhat; +Cc: linux-fsdevel, linux-pm, linux-kernel
In-Reply-To: <4E550057.9070609@linux.vnet.ibm.com>
On Wednesday, 24 August 2011 at 15:44:55 Srivatsa S. Bhat wrote:
> Hi,
>
> While running stressful cpu hotplug tests along with kernel compilation
> running in background, soft-lockups are detected on multiple CPUs.
> Sometimes this also leads to hard lockups and kernel panic.
> All the soft-lockups seem to occur at vfsmount_lock_local_cpu() or other
> VFS callpaths.
>
>
> [37108.410813] BUG: soft lockup - CPU#5 stuck for 22s! [cc1:29669]
> <snip>
> [37108.694781] Call Trace:
> [37108.697306] [<ffffffff81199e70>] ? vfsmount_lock_local_lock_cpu+0x70/0x70
> [37108.704258] [<ffffffff81187cb5>] path_init+0x315/0x400
> [37108.709558] [<ffffffff8127c398>] ? __raw_spin_lock_init+0x38/0x70
> [37108.715812] [<ffffffff8118961c>] path_openat+0x8c/0x3f0
> [37108.721203] [<ffffffff81012129>] ? sched_clock+0x9/0x10
> [37108.726597] [<ffffffff8109416d>] ? sched_clock_cpu+0xcd/0x110
> [37108.732508] [<ffffffff810a178d>] ? trace_hardirqs_off+0xd/0x10
> [37108.738498] [<ffffffff8109421f>] ? local_clock+0x6f/0x80
> [37108.743970] [<ffffffff81189a99>] do_filp_open+0x49/0xa0
> [37108.749362] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
> [37108.754665] [<ffffffff8152584b>] ? _raw_spin_unlock+0x2b/0x40
> [37108.760575] [<ffffffff811982f3>] ? alloc_fd+0xc3/0x210
> [37108.765875] [<ffffffff81179607>] do_sys_open+0x107/0x1e0
> [37108.771352] [<ffffffff810d610f>] ? audit_syscall_entry+0x1bf/0x1f0
> [37108.777695] [<ffffffff81179720>] sys_open+0x20/0x30
> [37108.782741] [<ffffffff8152e202>] system_call_fastpath+0x16/0x1b
>
> Kernel version: 3.0.1, 3.0.3
> Hardware: Dual socket quad-core hyper-threaded Intel x86 machine
> Scenario:
> (a) Stressful cpu hotplug tests + kernel compilation
>
> (b) IRQ balancing had been disabled and all the IRQs were made to be
> routed to CPU 0 (except the ones that couldn't be routed).
>
> (c) Lockdep was enabled during kernel configuration.
>
> Steps (b) and (c) were done to dig deeper into the issue. However the same
> issue was observed by just doing step (a).
>
> Definitely there seems to be a race condition occurring here, because this
> issue is hit after sometime, after starting the tests. And the time it
> takes to hit the issue increases as we increase the number of debug print
> statements. In some cases (especially when the number of debug print
> statements were quite high), the stress on the machine had to be increased
> in order to hit the issue within measurable time. In my tests, a maximum
> of about 2 to 2.5 hours was sufficient, to hit this bug.
>
> Please find the console log attached with this mail.
>
> Any ideas on how to go about fixing this bug?
Is it a regression?
Regards
--
Maciej Rutecki
http://www.maciek.unixy.pl
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm
^ permalink raw reply
* Re: [PATCH v9 2/4] PM: Introduce devfreq: generic DVFS framework with device-specific OPPs
From: Turquette, Mike @ 2011-08-31 20:05 UTC (permalink / raw)
To: MyungJoo Ham
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <1314775779-21399-3-git-send-email-myungjoo.ham@samsung.com>
On Wed, Aug 31, 2011 at 12:29 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
[snip]
> +/**
> + * get_devfreq() - find devfreq struct. a wrapped find_device_devfreq()
> + * with mutex protection. exported for governors
> + * @dev: device pointer used to lookup device devfreq.
> + */
> +struct devfreq *get_devfreq(struct device *dev)
> +{
> + struct devfreq *ret;
> +
> + mutex_lock(&devfreq_list_lock);
> + ret = find_device_devfreq(dev);
> + mutex_unlock(&devfreq_list_lock);
You prevent changes to the devfreq list while searching (good) but
after returning the pointer there is no protection from that item
being removed from the list. Generally "get" and "put" functions do
more than just return a pointer: get functions often increment a
refcount, or hold a lock. The put function will decrement the
refcount or release the lock. Maybe you want something like the
following:
mutex_lock(&devfreq_list_lock);
ret = find_device_devfreq(dev);
mutex_lock(&devfreq->lock);
mutex_unlock(&devfreq_list_lock);
Then you need a corresponding put which does the mutex_unlock(&devfreq->lock).
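To make the hand-off concrete, here is a minimal userspace sketch of such a
get/put pair, assuming pthread mutexes in place of kernel mutexes. The names
struct devfreq_model, model_add(), model_get() and model_put() are made up
for illustration and are not part of the patch. The point is only the lock
ordering: the per-object lock is taken before the list lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct devfreq. */
struct devfreq_model {
	pthread_mutex_t lock;		/* per-object lock */
	const void *dev;		/* lookup key */
	struct devfreq_model *next;	/* singly linked list */
};

static pthread_mutex_t model_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct devfreq_model *model_list;

/* Add an object to the list; models registration. */
static void model_add(struct devfreq_model *df, const void *dev)
{
	pthread_mutex_init(&df->lock, NULL);
	df->dev = dev;
	pthread_mutex_lock(&model_list_lock);
	df->next = model_list;
	model_list = df;
	pthread_mutex_unlock(&model_list_lock);
}

/*
 * Look up by key and return with df->lock held. The object lock is taken
 * before the list lock is dropped, so a remover that needs the list lock
 * cannot slip in between. Returns NULL, with no lock held, if nothing
 * matches.
 */
static struct devfreq_model *model_get(const void *dev)
{
	struct devfreq_model *df;

	pthread_mutex_lock(&model_list_lock);
	for (df = model_list; df; df = df->next)
		if (df->dev == dev)
			break;
	if (df)
		pthread_mutex_lock(&df->lock);
	pthread_mutex_unlock(&model_list_lock);
	return df;
}

/* Release the lock taken by a successful model_get(). */
static void model_put(struct devfreq_model *df)
{
	pthread_mutex_unlock(&df->lock);
}
```

A caller pairs every successful model_get() with exactly one model_put(),
the same discipline as mutex_lock()/mutex_unlock().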
It looks like the only consumers of get_devfreq are the sysfs
show/store interfaces, which immediately hold devfreq->lock, so the
above proposal certainly fits the existing use case.
Also CPUfreq's "cpufreq_get" function does a nice job of protecting
the object from getting freed with a rw_semaphore. It is a bit more
"complicated" but makes for good reading.
> +
> + return ret;
> +}
> +
> +/**
> + * devfreq_do() - Check the usage profile of a given device and configure
> + * frequency and voltage accordingly
> + * @devfreq: devfreq info of the given device
> + */
> +static int devfreq_do(struct devfreq *devfreq)
> +{
> + struct opp *opp;
> + unsigned long freq;
> + int err;
> +
Put the mutex_is_locked check here? See more below.
> + err = devfreq->governor->get_target_freq(devfreq, &freq);
> + if (err)
> + return err;
> +
> + opp = opp_find_freq_ceil(devfreq->dev, &freq);
> + if (opp == ERR_PTR(-ENODEV))
> + opp = opp_find_freq_floor(devfreq->dev, &freq);
> +
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> +
> + if (devfreq->previous_freq == freq)
> + return 0;
> +
> + err = devfreq->profile->target(devfreq->dev, opp);
> + if (err)
> + return err;
> +
> + devfreq->previous_freq = freq;
> + return 0;
> +}
> +
> +/**
> + * update_devfreq() - Notify that the device OPP or frequency requirement
> + * has been changed. This function is exported for governors.
> + * @devfreq: the devfreq instance.
> + *
> + * Note: lock devfreq->lock before calling update_devfreq
> + */
> +int update_devfreq(struct devfreq *devfreq)
> +{
> + int err = 0;
> +
> + if (!mutex_is_locked(&devfreq->lock)) {
> + WARN(true, "devfreq->lock must be locked by the caller.\n");
> + return -EINVAL;
> + }
> +
> + /* Reevaluate the proper frequency */
> + err = devfreq_do(devfreq);
> + return err;
> +}
> +
> +/**
> + * devfreq_update() - Notify that the device OPP has been changed.
> + * @dev: the device whose OPP has been changed.
> + *
> + * Called by OPP notifier.
> + */
> +static int devfreq_update(struct notifier_block *nb, unsigned long type,
> + void *devp)
> +{
> + struct devfreq *devfreq = container_of(nb, struct devfreq, nb);
> + int ret;
> +
> + mutex_lock(&devfreq->lock);
> + ret = update_devfreq(devfreq);
> + mutex_unlock(&devfreq->lock);
The whole devfreq_update/update_devfreq pairing is redundant.
update_devfreq's purpose is to make sure the lock is held before going
further, and the only caller of update_devfreq is devfreq_update which
always holds the lock.
This still doesn't stop a bad driver writer from just calling
devfreq_do with an extern. Perhaps the lock detection should be moved
into devfreq_do and update_devfreq should go away?
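As a rough illustration of putting the check in the worker itself, here is a
userspace sketch, again with pthreads and hypothetical names (struct
scaled_dev, scaled_dev_do() and the lock helpers). It keeps an explicit
"held" flag because pthreads has no mutex_is_locked(); like
mutex_is_locked(), the flag only says that somebody holds the lock, not that
the caller does. In the kernel, lockdep_assert_held() would be the more
usual way to express this.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a frequency-scaled device. */
struct scaled_dev {
	pthread_mutex_t lock;
	bool lock_held;		/* changed only by the helpers below */
	unsigned long freq;
};

static void scaled_dev_init(struct scaled_dev *d, unsigned long freq)
{
	pthread_mutex_init(&d->lock, NULL);
	d->lock_held = false;
	d->freq = freq;
}

static void scaled_dev_lock(struct scaled_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->lock_held = true;
}

static void scaled_dev_unlock(struct scaled_dev *d)
{
	d->lock_held = false;
	pthread_mutex_unlock(&d->lock);
}

/*
 * The worker does its own check, so there is no separate wrapper whose
 * only job is to verify locking before calling it.
 */
static int scaled_dev_do(struct scaled_dev *d, unsigned long new_freq)
{
	if (!d->lock_held)
		return -EINVAL;
	d->freq = new_freq;
	return 0;
}
```

With the check inside the worker, every entry point gets it, including a
caller that reaches the function directly.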
> +
> + return ret;
> +}
> +
> +/**
> + * devfreq_monitor() - Periodically run devfreq_do()
> + * @work: the work struct used to run devfreq_monitor periodically.
> + *
> + */
> +static void devfreq_monitor(struct work_struct *work)
> +{
> + static unsigned long last_polled_at;
> + struct devfreq *devfreq, *tmp;
> + int error;
> + unsigned long jiffies_passed;
> + unsigned long next_jiffies = ULONG_MAX, now = jiffies;
> +
> + /* Initially last_polled_at = 0, polling every device at bootup */
> + jiffies_passed = now - last_polled_at;
> + last_polled_at = now;
> + if (jiffies_passed == 0)
> + jiffies_passed = 1;
> +
> + mutex_lock(&devfreq_list_lock);
Should not lock the list here. If we lock the list for all major
operations, it nullifies the performance benefit of having a mutex in
struct devfreq.
> +
> + list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
> + mutex_lock(&devfreq->lock);
> +
> + if (devfreq->next_polling == 0) {
> + mutex_unlock(&devfreq->lock);
> + continue;
> + }
> +
> + /*
> + * Reduce more next_polling if devfreq_wq took an extra
> + * delay. (i.e., CPU has been idled.)
> + */
> + if (devfreq->next_polling <= jiffies_passed) {
> + error = devfreq_do(devfreq);
> +
> + /* Remove a devfreq with an error. */
> + if (error && error != -EAGAIN) {
> + dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
> + error, devfreq->governor->name);
> +
> + list_del(&devfreq->node);
> + mutex_unlock(&devfreq->lock);
> + kfree(devfreq);
> + continue;
Should this error handling also unregister the OPP notifier? This
code duplicates portions of devfreq_remove_device. I propose instead
here we do the following:
mutex_unlock(&devfreq->lock);
_devfreq_remove_lock(devfreq);
/* this locks the list first,
* then locks devfreq->lock,
* then does the house cleaning
*/
continue;
> + }
> + devfreq->next_polling = devfreq->polling_jiffies;
> +
> + /* No more polling required (polling_ms changed) */
> + if (devfreq->next_polling == 0) {
> + mutex_unlock(&devfreq->lock);
> + continue;
> + }
> + } else {
> + devfreq->next_polling -= jiffies_passed;
> + }
> +
> + next_jiffies = (next_jiffies > devfreq->next_polling) ?
> + devfreq->next_polling : next_jiffies;
> +
> + mutex_unlock(&devfreq->lock);
> + }
> +
> + if (next_jiffies > 0 && next_jiffies < ULONG_MAX) {
> + polling = true;
> + queue_delayed_work(devfreq_wq, &devfreq_work, next_jiffies);
> + } else {
> + polling = false;
> + }
> +
> + mutex_unlock(&devfreq_list_lock);
Again, list should not be locked over this whole function. It blocks
other unrelated devfreq devices from scaling.
> +}
[snip]
> +/**
> + * devfreq_remove_device() - Remove devfreq feature from a device.
> + * @device: the device to remove devfreq feature.
> + */
> +int devfreq_remove_device(struct device *dev)
Why does this take a struct device*? Shouldn't it be a struct devfreq*?
If there is a case for removing a devfreq device with only struct
device* as input, how about:
int devfreq_remove_device(struct device *dev)
{
struct devfreq *devfreq;
mutex_lock(&devfreq_list_lock);
devfreq = find_device_devfreq(dev);
mutex_unlock(&devfreq_list_lock);
return _devfreq_remove_device(devfreq);
/* _devfreq_remove_device does the real work and can also be
called from devfreq_monitor */
}
Regards,
Mike
> +{
> + struct devfreq *devfreq;
> + struct srcu_notifier_head *nh;
> + int err = 0;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + mutex_lock(&devfreq_list_lock);
> + devfreq = find_device_devfreq(dev);
> + if (IS_ERR(devfreq)) {
> + err = PTR_ERR(devfreq);
> + goto out;
> + }
> +
> + mutex_lock(&devfreq->lock);
> + nh = opp_get_notifier(dev);
> + if (IS_ERR(nh)) {
> + err = PTR_ERR(nh);
> + mutex_unlock(&devfreq->lock);
> + goto out;
> + }
> +
> + list_del(&devfreq->node);
> +
> + if (devfreq->governor->exit)
> + devfreq->governor->exit(devfreq);
> +
> + srcu_notifier_chain_unregister(nh, &devfreq->nb);
> + mutex_unlock(&devfreq->lock);
> + kfree(devfreq);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return 0;
> +}
[snip]
^ permalink raw reply
* Re: [PATCH v9 3/4] PM / devfreq: add common sysfs interfaces
From: Turquette, Mike @ 2011-08-31 21:27 UTC (permalink / raw)
To: MyungJoo Ham
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <1314775779-21399-4-git-send-email-myungjoo.ham@samsung.com>
On Wed, Aug 31, 2011 at 12:29 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> Device specific sysfs interface /sys/devices/.../power/devfreq_*
> - governor R: name of governor
> - cur_freq R: current frequency
> - max_freq R: maximum operable frequency
> - min_freq R: minimum operable frequency
> - polling_interval R: polling interval in ms given with devfreq profile
> W: update polling interval.
>
> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
>
> --
> Changes from v8
> - applied per-devfreq locking mechanism
>
> Changes from v7
> - removed set_freq from the common devfreq interface
> - added get_devfreq, a mutex-protected wrapper for find_device_devfreq
> (for sysfs interfaces and later with governor-support)
> - corrected ABI documentation.
>
> Changes from v6
> - polling_interval is writable.
>
> Changes from v5
> - updated devfreq_update usage.
>
> Changes from v4
> - removed system-wide sysfs interface
> - removed tickling sysfs interface
> - added set_freq for userspace governor (and any other governors that
> require user input)
>
> Changes from v3
> - corrected sysfs API usage
> - corrected error messages
> - moved sysfs entry location
> - added sysfs entries
>
> Changes from v2
> - add ABI entries for devfreq sysfs interface
> ---
> Documentation/ABI/testing/sysfs-devices-power | 37 +++++
> drivers/devfreq/devfreq.c | 203 +++++++++++++++++++++++++
> 2 files changed, 240 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
> index 8ffbc25..57f4591 100644
> --- a/Documentation/ABI/testing/sysfs-devices-power
> +++ b/Documentation/ABI/testing/sysfs-devices-power
> @@ -165,3 +165,40 @@ Description:
>
> Not all drivers support this attribute. If it isn't supported,
> attempts to read or write it will yield I/O errors.
> +
> +What: /sys/devices/.../power/devfreq_governor
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_governor shows the name
> + of the governor used by the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_cur_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_cur_freq shows the current
> + frequency of the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_max_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_max_freq shows the
> + maximum operable frequency of the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_min_freq
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_min_freq shows the
> + minimum operable frequency of the corresponding device.
> +
> +What: /sys/devices/.../power/devfreq_polling_interval
> +Date: July 2011
> +Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
> +Description:
> + The /sys/devices/.../power/devfreq_polling_interval sets and
> + shows the requested polling interval of the corresponding
> + device. The values are represented in ms. If the value is less
> + than 1 jiffy, it is considered to be 0, which means no polling.
> diff --git a/drivers/devfreq/devfreq.c b/drivers/devfreq/devfreq.c
> index 621b863..1c46052 100644
> --- a/drivers/devfreq/devfreq.c
> +++ b/drivers/devfreq/devfreq.c
> @@ -37,6 +37,8 @@ static struct delayed_work devfreq_work;
> static LIST_HEAD(devfreq_list);
> static DEFINE_MUTEX(devfreq_list_lock);
>
> +static struct attribute_group dev_attr_group;
> +
> /**
> * find_device_devfreq() - find devfreq struct using device pointer
> * @dev: device pointer used to lookup device devfreq.
> @@ -191,6 +193,8 @@ static void devfreq_monitor(struct work_struct *work)
> dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
> error, devfreq->governor->name);
>
> + sysfs_unmerge_group(&devfreq->dev->kobj,
> + &dev_attr_group);
> list_del(&devfreq->node);
> mutex_unlock(&devfreq->lock);
> kfree(devfreq);
> @@ -293,6 +297,8 @@ int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
> queue_delayed_work(devfreq_wq, &devfreq_work,
> devfreq->next_polling);
> }
> +
> + sysfs_merge_group(&dev->kobj, &dev_attr_group);
> mutex_unlock(&devfreq->lock);
> goto out;
> err_init:
> @@ -333,6 +339,8 @@ int devfreq_remove_device(struct device *dev)
> goto out;
> }
>
> + sysfs_unmerge_group(&dev->kobj, &dev_attr_group);
> +
> list_del(&devfreq->node);
>
> if (devfreq->governor->exit)
> @@ -346,6 +354,201 @@ out:
> return 0;
> }
>
> +static ssize_t show_governor(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df;
> + ssize_t ret;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + ret = PTR_ERR(df);
> + goto out;
> + }
> +
> + mutex_lock(&df->lock);
> + if (!df->governor) {
> + ret = -EINVAL;
> + goto out_l;
> + }
> +
> + ret = sprintf(buf, "%s\n", df->governor->name);
> +out_l:
> + mutex_unlock(&df->lock);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return ret;
> +}
> +
> +static ssize_t show_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df;
> + ssize_t ret;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + ret = PTR_ERR(df);
> + goto out;
> + }
> +
> + ret = sprintf(buf, "%lu\n", df->previous_freq);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return ret;
> +}
> +
> +static ssize_t show_max_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df;
> + ssize_t ret;
> + unsigned long freq = ULONG_MAX;
> + struct opp *opp;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + ret = PTR_ERR(df);
> + goto out;
> + }
> +
> + mutex_lock(&df->lock);
> + opp = opp_find_freq_floor(df->dev, &freq);
> + if (IS_ERR(opp)) {
> + ret = PTR_ERR(opp);
> + goto out_l;
> + }
> +
> + ret = sprintf(buf, "%lu\n", freq);
> +out_l:
> + mutex_unlock(&df->lock);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return ret;
> +}
> +
> +static ssize_t show_min_freq(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df;
> + ssize_t ret;
> + unsigned long freq = 0;
> + struct opp *opp;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + ret = PTR_ERR(df);
> + goto out;
> + }
> +
> + mutex_lock(&df->lock);
> + opp = opp_find_freq_ceil(df->dev, &freq);
> + if (IS_ERR(opp)) {
> + ret = PTR_ERR(opp);
> + goto out_l;
> + }
> +
> + ret = sprintf(buf, "%lu\n", freq);
> +out_l:
> + mutex_unlock(&df->lock);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return ret;
> +}
> +
> +static ssize_t show_polling_interval(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct devfreq *df;
> + ssize_t ret;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + ret = PTR_ERR(df);
> + goto out;
> + }
> +
> + mutex_lock(&df->lock);
> + if (!df->profile) {
> + ret = -EINVAL;
> + goto out_l;
> + }
> +
> + ret = sprintf(buf, "%d\n", df->profile->polling_ms);
> +out_l:
> + mutex_unlock(&df->lock);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> + return ret;
> +}
> +
> +static ssize_t store_polling_interval(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct devfreq *df;
> + unsigned int value;
> + int ret;
> +
> + mutex_lock(&devfreq_list_lock);
> + df = find_device_devfreq(dev);
> + if (IS_ERR(df)) {
> + count = PTR_ERR(df);
> + goto out;
> + }
> + mutex_lock(&df->lock);
> + if (!df->profile) {
> + count = -EINVAL;
> + goto out_l;
> + }
> +
> + ret = sscanf(buf, "%u", &value);
> + if (ret != 1) {
> + count = -EINVAL;
> + goto out_l;
> + }
> +
> + df->profile->polling_ms = value;
> + df->next_polling = df->polling_jiffies
> + = msecs_to_jiffies(value);
> +
> + if (df->next_polling > 0 && !polling) {
> + polling = true;
> + queue_delayed_work(devfreq_wq, &devfreq_work,
> + df->next_polling);
> + }
> +out_l:
> + mutex_unlock(&df->lock);
> +out:
> + mutex_unlock(&devfreq_list_lock);
> +
> + return count;
> +}
> +
> +static DEVICE_ATTR(devfreq_governor, 0444, show_governor, NULL);
> +static DEVICE_ATTR(devfreq_cur_freq, 0444, show_freq, NULL);
> +static DEVICE_ATTR(devfreq_max_freq, 0444, show_max_freq, NULL);
> +static DEVICE_ATTR(devfreq_min_freq, 0444, show_min_freq, NULL);
> +static DEVICE_ATTR(devfreq_polling_interval, 0644, show_polling_interval,
Instead of DEVICE_ATTR, why don't you create your own ktype specific
to devfreq? That would also mean that you don't have to do the struct
device * to struct devfreq * conversion every time (which requires
locking and walking the list).
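The idea can be sketched in plain userspace C so that it compiles
standalone. struct kobj_model and struct devfreq_kmodel are hypothetical
stand-ins for struct kobject and struct devfreq, and container_of() here is
a simplified copy of the kernel macro. This is an assumption about how an
embedded kobject could be laid out, not what the patch does today.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace version of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct kobject. */
struct kobj_model {
	const char *name;
};

/* Hypothetical devfreq layout with the kobject embedded in it. */
struct devfreq_kmodel {
	unsigned long previous_freq;
	struct kobj_model kobj;
};

/* Recover the enclosing object from its embedded kobject, no list walk. */
static struct devfreq_kmodel *to_devfreq(struct kobj_model *kobj)
{
	return container_of(kobj, struct devfreq_kmodel, kobj);
}

/* Shape of a show() callback that starts from the kobject. */
static unsigned long show_cur_freq(struct kobj_model *kobj)
{
	return to_devfreq(kobj)->previous_freq;
}
```

Because the attribute callbacks start from the embedded object, the lookup
is pointer arithmetic and needs neither devfreq_list_lock nor a walk over
devfreq_list.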
Regards,
Mike
> + store_polling_interval);
> +static struct attribute *dev_entries[] = {
> + &dev_attr_devfreq_governor.attr,
> + &dev_attr_devfreq_cur_freq.attr,
> + &dev_attr_devfreq_max_freq.attr,
> + &dev_attr_devfreq_min_freq.attr,
> + &dev_attr_devfreq_polling_interval.attr,
> + NULL,
> +};
> +static struct attribute_group dev_attr_group = {
> + .name = power_group_name,
> + .attrs = dev_entries,
> +};
> +
> /**
> * devfreq_init() - Initialize data structure for devfreq framework and
> * start polling registered devfreq devices.
> --
> 1.7.4.1
>
>
^ permalink raw reply
* Re: [PATCH v9 2/4] PM: Introduce devfreq: generic DVFS framework with device-specific OPPs
From: MyungJoo Ham @ 2011-09-01 4:51 UTC (permalink / raw)
To: Turquette, Mike
Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, linux-pm,
Thomas Gleixner
In-Reply-To: <CAJOA=zM+MOAd9cCo04EzsegvkbOHRykZpiyappPptpC95b18=w@mail.gmail.com>
On Thu, Sep 1, 2011 at 5:05 AM, Turquette, Mike <mturquette@ti.com> wrote:
> On Wed, Aug 31, 2011 at 12:29 AM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> [snip]
>> +/**
>> + * get_devfreq() - find devfreq struct. a wrapped find_device_devfreq()
>> + * with mutex protection. exported for governors
>> + * @dev: device pointer used to lookup device devfreq.
>> + */
>> +struct devfreq *get_devfreq(struct device *dev)
>> +{
>> + struct devfreq *ret;
>> +
>> + mutex_lock(&devfreq_list_lock);
>> + ret = find_device_devfreq(dev);
>> + mutex_unlock(&devfreq_list_lock);
>
> You prevent changes to the devfreq list while searching (good) but
> after returning the pointer there is no protection from that item
> being removed from the list. Generally "get" and "put" functions do
> more than just return a pointer: get functions often increment a
> refcount, or hold a lock. The put function will decrement the
> refcount or release the lock. Maybe you want something like the
> following:
>
> mutex_lock(&devfreq_list_lock);
> ret = find_device_devfreq(dev);
> mutex_lock(&devfreq->lock);
> mutex_unlock(&devfreq_list_lock);
>
> Then you need a corresponding put which does the mutex_unlock(&devfreq->lock).
>
> It looks like the only consumers of get_devfreq are the sysfs
> show/store interfaces, which immediately hold devfreq->lock, so the
> above proposal certainly fits the existing use case.
>
> Also CPUfreq's "cpufreq_get" function does a nice job of protecting
> the object from getting freed with a rw_semaphore. It is a bit more
> "complicated" but makes for good reading.
>
Thank you. The possibility that someone may do something to struct
devfreq after mutex_unlock(&devfreq_list_lock) and before
mutex_lock(&devfreq->lock) bothered me, and it appears that I need to
add devfreq_put() anyway. I hesitated to add it because users might
forget to call devfreq_put() after calling devfreq_get(); however,
that is no different from forgetting mutex_unlock() after
mutex_lock(), so I don't mind that much.
The next devfreq_get() will do
> mutex_lock(&devfreq_list_lock);
> ret = find_device_devfreq(dev);
> mutex_lock(&devfreq->lock);
> mutex_unlock(&devfreq_list_lock);
and devfreq_put() will do
> mutex_unlock(&devfreq->lock);
as you've suggested.
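
For illustration only, here is a minimal userspace sketch of that
get/put pairing. It uses pthread mutexes in place of the kernel mutex
API and a simplified singly linked list in place of list_head; the
struct layouts, the NULL check, and the use of ret->lock (the snippet
above writes devfreq->lock) are assumptions of the sketch, not the
planned patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel objects in this thread. */
struct device { int id; };

struct devfreq {
	struct device *dev;
	pthread_mutex_t lock;		/* per-device lock (devfreq->lock) */
	struct devfreq *next;		/* simplified list linkage */
};

static pthread_mutex_t devfreq_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct devfreq *devfreq_list;

/* Caller must hold devfreq_list_lock. */
static struct devfreq *find_device_devfreq(struct device *dev)
{
	struct devfreq *df;

	for (df = devfreq_list; df; df = df->next)
		if (df->dev == dev)
			return df;
	return NULL;
}

/* Returns the entry with its lock held, or NULL if dev is not registered. */
struct devfreq *devfreq_get(struct device *dev)
{
	struct devfreq *ret;

	pthread_mutex_lock(&devfreq_list_lock);
	ret = find_device_devfreq(dev);
	if (ret)
		pthread_mutex_lock(&ret->lock);	/* taken before the list lock is dropped */
	pthread_mutex_unlock(&devfreq_list_lock);
	return ret;
}

/* Releases the per-device lock taken by devfreq_get(). */
void devfreq_put(struct devfreq *devfreq)
{
	if (devfreq)
		pthread_mutex_unlock(&devfreq->lock);
}
```

Because the per-device lock is taken before the list lock is released,
there is no window in which the entry is found but unprotected.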
>> +static int devfreq_do(struct devfreq *devfreq)
>> +{
>> + struct opp *opp;
>> + unsigned long freq;
>> + int err;
>> +
>
> Put the mutex_is_locked check here? See more below.
>
[]
>> +static int devfreq_update(struct notifier_block *nb, unsigned long type,
>> + void *devp)
>> +{
>> + struct devfreq *devfreq = container_of(nb, struct devfreq, nb);
>> + int ret;
>> +
>> + mutex_lock(&devfreq->lock);
>> + ret = update_devfreq(devfreq);
>> + mutex_unlock(&devfreq->lock);
>
> The whole devfreq_update/update_devfreq pairing is redundant.
> update_devfreq's purpose is to make sure the lock is held before going
> further, and the only caller of update_devfreq is devfreq_update which
> always holds the lock.
>
> This still doesn't stop a bad driver writer from just calling
> devfreq_do with an extern. Perhaps the lock detection should be moved
> into devfreq_do and update_devfreq should go away?
>
- update_devfreq: extern for governors
- devfreq_update: notifier callback for OPP
- devfreq_do: internal function of devfreq.
Anyway, as you've mentioned, it seems I'd better rename devfreq_do as
update_devfreq and make it exported with mutex check.
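
A rough userspace sketch of what "exported with mutex check" could
look like. pthreads has no mutex_is_locked(), so the hypothetical
lock_is_held() helper approximates it with a trylock probe; the
target_freq field and the -EINVAL return value are also assumptions
made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

struct devfreq {
	pthread_mutex_t lock;
	unsigned long cur_freq;
	unsigned long target_freq;	/* hypothetical: frequency the governor asked for */
};

/* Userspace approximation of the kernel's mutex_is_locked(). */
static int lock_is_held(pthread_mutex_t *m)
{
	int rc = pthread_mutex_trylock(m);

	if (rc == 0) {
		/* We just acquired it, so nobody held it: undo and report. */
		pthread_mutex_unlock(m);
		return 0;
	}
	return rc == EBUSY;
}

/* Entry point for governors; the caller must already hold devfreq->lock. */
int update_devfreq(struct devfreq *devfreq)
{
	if (!lock_is_held(&devfreq->lock))
		return -EINVAL;

	devfreq->cur_freq = devfreq->target_freq;
	return 0;
}

/* Notifier path: takes the lock itself, then calls the same function. */
int devfreq_update(struct devfreq *devfreq)
{
	int ret;

	pthread_mutex_lock(&devfreq->lock);
	ret = update_devfreq(devfreq);
	pthread_mutex_unlock(&devfreq->lock);
	return ret;
}
```

As with mutex_is_locked(), the check only shows that somebody holds the
lock, not that the caller does, so it catches a forgotten lock but not
every misuse.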
[]
>> +static void devfreq_monitor(struct work_struct *work)
>> +{
>> + static unsigned long last_polled_at;
>> + struct devfreq *devfreq, *tmp;
>> + int error;
>> + unsigned long jiffies_passed;
>> + unsigned long next_jiffies = ULONG_MAX, now = jiffies;
>> +
>> + /* Initially last_polled_at = 0, polling every device at bootup */
>> + jiffies_passed = now - last_polled_at;
>> + last_polled_at = now;
>> + if (jiffies_passed == 0)
>> + jiffies_passed = 1;
>> +
>> + mutex_lock(&devfreq_list_lock);
>
> Should not lock the list here. If we lock the list for all major
> operations, it nullifies the performance benefit of having a mutex in
> struct devfreq.
>
OK, then how about locking like this?
mutex_lock(&devfreq_list_lock);
list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
mutex_lock(&devfreq->lock);
mutex_unlock(&devfreq_list_lock);
blahblah
mutex_unlock(&devfreq->lock);
mutex_lock(&devfreq_list_lock);
}
mutex_unlock(&devfreq_list_lock);
However, there is one more problem with dropping devfreq_list_lock
inside the loop.
list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) is safe
against the removal of devfreq itself from the list inside the loop.
However, it is not safe against the removal of the next entry (tmp):
while devfreq_list_lock is unlocked, devfreq_remove_device may remove
that one, thus breaking the list_for_each_entry_safe loop.
Such a break could be prevented by adding one more
mutex_lock/mutex_unlock pair on tmp->lock to the loop. However, if we
do mutex_unlock(&tmp->lock) before mutex_lock(&devfreq_list_lock), we
still have the same breaking-the-loop issue, and if we do it after
mutex_lock(&devfreq_list_lock), we have a deadlock issue (someone
might have locked devfreq_list_lock and be waiting to lock tmp->lock).
Thus, we will need to block devfreq_remove_device in devfreq_monitor
while devfreq_list_lock is unlocked in the loop. Other operations
(add / list) on the list are fine.
So, the loop will be:
mutex_lock(&devfreq_list_lock);
prohibit_devfreq_remove = true;
list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
mutex_lock(&devfreq->lock);
mutex_unlock(&devfreq_list_lock);
blahblah
mutex_unlock(&devfreq->lock);
mutex_lock(&devfreq_list_lock);
}
prohibit_devfreq_remove = false;
mutex_unlock(&devfreq_list_lock);
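
To make the idea concrete, a minimal userspace sketch of the loop with
the prohibit flag follows. It uses pthread mutexes and a simplified
singly linked list. The thread does not say whether a blocked removal
should wait or fail; returning -EBUSY (and -ENOENT for an unknown
entry) is an assumption of this sketch, as is the polled counter that
stands in for the per-device work.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct devfreq {
	pthread_mutex_t lock;
	int polled;			/* stand-in for the per-device work */
	struct devfreq *next;
};

static pthread_mutex_t devfreq_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct devfreq *devfreq_list;
static bool prohibit_devfreq_remove;	/* protected by devfreq_list_lock */

/* Assumption: removal is refused with -EBUSY while the monitor is walking. */
int devfreq_remove(struct devfreq *victim)
{
	struct devfreq **pp;
	int ret = -ENOENT;

	pthread_mutex_lock(&devfreq_list_lock);
	if (prohibit_devfreq_remove) {
		ret = -EBUSY;
	} else {
		for (pp = &devfreq_list; *pp; pp = &(*pp)->next) {
			if (*pp == victim) {
				*pp = victim->next;	/* unlink */
				ret = 0;
				break;
			}
		}
	}
	pthread_mutex_unlock(&devfreq_list_lock);
	return ret;
}

void devfreq_monitor(void)
{
	struct devfreq *df;

	pthread_mutex_lock(&devfreq_list_lock);
	prohibit_devfreq_remove = true;
	for (df = devfreq_list; df; df = df->next) {
		pthread_mutex_lock(&df->lock);
		pthread_mutex_unlock(&devfreq_list_lock);
		df->polled++;		/* per-device work, list lock not held */
		pthread_mutex_unlock(&df->lock);
		/* Re-take the list lock before df->next is read. */
		pthread_mutex_lock(&devfreq_list_lock);
	}
	prohibit_devfreq_remove = false;
	pthread_mutex_unlock(&devfreq_list_lock);
}
```

The next pointer is only read with devfreq_list_lock held, and no entry
can disappear while the flag is set, so other devices can still be
added or scaled while one device is being polled.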
[]
>> + /* Remove a devfreq with an error. */
>> + if (error && error != -EAGAIN) {
>> + dev_err(devfreq->dev, "Due to devfreq_do error(%d), devfreq(%s) is removed from the device\n",
>> + error, devfreq->governor->name);
>> +
>> + list_del(&devfreq->node);
>> + mutex_unlock(&devfreq->lock);
>> + kfree(devfreq);
>> + continue;
>
> Should this error handling also unregister the OPP notifier? This
> code duplicates portions of devfreq_remove_device. I propose instead
> here we do the following:
>
> mutex_unlock(&devfreq->lock);
> _devfreq_remove_lock(devfreq);
> /* this locks the list first,
> * then locks devfreq->lock,
> * then does the house cleaning
> */
> continue;
Ah.. that was missing. Thanks!
I'll make a _devfreq_remove_device(struct device *dev, bool
call_by_monitor) and let devfreq_remove_device use it.
>
>> + }
>> + devfreq->next_polling = devfreq->polling_jiffies;
>> +
>> + /* No more polling required (polling_ms changed) */
>> + if (devfreq->next_polling == 0) {
>> + mutex_unlock(&devfreq->lock);
>> + continue;
>> + }
>> + } else {
>> + devfreq->next_polling -= jiffies_passed;
>> + }
>> +
>> + next_jiffies = (next_jiffies > devfreq->next_polling) ?
>> + devfreq->next_polling : next_jiffies;
>> +
>> + mutex_unlock(&devfreq->lock);
>> + }
>> +
>> + if (next_jiffies > 0 && next_jiffies < ULONG_MAX) {
>> + polling = true;
>> + queue_delayed_work(devfreq_wq, &devfreq_work, next_jiffies);
>> + } else {
>> + polling = false;
>> + }
>> +
>> + mutex_unlock(&devfreq_list_lock);
>
> Again, list should not be locked over this whole function. It blocks
> other unrelated devfreq devices from scaling.
>
>> +}
>
> [snip]
>
>> +/**
>> + * devfreq_remove_device() - Remove devfreq feature from a device.
>> + * @device: the device to remove devfreq feature.
>> + */
>> +int devfreq_remove_device(struct device *dev)
>
> Why does this take a struct device*? Shouldn't it be a struct devfreq*?
It is because devfreq_add_device() does not return struct devfreq.
struct devfreq is not visible to device drivers. It is visible to
governors.
>
> If there is a case for removing a devfreq device with only struct
> device* as input, how about:
>
> int devfreq_remove_device(struct device *dev)
> {
> struct devfreq *devfreq;
> mutex_lock(&devfreq_list_lock);
> devfreq = find_device_devfreq(dev);
> mutex_unlock(&devfreq_list_lock);
>
> return _devfreq_remove_device(devfreq);
> /* _devfreq_remove_device does the real work and can also be
> called from devfreq_monitor */
> }
Sure, I'll do so and reduce the redundancy.
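
A rough userspace sketch of that split, using the
_devfreq_remove_device(dev, call_by_monitor) signature mentioned
earlier in this mail. Reading call_by_monitor as "the caller already
holds devfreq_list_lock" is an assumption, as are the simplified list,
the -EINVAL return value, and the use of free() in place of kfree().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct device { int id; };

struct devfreq {
	struct device *dev;
	struct devfreq *next;		/* simplified list linkage */
};

static pthread_mutex_t devfreq_list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct devfreq *devfreq_list;

/*
 * Does the real work. call_by_monitor is assumed to mean that the
 * caller (the monitor) already holds devfreq_list_lock.
 */
static int _devfreq_remove_device(struct device *dev, bool call_by_monitor)
{
	struct devfreq **pp, *df = NULL;

	if (!call_by_monitor)
		pthread_mutex_lock(&devfreq_list_lock);

	for (pp = &devfreq_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->dev == dev) {
			df = *pp;
			*pp = df->next;	/* unlink from the list */
			break;
		}
	}

	if (!call_by_monitor)
		pthread_mutex_unlock(&devfreq_list_lock);

	if (!df)
		return -EINVAL;

	/* The kernel version would also unregister the OPP notifier here. */
	free(df);
	return 0;
}

/* Public entry point for device drivers, which only know struct device. */
int devfreq_remove_device(struct device *dev)
{
	return _devfreq_remove_device(dev, false);
}
```

Doing the lookup and the unlink under the same hold of
devfreq_list_lock avoids the window between finding the entry and
removing it.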
>
> Regards,
> Mike
>
[]
Thank you so much!
Cheers,
MyungJoo
--
MyungJoo Ham (함명주), Ph.D.
Mobile Software Platform Lab,
Digital Media and Communications (DMC) Business
Samsung Electronics
cell: 82-10-6714-2858
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm
^ permalink raw reply