Linux Power Management development
 help / color / mirror / Atom feed
* Re: [PATCH v2] PM: add statistics debugfs file for suspend to ram
From: Rafael J. Wysocki @ 2011-08-08 20:46 UTC (permalink / raw)
  To: Liu, ShuoX
  Cc: Brown, Len, linux-pm@lists.linux-foundation.org, Greg KH,
	linux-kernel@vger.kernel.org
In-Reply-To: <6E3BC7F7C9A4BF4286DD4C043110F30B5B8057C04B@shsmsx502.ccr.corp.intel.com>

Hi,

On Monday, August 08, 2011, Liu, ShuoX wrote:
...
> @@ -982,6 +1003,8 @@ int dpm_suspend(pm_message_t state)
>  		error = async_error;
>  	if (!error)
>  		dpm_show_time(starttime, state, NULL);
> +	else
> +		suspend_stats.failed_suspend++;
>  	return error;
>  }

I'd prefer it to be

  if (error)
      something;
  else
      something else;

> @@ -1090,6 +1113,8 @@ int dpm_suspend_start(pm_message_t state)
>  	error = dpm_prepare(state);
>  	if (!error)
>  		error = dpm_suspend(state);
> +	else
> +		suspend_stats.failed_prepare++;
>  	return error;
>  }

Same here.

>  EXPORT_SYMBOL_GPL(dpm_suspend_start);
> diff --git a/include/linux/suspend.h b/include/linux/suspend.h
> index 6bbcef2..6a8ff23 100644
> --- a/include/linux/suspend.h
> +++ b/include/linux/suspend.h
> @@ -34,6 +34,22 @@ typedef int __bitwise suspend_state_t;
>  #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
>  #define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
>  
> +struct suspend_stats {
> +	int	success;
> +	int	fail;
> +	int	failed_freeze;
> +	int	failed_prepare;
> +	int	failed_suspend;
> +	int	failed_suspend_noirq;
> +	int	failed_resume;
> +	int	failed_resume_noirq;
> +#define	REC_FAILED_DEV_NUM	2
> +	char	failed_devs[REC_FAILED_DEV_NUM][40];
> +	int	last_failed;
> +};
> +
> +extern struct suspend_stats suspend_stats;
> +
>  /**
>   * struct platform_suspend_ops - Callbacks for managing platform dependent
>   *	system sleep states.
> diff --git a/kernel/power/main.c b/kernel/power/main.c
> index 6c601f8..0977f5d 100644
> --- a/kernel/power/main.c
> +++ b/kernel/power/main.c
> @@ -12,6 +12,8 @@
>  #include <linux/string.h>
>  #include <linux/resume-trace.h>
>  #include <linux/workqueue.h>
> +#include <linux/debugfs.h>
> +#include <linux/seq_file.h>
>  
>  #include "power.h"
>  
> @@ -133,6 +135,59 @@ power_attr(pm_test);
>  
>  #endif /* CONFIG_PM_SLEEP */
>  
> +#ifdef CONFIG_DEBUG_FS
> +static int suspend_stats_show(struct seq_file *s, void *unused)
> +{
> +	int i, index, last_index;
> +
> +	last_index = suspend_stats.last_failed + REC_FAILED_DEV_NUM - 1;
> +	last_index %= REC_FAILED_DEV_NUM;
> +	seq_printf(s, "%s: %d\n%s: %d\n%s: %d\n%s: %d\n"
> +			"%s: %d\n%s: %d\n%s: %d\n%s: %d\n",
> +			"success", suspend_stats.success,
> +			"fail", suspend_stats.fail,
> +			"failed_freeze", suspend_stats.failed_freeze,
> +			"failed_prepare", suspend_stats.failed_prepare,
> +			"failed_suspend", suspend_stats.failed_suspend,
> +			"failed_suspend_noirq",
> +				suspend_stats.failed_suspend_noirq,
> +			"failed_resume", suspend_stats.failed_resume,
> +			"failed_resume_noirq",
> +				suspend_stats.failed_resume_noirq);
> +	seq_printf(s,	"failed_devs:\n  last_failed:\t%s\n",
> +			suspend_stats.failed_devs[last_index]);
> +	for (i = 1; i < REC_FAILED_DEV_NUM; i++) {
> +		index = last_index + REC_FAILED_DEV_NUM - i;
> +		index %= REC_FAILED_DEV_NUM;
> +		seq_printf(s, "\t\t%s\n",
> +			suspend_stats.failed_devs[index]);
> +	}
> +
> +	return 0;
> +}
> +
> +static int suspend_stats_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, suspend_stats_show, NULL);
> +}
> +
> +static const struct file_operations suspend_stats_operations = {
> +	.open           = suspend_stats_open,
> +	.read           = seq_read,
> +	.llseek         = seq_lseek,
> +	.release        = single_release,
> +};
> +
> +static int __init pm_debugfs_init(void)
> +{
> +	debugfs_create_file("suspend_stats", S_IFREG | S_IRUGO,
> +			NULL, NULL, &suspend_stats_operations);
> +	return 0;
> +}
> +
> +late_initcall(pm_debugfs_init);
> +#endif /* CONFIG_DEBUG_FS */
> +
>  struct kobject *power_kobj;
>  
>  /**
> @@ -194,6 +249,10 @@ static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
>  	}
>  	if (state < PM_SUSPEND_MAX && *s)
>  		error = enter_state(state);
> +		if (error)
> +			suspend_stats.fail++;
> +		else
> +			suspend_stats.success++;
>  #endif
>  
>   Exit:
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index b6b71ad..9bb4281 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -106,6 +106,8 @@ static int suspend_prepare(void)
>  	error = suspend_freeze_processes();
>  	if (!error)
>  		return 0;
> +	else
> +		suspend_stats.failed_freeze++;
>  

And same here.

Apart from the above, I don't have complaints.

Thanks,
Rafael

^ permalink raw reply

* Re: [PATCH] PM: add statistics sysfs file for suspend to ram
From: Rafael J. Wysocki @ 2011-08-08 20:37 UTC (permalink / raw)
  To: yanmin_zhang, Greg KH
  Cc: Liu, ShuoX, linux-pm@lists.linux-foundation.org, Brown, Len
In-Reply-To: <1312775304.2043.37.camel@ymzhang>

On Monday, August 08, 2011, Yanmin Zhang wrote:
> On Fri, 2011-08-05 at 21:20 +0200, Rafael J. Wysocki wrote:
> > On Friday, August 05, 2011, Yanmin Zhang wrote:
> > > On Fri, 2011-08-05 at 09:57 +0800, Liu, ShuoX wrote:
> > > > > On Thursday, August 04, 2011, Greg KH wrote:
> > > > > > On Thu, Aug 04, 2011 at 01:13:38PM +0800, Liu, ShuoX wrote:
> > > > > > > >From a906b0b5b4ff3414ceb9fc7a69a3d7c9d66e46b1 Mon Sep 17
> > > > > 00:00:00 2001
> > > > > > > From: ShuoX Liu <shuox.liu@intel.com>
> > > > > > > Date: Thu, 28 Jul 2011 10:54:22 +0800
> > > > > > > Subject: [PATCH] PM: add statistics sysfs file for suspend to ram.
> > > > > >
> > > > > > What's this stuff here for?  That's not needed (hint, I would have to
> > > > > > edit it out by hand to be able to apply this patch.)
> > > > > >
> > > > > > Thanks for resending a version that passes checkpatch.pl and could be
> > > > > > applied, but all of my previous comments still stand.  This patch, as
> > > > > > is, is totally unacceptable.
> > > > > 
> > > > > Agreed, plus I'd like to know the motivation behind it.  That is, we have
> > > > > quite a few debug facilities in that code, so why are they insufficient?
> > > Thanks Greg, Rafael. We are changing the patch based on your comments.
> > > 
> > > > 
> > > > Some explanation from Yanmin,
> > > > "We are enabling power features on Medfield. Some testers and developers
> > > > complain they don't know if system tries suspend-2-ram, and what device 
> > > > fails to suspend. They need such info for a quick check. If turning on 
> > > > CONFIG_PM_DEBUG, we get too much info and testers need recompile 
> > > > the system.
> > > Comparing with PC/notebook, a mobile enters/exits suspend-2-ram (we call it s3 on
> > > Medfield) far more frequently. If it can't enter suspend-2-ram in time, the power
> > > might be used up soon.
> > > 
> > > We often find sometimes, a device suspend fails. Then, system retries s3 over
> > > and over again. As display is off, testers/developers don't know what happens.
> > > Teh system 
> > > 
> > > With the patch, we could know what the bad device is.
> > > 
> > > The patch doesn't hurt performance as it's just statistics collector.
> > > 
> > > CONFIG_PM_DEBUG is very useful for finer investigation about what happens behind. What
> > > we provide by the patch is to analyze the issues quickly, even by an ordinary tester.
> > 
> > Well, what about using dynamic debug?
> Thanks for the nice pointer. I checked dynamic debug. It's really a good debug tool.
> With the dynamic debug:
> 1) user need write a user space parser to process the syslog output;
> 2) Our testing scenario is we leave the mobile for at least hours. Then, check its status.
> No serial console available during the testing. One is because console would be suspended,
> and the other is serial console connecting with spi or HSU devices would consume power. These
> devices are powered off at suspend-2-ram.
> 
> Below is an example output of the statistics from my mobile (we are changing the output
> from sysfs to debugfs now):
> #adb shell cat /sys/power/suspend_stats
> success: 5
> fail: 1
> failed_freeze: 0
> failed_prepare: 0
> failed_suspend: 1
> failed_suspend_noirq: 0
> failed_resume: 0
> failed_resume_noirq: 0
> failed_devs:
>   last_failed:	alarm

OK, I see.  Greg, what do you think?

Rafael

^ permalink raw reply

* Re: [PATCH v4 0/3] DEVFREQ, DVFS framework for non-CPU devices
From: Turquette, Mike @ 2011-08-08 19:13 UTC (permalink / raw)
  To: MyungJoo Ham
  Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner,
	linux-pm
In-Reply-To: <CAJ0PZbQx-T+36eX7MiZqWSMxD-achZ9Z7G_kDWNQ2n0h1NTWSQ@mail.gmail.com>

On Thu, Aug 4, 2011 at 11:18 PM, MyungJoo Ham <myungjoo.ham@gmail.com> wrote:
> On Fri, Aug 5, 2011 at 6:59 AM, Turquette, Mike <mturquette@ti.com> wrote:
>> On Thu, Aug 4, 2011 at 1:15 AM, MyungJoo Ham <myungjoo.ham@gmail.com> wrote:
>>>
>>> Ok.. I see.
>>>
>>> Now, I can agree with you that tickle is subset of QoS request.
>>>
>>> As long as we have QoS request feature on devices with either OPP or
>>> DEVFREQ, tickling is not needed.
>>>
>>> I'll remove tickle in the next revision (along with some bugfixes for
>>> bugs found recently).
>>>
>>>
>>> Anyway, it appears that clock-rate-wise QoS request may be dealt at
>>> OPP so that the OPPs meeting the QoS requests w/ frequency or voltage
>>> specifications are enabled and returned with opp_find_* functions.
>>> Maybe we will need to separate enable/disable by
>>> opp_enable()/opp_disable() from enable/disable by QoS requests so that
>>> the two may have different semantics. Then, QoS requests cannot
>>> override opp_disable and opp_enable cannot override QoS requests. This
>>> way, any clock-setting code properly based on OPP (including any
>>> customized devfreq governors) cannot violate QoS requests.
>>
>> If devfreq used the QoS API in it's ->target() call then this would
>> not be an issue, and further illustrates the idea of devfreq simply
>> being a policy into a proper QoS API.
>>
>> Regards,
>> Mike
>>
>
> Yes, if we choose an OPP that meets the PM-QoS requests before calling
> ->target() in devfreq_do(), there wouldn't be an issue.
>
> However, if a device is using DEVFREQ, it also means the device has
> OPPs (mandatory for DEVFREQ). If the device is using PM QoS as well as
> OPP, I guess the correctly implemented device driver needs to call
> opp_enable() and opp_disable() according to PM QoS's update_target()
> calls through the PM QoS notifier cb.
>
> Then, for such drivers, DEVFREQ automatically meets PM QoS requests
> without any modification; as long as QoS meeting OPPs are enabled at
> the device driver's PM QoS callback, there is no QoS issue.
>
> Therefore, now, it appears that neither OPP or DEVFREQ should be
> allowed to directly touch PM QoS APIs, but, the device driver should
> do so with notifier by simply calling opp-enable/disable if the
> frequency is the key QoS "actuator".
>
> If we are going to let DEVFREQ handle its corresponding devices' PM
> QoS APIs, it would mean that both device driver and its DEVFREQ codes
> will be handling PM QoS API duplicated (or worse, inconsistently).

I'm not getting too bogged down with the OPP specifics because I don't
know if that interface is going to be used in the future, and I don't
think that devfreq will need to know about OPPs once the DVFS QoS API
exists.  In that case, devfreq can just requests clock frequencies
through the QoS API.  I view devfreq's usage of the OPP library as
temporary.

The real issue here is that we don't want some weird feedback loop
with device QoS requests and devfreq targets stepping on each other.

One way to handle this is to partition QoS use in drivers away from
devfreq usage.  For example, if a GPU supports performance counters
and can introspect its own usage, then it is a perfect candidate for
devfreq; on the flip-side device drivers should *not* be allowed to
put performance/qos constraints on this particular GPU, since we
assume that the performance counters/devfreq governor will handle the
whole job for us.  Since this centralizes the decision-making for the
GPU it is safe for the devfreq->target() call to use the QoS APIs for
controlling the GPU, since no one else will.  This avoids the feedback
loop.

On the other hand, if the GPU does not support performance counters
then it should not use devfreq at all and rely 100% of QoS constraints
from various sources: the GPU driver might request a high OPP every
time work comes in, coupled with a timeout; if a QoS knob is exported
to userspace then some OpenGL library might hold constraints through
it; or some other kernel driver (video-related?) needs to use the GPU
then it can hold a constraint through the QoS API.

So there is a clear partition of QoS API usage between devices that
support performance counters and ones that do not.  We want to avoid a
feedback loop here.

Regards,
Mike

> Cheers,
> MyungJoo
>
>>> How about this concept of getting QoS requests associated with clock rates?
>>>
>>>
>>>
>>> Cheers!
>>> MyungJoo.
>>> --
>>> MyungJoo Ham, Ph.D.
>>> Mobile Software Platform Lab,
>>> Digital Media and Communications (DMC) Business
>>> Samsung Electronics
>>> cell: 82-10-6714-2858
>>> _______________________________________________
>>> linux-pm mailing list
>>> linux-pm@lists.linux-foundation.org
>>> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>>>
>>
>
>
>
> --
> MyungJoo Ham, Ph.D.
> Mobile Software Platform Lab,
> Digital Media and Communications (DMC) Business
> Samsung Electronics
> cell: 82-10-6714-2858
>

^ permalink raw reply

* Re: [PATCH v5 0/3] DEVFREQ, DVFS Framework for Non-CPU Devices.
From: MyungJoo Ham @ 2011-08-08  9:53 UTC (permalink / raw)
  To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1312794188-9823-1-git-send-email-myungjoo.ham@samsung.com>

Hello all,


I've got one issue requesting some comments: whether to create devfreq
directory or not.

For now, we only have four basic governors and one not-yet-posted
devfreq user for DEVFREQ; thus, it seems not necessary for just now.

However, like CPUFREQ, DEVFREQ may have several users later and we may
want to gather these files to one location: /drivers/devfreq or
somewhere else in /drivers/*. For example, Exynos4210 Memory DEVFREQ
user is located at arch/arm/mach-exynos4/devfreq_bus.c and with a
separated directory, we may have it at
drivers/devfreq/exynos4_memory_bus.c.

How do you think about separating DEVFREQ directory? Should I do so or
wait until we have several DEVFREQ users?


Cheers!
MyungJoo
-- 
MyungJoo Ham, Ph.D.
Mobile Software Platform Lab,
Digital Media and Communications (DMC) Business
Samsung Electronics
cell: 82-10-6714-2858

^ permalink raw reply

* [PATCH v5 3/3] PM / DEVFREQ: add sysfs interface
From: MyungJoo Ham @ 2011-08-08  9:03 UTC (permalink / raw)
  To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1312794188-9823-1-git-send-email-myungjoo.ham@samsung.com>

Device specific sysfs interface /sys/devices/.../power/devfreq_*
- governor	R: name of governor
- cur_freq	R: current frequency
- max_freq	R: maximum operable frequency
- min_freq	R: minimum operable frequency
- set_freq	R: read user specified frequency (0 if not specified yet)
		W: set user specified frequency
- polling_interval	R: polling interval in ms given with devfreq profile

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>

--
Changed from v4
- removed system-wide sysfs interface
- removed tickling sysfs interface
- added set_freq for userspace governor (and any other governors that
  require user input)

Changed from v3
- corrected sysfs API usage
- corrected error messages
- moved sysfs entry location
- added sysfs entries

Changed from v2
- add ABI entries for devfreq sysfs interface
---
 Documentation/ABI/testing/sysfs-devices-power |   45 +++++++
 drivers/base/power/devfreq.c                  |  152 +++++++++++++++++++++++++
 2 files changed, 197 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-devices-power b/Documentation/ABI/testing/sysfs-devices-power
index 8ffbc25..6e77604 100644
--- a/Documentation/ABI/testing/sysfs-devices-power
+++ b/Documentation/ABI/testing/sysfs-devices-power
@@ -165,3 +165,48 @@ Description:
 
 		Not all drivers support this attribute.  If it isn't supported,
 		attempts to read or write it will yield I/O errors.
+
+What:		/sys/devices/.../power/devfreq_governor
+Date:		July 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_governor shows the name
+		of the governor used by the corresponding device.
+
+What:		/sys/devices/.../power/devfreq_cur_freq
+Date:		July 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_cur_freq shows the current
+		frequency of the corresponding device.
+
+What:		/sys/devices/.../power/devfreq_max_freq
+Date:		July 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_cur_freq shows the
+		maximum operable frequency of the corresponding device.
+
+What:		/sys/devices/.../power/devfreq_min_freq
+Date:		July 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_cur_freq shows the
+		minimum operable frequency of the corresponding device.
+
+What:		/sys/devices/.../power/devfreq_set_freq
+Date:		August 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_set_freq sets and shows
+		the user specified desired frequency of the device. The
+		governor may and may not use the value. With the basic
+		governors given with devfreq.c, userspace governor is
+		using the value.
+
+What:		/sys/devices/.../power/devfreq_polling_interval
+Date:		July 2011
+Contact:	MyungJoo Ham <myungjoo.ham@samsung.com>
+Description:
+		The /sys/devices/.../power/devfreq_polling_interval shows the
+		requested polling interval of the corresponding device.
diff --git a/drivers/base/power/devfreq.c b/drivers/base/power/devfreq.c
index c53bca9..4f964b7 100644
--- a/drivers/base/power/devfreq.c
+++ b/drivers/base/power/devfreq.c
@@ -42,6 +42,8 @@ static struct delayed_work devfreq_work;
 static LIST_HEAD(devfreq_list);
 static DEFINE_MUTEX(devfreq_list_lock);
 
+static struct attribute_group dev_attr_group;
+
 /**
  * find_device_devfreq() - find devfreq struct using device pointer
  * @dev:	device pointer used to lookup device devfreq.
@@ -138,6 +140,8 @@ static void devfreq_monitor(struct work_struct *work)
 					"devfreq is removed from the device\n",
 					error);
 
+				sysfs_unmerge_group(&devfreq->dev->kobj,
+						    &dev_attr_group);
 				list_del(&devfreq->node);
 				kfree(devfreq);
 
@@ -218,6 +222,8 @@ int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
 		queue_delayed_work(devfreq_wq, &devfreq_work,
 				   msecs_to_jiffies(devfreq_interval));
 	}
+
+	sysfs_merge_group(&dev->kobj, &dev_attr_group);
 out:
 	mutex_unlock(&devfreq_list_lock);
 
@@ -244,6 +250,8 @@ int devfreq_remove_device(struct device *dev)
 		return -EINVAL;
 	}
 
+	sysfs_unmerge_group(&dev->kobj, &dev_attr_group);
+
 	list_del(&devfreq->node);
 
 	kfree(devfreq);
@@ -278,6 +286,150 @@ out:
 	return err;
 }
 
+static ssize_t show_governor(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct devfreq *df = find_device_devfreq(dev);
+
+	if (IS_ERR(df))
+		return PTR_ERR(df);
+	if (!df->governor)
+		return -EINVAL;
+
+	return sprintf(buf, "%s\n", df->governor->name);
+}
+
+static ssize_t show_freq(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct devfreq *df = find_device_devfreq(dev);
+
+	if (IS_ERR(df))
+		return PTR_ERR(df);
+
+	return sprintf(buf, "%lu\n", df->previous_freq);
+}
+
+static ssize_t show_max_freq(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct devfreq *df = find_device_devfreq(dev);
+	unsigned long freq = ULONG_MAX;
+	struct opp *opp;
+
+	if (IS_ERR(df))
+		return PTR_ERR(df);
+	if (!df->dev)
+		return -EINVAL;
+
+	opp = opp_find_freq_floor(df->dev, &freq);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+
+	return sprintf(buf, "%lu\n", freq);
+}
+
+static ssize_t show_min_freq(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct devfreq *df = find_device_devfreq(dev);
+	unsigned long freq = 0;
+	struct opp *opp;
+
+	if (IS_ERR(df))
+		return PTR_ERR(df);
+	if (!df->dev)
+		return -EINVAL;
+
+	opp = opp_find_freq_ceil(df->dev, &freq);
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+
+	return sprintf(buf, "%lu\n", freq);
+}
+
+static ssize_t show_polling_interval(struct device *dev,
+				     struct device_attribute *attr, char *buf)
+{
+	struct devfreq *df = find_device_devfreq(dev);
+
+	if (IS_ERR(df))
+		return PTR_ERR(df);
+	if (!df->profile)
+		return -EINVAL;
+
+	return sprintf(buf, "%d\n", df->profile->polling_ms);
+}
+
+static ssize_t set_user_frequency(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	struct devfreq *devfreq;
+	unsigned long wanted;
+	int err = 0;
+
+	sscanf(buf, "%lu", &wanted);
+
+	mutex_lock(&devfreq_list_lock);
+	devfreq = find_device_devfreq(dev);
+
+	if (IS_ERR(devfreq)) {
+		err = PTR_ERR(devfreq);
+		goto out;
+	}
+
+	devfreq->user_set_freq = wanted;
+	err = count;
+out:
+	mutex_unlock(&devfreq_list_lock);
+	if (err >= 0)
+		devfreq_update(dev);
+	return err;
+}
+
+static ssize_t show_user_frequency(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	struct devfreq *devfreq;
+	int err = 0;
+
+	mutex_lock(&devfreq_list_lock);
+	devfreq = find_device_devfreq(dev);
+
+	if (IS_ERR(devfreq)) {
+		err = PTR_ERR(devfreq);
+		goto out;
+	}
+
+	err = sprintf(buf, "%lu\n", devfreq->user_set_freq);
+out:
+	mutex_unlock(&devfreq_list_lock);
+	return err;
+
+}
+
+static DEVICE_ATTR(devfreq_governor, 0444, show_governor, NULL);
+static DEVICE_ATTR(devfreq_cur_freq, 0444, show_freq, NULL);
+static DEVICE_ATTR(devfreq_max_freq, 0444, show_max_freq, NULL);
+static DEVICE_ATTR(devfreq_min_freq, 0444, show_min_freq, NULL);
+static DEVICE_ATTR(devfreq_polling_interval, 0444, show_polling_interval, NULL);
+static DEVICE_ATTR(devfreq_set_freq, 0644, show_user_frequency,
+		   set_user_frequency);
+static struct attribute *dev_entries[] = {
+	&dev_attr_devfreq_governor.attr,
+	&dev_attr_devfreq_cur_freq.attr,
+	&dev_attr_devfreq_max_freq.attr,
+	&dev_attr_devfreq_min_freq.attr,
+	&dev_attr_devfreq_polling_interval.attr,
+	&dev_attr_devfreq_set_freq.attr,
+	NULL,
+};
+static struct attribute_group dev_attr_group = {
+	.name	= power_group_name,
+	.attrs	= dev_entries,
+};
+
 /**
  * devfreq_init() - Initialize data structure for devfreq framework and
  *		  start polling registered devfreq devices.
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH v5 2/3] PM / DEVFREQ: add basic governors
From: MyungJoo Ham @ 2011-08-08  9:03 UTC (permalink / raw)
  To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1312794188-9823-1-git-send-email-myungjoo.ham@samsung.com>

Four CPUFREQ-like governors are provided as examples.

powersave: use the lowest frequency possible. The user (device) should
set the polling_ms as 0 because polling is useless for this governor.

performance: use the highest freqeuncy possible. The user (device)
should set the polling_ms as 0 because polling is useless for this
governor.

userspace: use the user specified frequency stored at
devfreq.user_set_freq. With sysfs support in the following patch, a user
may set the value with the sysfs interface.

simple_ondemand: simplified version of CPUFREQ's ONDEMAND governor.

When a user updates OPP entries (enable/disable/add), OPP framework
automatically notifies DEVFREQ to update operating frequency
accordingly. Thus, DEVFREQ users (device drivers) do not need to update
DEVFREQ manually with OPP entry updates or set polling_ms for powersave
, performance, userspace, or any other "static" governors.

Note that these are given only as basic examples for governors and any
devices with DEVFREQ may implement their own governors with the drivers
and use them.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>

---
Changes from v4:
- Added userspace governor

Changes from v3:
- Bugfixes on simple-ondemand governor (divide by zero / overflow)
- Style fixes
- Give names to governors
---
 drivers/base/power/devfreq.c |  100 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/devfreq.h      |    8 +++
 2 files changed, 108 insertions(+), 0 deletions(-)

diff --git a/drivers/base/power/devfreq.c b/drivers/base/power/devfreq.c
index 6f4bd3a..c53bca9 100644
--- a/drivers/base/power/devfreq.c
+++ b/drivers/base/power/devfreq.c
@@ -301,3 +301,103 @@ static int __init devfreq_init(void)
 	return 0;
 }
 late_initcall(devfreq_init);
+
+static int devfreq_powersave_func(struct devfreq *df,
+				  unsigned long *freq)
+{
+	*freq = 0; /* devfreq_do will run "ceiling" to 0 */
+	return 0;
+}
+
+struct devfreq_governor devfreq_powersave = {
+	.name = "powersave",
+	.get_target_freq = devfreq_powersave_func,
+};
+
+static int devfreq_performance_func(struct devfreq *df,
+				    unsigned long *freq)
+{
+	*freq = UINT_MAX; /* devfreq_do will run "floor" */
+	return 0;
+}
+
+struct devfreq_governor devfreq_performance = {
+	.name = "performance",
+	.get_target_freq = devfreq_performance_func,
+};
+
+static int devfreq_userspace_func(struct devfreq *df, unsigned long *freq)
+{
+	if (df->user_set_freq == 0)
+		*freq = df->previous_freq; /* No user freq specified yet */
+	else
+		*freq = df->user_set_freq;
+
+	return 0;
+}
+
+struct devfreq_governor devfreq_userspace = {
+	.name = "userspace",
+	.get_target_freq = devfreq_userspace_func,
+};
+
+/* Constants for DevFreq-Simple-Ondemand (DFSO) */
+#define DFSO_UPTHRESHOLD	(90)
+#define DFSO_DOWNDIFFERENCTIAL	(5)
+static int devfreq_simple_ondemand_func(struct devfreq *df,
+					unsigned long *freq)
+{
+	struct devfreq_dev_status stat;
+	int err = df->profile->get_dev_status(df->dev, &stat);
+	unsigned long long a, b;
+
+	if (err)
+		return err;
+
+	/* Assume MAX if it is going to be divided by zero */
+	if (stat.total_time == 0) {
+		*freq = UINT_MAX;
+		return 0;
+	}
+
+	/* Prevent overflow */
+	if (stat.busy_time >= (1 << 24) || stat.total_time >= (1 << 24)) {
+		stat.busy_time >>= 7;
+		stat.total_time >>= 7;
+	}
+
+	/* Set MAX if it's busy enough */
+	if (stat.busy_time * 100 >
+	    stat.total_time * DFSO_UPTHRESHOLD) {
+		*freq = UINT_MAX;
+		return 0;
+	}
+
+	/* Set MAX if we do not know the initial frequency */
+	if (stat.current_frequency == 0) {
+		*freq = UINT_MAX;
+		return 0;
+	}
+
+	/* Keep the current frequency */
+	if (stat.busy_time * 100 >
+	    stat.total_time * (DFSO_UPTHRESHOLD - DFSO_DOWNDIFFERENCTIAL)) {
+		*freq = stat.current_frequency;
+		return 0;
+	}
+
+	/* Set the desired frequency based on the load */
+	a = stat.busy_time;
+	a *= stat.current_frequency;
+	b = div_u64(a, stat.total_time);
+	b *= 100;
+	b = div_u64(b, (DFSO_UPTHRESHOLD - DFSO_DOWNDIFFERENCTIAL / 2));
+	*freq = (unsigned long) b;
+
+	return 0;
+}
+
+struct devfreq_governor devfreq_simple_ondemand = {
+	.name = "simple_ondemand",
+	.get_target_freq = devfreq_simple_ondemand_func,
+};
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
index 6ec630b..7131d2a 100644
--- a/include/linux/devfreq.h
+++ b/include/linux/devfreq.h
@@ -57,6 +57,8 @@ struct devfreq_governor {
  * @next_polling	the number of remaining "devfreq_monitor" executions to
  *			reevaluate frequency/voltage of the device. Set by
  *			profile's polling_ms interval.
+ * @user_set_freq	User specified adequete frequency value (thru sysfs
+ *		interface). Governors may and may not use this value.
  * @data	Private data of the governor. The devfreq framework does not
  *		touch this.
  *
@@ -72,6 +74,7 @@ struct devfreq {
 	unsigned long previous_freq;
 	unsigned int next_polling;
 
+	unsigned long user_set_freq; /* governors may ignore this. */
 	void *data; /* private data for governors */
 };
 
@@ -81,6 +84,11 @@ extern int devfreq_add_device(struct device *dev,
 			   struct devfreq_governor *governor);
 extern int devfreq_remove_device(struct device *dev);
 extern int devfreq_update(struct device *dev);
+
+extern struct devfreq_governor devfreq_powersave;
+extern struct devfreq_governor devfreq_performance;
+extern struct devfreq_governor devfreq_userspace;
+extern struct devfreq_governor devfreq_simple_ondemand;
 #else /* !CONFIG_PM_DEVFREQ */
 static int devfreq_add_device(struct device *dev,
 			   struct devfreq_dev_profile *profile,
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH v5 1/3] PM: Introduce DEVFREQ: generic DVFS framework with device-specific OPPs
From: MyungJoo Ham @ 2011-08-08  9:03 UTC (permalink / raw)
  To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner
In-Reply-To: <1312794188-9823-1-git-send-email-myungjoo.ham@samsung.com>

With OPPs, a device may have multiple operable frequency and voltage
sets. However, there can be multiple possible operable sets and a system
will need to choose one from them. In order to reduce the power
consumption (by reducing frequency and voltage) without affecting the
performance too much, a Dynamic Voltage and Frequency Scaling (DVFS)
scheme may be used.

This patch introduces the DVFS capability to non-CPU devices with OPPs.
DVFS is a techique whereby the frequency and supplied voltage of a
device is adjusted on-the-fly. DVFS usually sets the frequency as low
as possible with given conditions (such as QoS assurance) and adjusts
voltage according to the chosen frequency in order to reduce power
consumption and heat dissipation.

The generic DVFS for devices, DEVFREQ, may appear quite similar with
/drivers/cpufreq.  However, CPUFREQ does not allow to have multiple
devices registered and is not suitable to have multiple heterogenous
devices with different (but simple) governors.

Normally, DVFS mechanism controls frequency based on the demand for
the device, and then, chooses voltage based on the chosen frequency.
DEVFREQ also controls the frequency based on the governor's frequency
recommendation and let OPP pick up the pair of frequency and voltage
based on the recommended frequency. Then, the chosen OPP is passed to
device driver's "target" callback.

When PM QoS is going to be used with the DEVFREQ device, the device
driver should enable OPPs that are appropriate with the current PM QoS
requests. In order to do so, the device driver may call opp_enable and
opp_disable at the notifier callback of PM QoS so that PM QoS's
update_target() call enables the appropriate OPPs. Note that at least
one of OPPs should be enabled at any time; be careful when there is a
transition.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>

---
Tested with memory bus of Exynos4-NURI board.

The test code with board support for Exynos4-NURI is at
http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads/devfreq

---
Thank you for your valuable comments, Rafael, Greg, Pavel, Colin, Mike,
and Kevin.

Changed from v4
- Removed tickle, which is a duplicated feature; PM QoS can do the same.
- Allow to extend polling interval if devices have longer polling intervals.
- Relocated private data of governors.
- Removed system-wide sysfs

Changed from v3
- In kerneldoc comments, DEVFREQ has ben replaced by devfreq
- Revised removing devfreq entries with error mechanism
- Added and revised comments
- Removed unnecessary codes
- Allow to give a name to a governor
- Bugfix: a tickle call may cancel an older tickle call that is still in
  effect.

Changed from v2
- Code style revised and cleaned up.
- Remove DEVFREQ entries that incur errors except for EAGAIN
- Bug fixed: tickle for devices without polling governors

Changes from v1(RFC)
- Rename: DVFS --> DEVFREQ
- Revised governor design
    . Governor receives the whole struct devfreq
    . Governor should gather usage information (thru get_dev_status)
itself
- Periodic monitoring runs only when needed.
- DEVFREQ no more deals with voltage information directly
- Removed some printks.
- Some cosmetics update
- Use freezable_wq.
---
 drivers/base/power/Makefile  |    1 +
 drivers/base/power/devfreq.c |  303 ++++++++++++++++++++++++++++++++++++++++++
 drivers/base/power/opp.c     |    9 ++
 include/linux/devfreq.h      |  103 ++++++++++++++
 kernel/power/Kconfig         |   34 +++++
 5 files changed, 450 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/power/devfreq.c
 create mode 100644 include/linux/devfreq.h

diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile
index 3647e11..20118dc 100644
--- a/drivers/base/power/Makefile
+++ b/drivers/base/power/Makefile
@@ -4,5 +4,6 @@ obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 obj-$(CONFIG_PM_OPP)	+= opp.o
 obj-$(CONFIG_HAVE_CLK)	+= clock_ops.o
+obj-$(CONFIG_PM_DEVFREQ)	+= devfreq.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
\ No newline at end of file
diff --git a/drivers/base/power/devfreq.c b/drivers/base/power/devfreq.c
new file mode 100644
index 0000000..6f4bd3a
--- /dev/null
+++ b/drivers/base/power/devfreq.c
@@ -0,0 +1,303 @@
+/*
+ * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
+ *	    for Non-CPU Devices Based on OPP.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ *	MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/opp.h>
+#include <linux/devfreq.h>
+#include <linux/workqueue.h>
+#include <linux/platform_device.h>
+#include <linux/list.h>
+#include <linux/printk.h>
+#include <linux/hrtimer.h>
+
+/*
+ * devfreq polling interval in ms.
+ * It is recommended to be "jiffy_in_ms" * n, where n is an integer >= 1.
+ */
+static unsigned int devfreq_interval = 20;
+
+/*
+ * devfreq_work periodically (given by devfreq_interval) monitors every
+ * registered device.
+ */
+static bool polling;
+static ktime_t last_polled_at;
+static struct workqueue_struct *devfreq_wq;
+static struct delayed_work devfreq_work;
+
+/* The list of all device-devfreq */
+static LIST_HEAD(devfreq_list);
+static DEFINE_MUTEX(devfreq_list_lock);
+
+/**
+ * find_device_devfreq() - find devfreq struct using device pointer
+ * @dev:	device pointer used to lookup device devfreq.
+ *
+ * Search the list of device devfreqs and return the matched device's
+ * devfreq info. devfreq_list_lock should be held by the caller.
+ */
+static struct devfreq *find_device_devfreq(struct device *dev)
+{
+	struct devfreq *tmp_devfreq;
+
+	if (unlikely(IS_ERR_OR_NULL(dev))) {
+		pr_err("DEVFREQ: %s: Invalid parameters\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	list_for_each_entry(tmp_devfreq, &devfreq_list, node) {
+		if (tmp_devfreq->dev == dev)
+			return tmp_devfreq;
+	}
+
+	return ERR_PTR(-ENODEV);
+}
+
+/**
+ * devfreq_do() - Check the usage profile of a given device and configure
+ *		frequency and voltage accordingly
+ * @devfreq:	devfreq info of the given device
+ */
+static int devfreq_do(struct devfreq *devfreq)
+{
+	struct opp *opp;
+	unsigned long freq;
+	int err;
+
+	err = devfreq->governor->get_target_freq(devfreq, &freq);
+	if (err)
+		return err;
+
+	opp = opp_find_freq_ceil(devfreq->dev, &freq);
+	if (opp == ERR_PTR(-ENODEV))
+		opp = opp_find_freq_floor(devfreq->dev, &freq);
+
+	if (IS_ERR(opp))
+		return PTR_ERR(opp);
+
+	if (devfreq->previous_freq == freq)
+		return 0;
+
+	err = devfreq->profile->target(devfreq->dev, opp);
+	if (err)
+		return err;
+
+	devfreq->previous_freq = freq;
+	return 0;
+}
+
+/**
+ * devfreq_monitor() - Periodically run devfreq_do()
+ * @work: the work struct used to run devfreq_monitor periodically.
+ *
+ */
+static void devfreq_monitor(struct work_struct *work)
+{
+	struct devfreq *devfreq, *tmp;
+	int error;
+	unsigned int next_poll_min = UINT_MAX;
+	ktime_t now = ktime_get();
+	s64 time_passed = ktime_to_ms(ktime_sub(now, last_polled_at));
+	int iterations_passed = 1;
+
+	/* If n * devfreq_interval has passed, count it */
+	do_div(time_passed, devfreq_interval);
+	if (time_passed > 1)
+		iterations_passed = time_passed;
+	last_polled_at = now;
+
+	mutex_lock(&devfreq_list_lock);
+
+	list_for_each_entry_safe(devfreq, tmp, &devfreq_list, node) {
+		if (devfreq->next_polling == 0)
+			continue;
+
+		/*
+		 * Reduce more next_polling if devfreq_wq took an extra
+		 * delay. (i.e., CPU has been idled.)
+		 */
+		if (devfreq->next_polling <= iterations_passed) {
+			error = devfreq_do(devfreq);
+
+			/* Remove a devfreq with an error. */
+			if (error && error != -EAGAIN) {
+				dev_err(devfreq->dev, "devfreq_do error(%d). "
+					"devfreq is removed from the device\n",
+					error);
+
+				list_del(&devfreq->node);
+				kfree(devfreq);
+
+				continue;
+			}
+			devfreq->next_polling = DIV_ROUND_UP(
+						devfreq->profile->polling_ms,
+						devfreq_interval);
+
+			/* No more polling required (polling_ms changed) */
+			if (devfreq->next_polling == 0)
+				continue;
+		} else {
+			devfreq->next_polling -= iterations_passed;
+		}
+
+		next_poll_min = (next_poll_min > devfreq->next_polling) ?
+				devfreq->next_polling : next_poll_min;
+	}
+
+	if (next_poll_min > 0 && next_poll_min < UINT_MAX) {
+		polling = true;
+		queue_delayed_work(devfreq_wq, &devfreq_work, msecs_to_jiffies(
+					   devfreq_interval * next_poll_min));
+	} else {
+		polling = false;
+	}
+
+	mutex_unlock(&devfreq_list_lock);
+}
+
+/**
+ * devfreq_add_device() - Add devfreq feature to the device
+ * @dev:	the device to add devfreq feature.
+ * @profile:	device-specific profile to run devfreq.
+ * @governor:	the policy to choose frequency.
+ */
+int devfreq_add_device(struct device *dev, struct devfreq_dev_profile *profile,
+		       struct devfreq_governor *governor)
+{
+	struct devfreq *new_devfreq, *devfreq;
+	int err = 0;
+
+	if (!dev || !profile || !governor) {
+		dev_err(dev, "%s: Invalid parameters.\n", __func__);
+		return -EINVAL;
+	}
+
+	mutex_lock(&devfreq_list_lock);
+
+	devfreq = find_device_devfreq(dev);
+	if (!IS_ERR(devfreq)) {
+		dev_err(dev, "%s: Unable to create devfreq for the device. "
+			"It already has one.\n", __func__);
+		err = -EINVAL;
+		goto out;
+	}
+
+	new_devfreq = kzalloc(sizeof(struct devfreq), GFP_KERNEL);
+	if (!new_devfreq) {
+		dev_err(dev, "%s: Unable to create devfreq for the device\n",
+			__func__);
+		err = -ENOMEM;
+		goto out;
+	}
+
+	new_devfreq->dev = dev;
+	new_devfreq->profile = profile;
+	new_devfreq->governor = governor;
+	new_devfreq->next_polling = DIV_ROUND_UP(profile->polling_ms,
+						 devfreq_interval);
+	new_devfreq->previous_freq = profile->initial_freq;
+
+	list_add(&new_devfreq->node, &devfreq_list);
+
+	if (devfreq_wq && new_devfreq->next_polling && !polling) {
+		polling = true;
+		queue_delayed_work(devfreq_wq, &devfreq_work,
+				   msecs_to_jiffies(devfreq_interval));
+	}
+out:
+	mutex_unlock(&devfreq_list_lock);
+
+	return err;
+}
+
+/**
+ * devfreq_remove_device() - Remove devfreq feature from a device.
+ * @device:	the device to remove devfreq feature.
+ */
+int devfreq_remove_device(struct device *dev)
+{
+	struct devfreq *devfreq;
+
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&devfreq_list_lock);
+	devfreq = find_device_devfreq(dev);
+	if (IS_ERR(devfreq)) {
+		dev_err(dev, "%s: Unable to find devfreq entry for the device.\n",
+			__func__);
+		mutex_unlock(&devfreq_list_lock);
+		return -EINVAL;
+	}
+
+	list_del(&devfreq->node);
+
+	kfree(devfreq);
+
+	mutex_unlock(&devfreq_list_lock);
+
+	return 0;
+}
+
+/**
+ * devfreq_update() - Notify that the device OPP has been changed.
+ * @dev:	the device whose OPP has been changed.
+ */
+int devfreq_update(struct device *dev)
+{
+	struct devfreq *devfreq;
+	int err = 0;
+
+	mutex_lock(&devfreq_list_lock);
+
+	devfreq = find_device_devfreq(dev);
+	if (IS_ERR(devfreq)) {
+		err = PTR_ERR(devfreq);
+		goto out;
+	}
+
+	/* Reevaluate the proper frequency */
+	err = devfreq_do(devfreq);
+
+out:
+	mutex_unlock(&devfreq_list_lock);
+	return err;
+}
+
+/**
+ * devfreq_init() - Initialize data structure for devfreq framework and
+ *		  start polling registered devfreq devices.
+ */
+static int __init devfreq_init(void)
+{
+	devfreq_interval = jiffies_to_msecs(msecs_to_jiffies(devfreq_interval));
+	if (devfreq_interval <= 0) {
+		pr_err("DEVFREQ: devfreq_interval too small.\n");
+		return -EINVAL;
+	}
+
+	mutex_lock(&devfreq_list_lock);
+	polling = false;
+	devfreq_wq = create_freezable_workqueue("devfreq_wq");
+	INIT_DELAYED_WORK_DEFERRABLE(&devfreq_work, devfreq_monitor);
+	mutex_unlock(&devfreq_list_lock);
+
+	last_polled_at = ktime_get();
+	devfreq_monitor(&devfreq_work.work);
+	return 0;
+}
+late_initcall(devfreq_init);
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 56a6899..819c1b3 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -21,6 +21,7 @@
 #include <linux/rculist.h>
 #include <linux/rcupdate.h>
 #include <linux/opp.h>
+#include <linux/devfreq.h>
 
 /*
  * Internal data structure organization with the OPP layer library is as
@@ -428,6 +429,11 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
 	list_add_rcu(&new_opp->node, head);
 	mutex_unlock(&dev_opp_list_lock);
 
+	/*
+	 * Notify generic dvfs for the change and ignore error
+	 * because the device may not have a devfreq entry
+	 */
+	devfreq_update(dev);
 	return 0;
 }
 
@@ -512,6 +518,9 @@ unlock:
 	mutex_unlock(&dev_opp_list_lock);
 out:
 	kfree(new_opp);
+
+	/* Notify generic dvfs for the change and ignore error */
+	devfreq_update(dev);
 	return r;
 }
 
diff --git a/include/linux/devfreq.h b/include/linux/devfreq.h
new file mode 100644
index 0000000..6ec630b
--- /dev/null
+++ b/include/linux/devfreq.h
@@ -0,0 +1,103 @@
+/*
+ * devfreq: Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework
+ *	    for Non-CPU Devices Based on OPP.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ *	MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __LINUX_DEVFREQ_H__
+#define __LINUX_DEVFREQ_H__
+
+#define DEVFREQ_NAME_LEN 16
+
+struct devfreq;
+struct devfreq_dev_status {
+	/* both since the last measure */
+	unsigned long total_time;
+	unsigned long busy_time;
+	unsigned long current_frequency;
+};
+
+struct devfreq_dev_profile {
+	unsigned long max_freq; /* may be larger than the actual value */
+	unsigned long initial_freq;
+	int polling_ms;	/* 0 for at opp change only */
+
+	int (*target)(struct device *dev, struct opp *opp);
+	int (*get_dev_status)(struct device *dev,
+			      struct devfreq_dev_status *stat);
+};
+
+/**
+ * struct devfreq_governor - Devfreq policy governor
+ * @name		Governor's name
+ * @get_target_freq	Returns desired operating frequency for the device.
+ *			Basically, get_target_freq will run
+ *			devfreq_dev_profile.get_dev_status() to get the
+ *			status of the device (load = busy_time / total_time).
+ */
+struct devfreq_governor {
+	char name[DEVFREQ_NAME_LEN];
+	int (*get_target_freq)(struct devfreq *this, unsigned long *freq);
+};
+
+/**
+ * struct devfreq - Device devfreq structure
+ * @node	list node - contains the devices with devfreq that have been
+ *		registered.
+ * @dev		device pointer
+ * @profile	device-specific devfreq profile
+ * @governor	method how to choose frequency based on the usage.
+ * @previous_freq	previously configured frequency value.
+ * @next_polling	the number of remaining "devfreq_monitor" executions to
+ *			reevaluate frequency/voltage of the device. Set by
+ *			profile's polling_ms interval.
+ * @data	Private data of the governor. The devfreq framework does not
+ *		touch this.
+ *
+ * This structure stores the devfreq information for a give device.
+ */
+struct devfreq {
+	struct list_head node;
+
+	struct device *dev;
+	struct devfreq_dev_profile *profile;
+	struct devfreq_governor *governor;
+
+	unsigned long previous_freq;
+	unsigned int next_polling;
+
+	void *data; /* private data for governors */
+};
+
+#if defined(CONFIG_PM_DEVFREQ)
+extern int devfreq_add_device(struct device *dev,
+			   struct devfreq_dev_profile *profile,
+			   struct devfreq_governor *governor);
+extern int devfreq_remove_device(struct device *dev);
+extern int devfreq_update(struct device *dev);
+#else /* !CONFIG_PM_DEVFREQ */
+static int devfreq_add_device(struct device *dev,
+			   struct devfreq_dev_profile *profile,
+			   struct devfreq_governor *governor)
+{
+	return 0;
+}
+
+static int devfreq_remove_device(struct device *dev)
+{
+	return 0;
+}
+
+static int devfreq_update(struct device *dev)
+{
+	return 0;
+}
+#endif /* CONFIG_PM_DEVFREQ */
+
+#endif /* __LINUX_DEVFREQ_H__ */
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index 87f4d24..b7e15c8 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -227,3 +227,37 @@ config PM_OPP
 config PM_RUNTIME_CLK
 	def_bool y
 	depends on PM_RUNTIME && HAVE_CLK
+
+config ARCH_HAS_DEVFREQ
+	bool
+	depends on ARCH_HAS_OPP
+	help
+	  Denotes that the architecture supports DEVFREQ. If the architecture
+	  supports multiple OPP entries per device and the frequency of the
+	  devices with OPPs may be altered dynamically, the architecture
+	  supports DEVFREQ.
+
+config PM_DEVFREQ
+	bool "Generic Dynamic Voltage and Frequency Scaling (DVFS) Framework"
+	depends on PM_OPP && ARCH_HAS_DEVFREQ
+	help
+	  With OPP support, a device may have a list of frequencies and
+	  voltages available. DEVFREQ, a generic DVFS framework can be
+	  registered for a device with OPP support in order to let the
+	  governor provided to DEVFREQ choose an operating frequency
+	  based on the OPP's list and the policy given with DEVFREQ.
+
+	  Each device may have its own governor and policy. DEVFREQ can
+	  reevaluate the device state periodically and/or based on the
+	  OPP list changes (each frequency/voltage pair in OPP may be
+	  disabled or enabled).
+
+	  Like some CPUs with CPUFREQ, a device may have multiple clocks.
+	  However, because the clock frequencies of a single device are
+	  determined by the single device's state, an instance of DEVFREQ
+	  is attached to a single device and returns a "representative"
+	  clock frequency from the OPP of the device, which is also attached
+	  to a device by 1-to-1. The device registering DEVFREQ takes the
+	  responsiblity to "interpret" the frequency listed in OPP and
+	  to set its every clock accordingly with the "target" callback
+	  given to DEVFREQ.
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH v5 0/3] DEVFREQ, DVFS Framework for Non-CPU Devices.
From: MyungJoo Ham @ 2011-08-08  9:03 UTC (permalink / raw)
  To: linux-pm; +Cc: Len Brown, Greg Kroah-Hartman, Kyungmin Park, Thomas Gleixner

For a usage example, please look at
http://git.infradead.org/users/kmpark/linux-2.6-samsung/shortlog/refs/heads/devfreq

In the above git tree, DVFS (dynamic voltage and frequency scaling) mechanism
is applied to the memory bus of Exynos4210 for Exynos4210-NURI boards.
In the example, the LPDDR2 DRAM frequency changes between 133, 266, and 400MHz
and other related clocks simply follow the determined DDR RAM clock.

The DEVFREQ driver for Exynos4210 memory bus is at
/arch/arm/mach-exynos4/devfreq_bus.c in the git tree.

In the dd (writing and reading 360MiB) test with NURI board, the memory
throughput was not changed (the performance is not deteriorated) while
the SoC power consumption has been reduced by 1%. When the memory access
is not that intense while the CPU is heavily used, the SoC power consumption
has been reduced by 6%. The power consumption has been compared with the
case using the conventional Exynos4210 CPUFREQ driver, which sets memory
bus frequency according to the CPU core frequency. Besides, when the CPU core
running slow and the memory access is intense, the performance (memory
throughput) has been increased by 11% (with higher SoC power consumption of
5%). The tested governor is "simple-ondemand".

MyungJoo Ham (3):
  PM: Introduce DEVFREQ: generic DVFS framework with device-specific
    OPPs
  PM / DEVFREQ: add basic governors
  PM / DEVFREQ: add sysfs interface

 Documentation/ABI/testing/sysfs-devices-power |   45 ++
 drivers/base/power/Makefile                   |    1 +
 drivers/base/power/devfreq.c                  |  555 +++++++++++++++++++++++++
 drivers/base/power/opp.c                      |    9 +
 include/linux/devfreq.h                       |  111 +++++
 kernel/power/Kconfig                          |   34 ++
 6 files changed, 755 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/power/devfreq.c
 create mode 100644 include/linux/devfreq.h

-- 
1.7.4.1

^ permalink raw reply

* [PATCH v2] PM: add statistics debugfs file for suspend to ram
From: Liu, ShuoX @ 2011-08-08  8:53 UTC (permalink / raw)
  To: linux-pm@lists.linux-foundation.org
  Cc: Brown, Len, Greg KH, linux-kernel@vger.kernel.org

From: ShuoX Liu <shuox.liu@intel.com>

Record S3 failure time about each reason and the latest two failed
devices' names in S3 progress.
We can check it through 'suspend_stats' entry in debugfs.

The motivation of the patch:

We are enabling power features on Medfield. Comparing with PC/notebook,
a mobile enters/exits suspend-2-ram (we call it s3 on Medfield) far
more frequently. If it can't enter suspend-2-ram in time, the power
might be used up soon.

We often find sometimes, a device suspend fails. Then, system retries
s3 over and over again. As display is off, testers and developers
don't know what happens.

Some testers and developers complain they don't know if system
tries suspend-2-ram, and what device fails to suspend. They need
such info for a quick check. The patch adds suspend_stats under
debugfs for users to check suspend to RAM statistics quickly.

If not using this patch, we have other methods to get info about
what device fails. One is to turn on  CONFIG_PM_DEBUG, but users
would get too much info and testers need recompile the system.

In addition, dynamic debug is another good tool to dump debug info.
But it still doesn't match our utilization scenario closely.
1) user need write a user space parser to process the syslog output;
2) Our testing scenario is we leave the mobile for at least hours.
   Then, check its status. No serial console available during the
   testing. One is because console would be suspended, and the other
   is serial console connecting with spi or HSU devices would consume
   power. These devices are powered off at suspend-2-ram.

Thank Greg and Rafael for their great comments.

Change Log:
  V2:
	1) Change the sysfs entry to debugfs.
	2) Add some explanation in document file.

Contribution:
	yanmin_zhang@linux.intel.com

Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
---
 Documentation/power/basic-pm-debugging.txt |   19 +++++++++
 drivers/base/power/main.c                  |   31 +++++++++++++-
 include/linux/suspend.h                    |   16 +++++++
 kernel/power/main.c                        |   59 ++++++++++++++++++++++++++++
 kernel/power/suspend.c                     |   13 +++++-
 5 files changed, 133 insertions(+), 5 deletions(-)

diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index ddd7817..acbd3e0 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -201,3 +201,22 @@ case, you may be able to search for failing drivers by following the procedure
 analogous to the one described in section 1.  If you find some failing drivers,
 you will have to unload them every time before an STR transition (ie. before
 you run s2ram), and please report the problems with them.
+
+There is a debugfs entry which shows the suspend to RAM statistics. Here is an
+example of its output.
+	# mount -t debugfs none /sys/kernel/debug
+	# cat /sys/kernel/debug/suspend_stats
+	success: 20
+	fail: 5
+	failed_freeze: 0
+	failed_prepare: 0
+	failed_suspend: 5
+	failed_suspend_noirq: 0
+	failed_resume: 0
+	failed_resume_noirq: 0
+	failed_devs:
+	  last_failed: alarm
+		       adc
+Field success means the success number of suspend to RAM, and field fail means
+the failure number. Others are the failure number of different steps of suspend
+to RAM. suspend_stats just lists the last 2 failed devices.
diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index a854591..da1c561 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -46,6 +46,7 @@ LIST_HEAD(dpm_prepared_list);
 LIST_HEAD(dpm_suspended_list);
 LIST_HEAD(dpm_noirq_list);
 
+struct suspend_stats suspend_stats;
 static DEFINE_MUTEX(dpm_list_mtx);
 static pm_message_t pm_transition;
 
@@ -180,6 +181,15 @@ static void initcall_debug_report(struct device *dev, ktime_t calltime,
 	}
 }
 
+static void dpm_save_dev_name(const char *name)
+{
+	strlcpy(suspend_stats.failed_devs[suspend_stats.last_failed],
+		name,
+		sizeof(suspend_stats.failed_devs[0]));
+	suspend_stats.last_failed++;
+	suspend_stats.last_failed %= REC_FAILED_DEV_NUM;
+}
+
 /**
  * dpm_wait - Wait for a PM operation to complete.
  * @dev: Device to wait for.
@@ -464,8 +474,11 @@ void dpm_resume_noirq(pm_message_t state)
 		mutex_unlock(&dpm_list_mtx);
 
 		error = device_resume_noirq(dev, state);
-		if (error)
+		if (error) {
+			suspend_stats.failed_resume_noirq++;
+			dpm_save_dev_name(dev_name(dev));
 			pm_dev_err(dev, state, " early", error);
+		}
 
 		mutex_lock(&dpm_list_mtx);
 		put_device(dev);
@@ -626,8 +639,11 @@ void dpm_resume(pm_message_t state)
 			mutex_unlock(&dpm_list_mtx);
 
 			error = device_resume(dev, state, false);
-			if (error)
+			if (error) {
+				suspend_stats.failed_resume++;
+				dpm_save_dev_name(dev_name(dev));
 				pm_dev_err(dev, state, "", error);
+			}
 
 			mutex_lock(&dpm_list_mtx);
 		}
@@ -802,6 +818,8 @@ int dpm_suspend_noirq(pm_message_t state)
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			pm_dev_err(dev, state, " late", error);
+			suspend_stats.failed_suspend_noirq++;
+			dpm_save_dev_name(dev_name(dev));
 			put_device(dev);
 			break;
 		}
@@ -923,8 +941,10 @@ static void async_suspend(void *data, async_cookie_t cookie)
 	int error;
 
 	error = __device_suspend(dev, pm_transition, true);
-	if (error)
+	if (error) {
+		dpm_save_dev_name(dev_name(dev));
 		pm_dev_err(dev, pm_transition, " async", error);
+	}
 
 	put_device(dev);
 }
@@ -967,6 +987,7 @@ int dpm_suspend(pm_message_t state)
 		mutex_lock(&dpm_list_mtx);
 		if (error) {
 			pm_dev_err(dev, state, "", error);
+			dpm_save_dev_name(dev_name(dev));
 			put_device(dev);
 			break;
 		}
@@ -982,6 +1003,8 @@ int dpm_suspend(pm_message_t state)
 		error = async_error;
 	if (!error)
 		dpm_show_time(starttime, state, NULL);
+	else
+		suspend_stats.failed_suspend++;
 	return error;
 }
 
@@ -1090,6 +1113,8 @@ int dpm_suspend_start(pm_message_t state)
 	error = dpm_prepare(state);
 	if (!error)
 		error = dpm_suspend(state);
+	else
+		suspend_stats.failed_prepare++;
 	return error;
 }
 EXPORT_SYMBOL_GPL(dpm_suspend_start);
diff --git a/include/linux/suspend.h b/include/linux/suspend.h
index 6bbcef2..6a8ff23 100644
--- a/include/linux/suspend.h
+++ b/include/linux/suspend.h
@@ -34,6 +34,22 @@ typedef int __bitwise suspend_state_t;
 #define PM_SUSPEND_MEM		((__force suspend_state_t) 3)
 #define PM_SUSPEND_MAX		((__force suspend_state_t) 4)
 
+struct suspend_stats {
+	int	success;
+	int	fail;
+	int	failed_freeze;
+	int	failed_prepare;
+	int	failed_suspend;
+	int	failed_suspend_noirq;
+	int	failed_resume;
+	int	failed_resume_noirq;
+#define	REC_FAILED_DEV_NUM	2
+	char	failed_devs[REC_FAILED_DEV_NUM][40];
+	int	last_failed;
+};
+
+extern struct suspend_stats suspend_stats;
+
 /**
  * struct platform_suspend_ops - Callbacks for managing platform dependent
  *	system sleep states.
diff --git a/kernel/power/main.c b/kernel/power/main.c
index 6c601f8..0977f5d 100644
--- a/kernel/power/main.c
+++ b/kernel/power/main.c
@@ -12,6 +12,8 @@
 #include <linux/string.h>
 #include <linux/resume-trace.h>
 #include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include "power.h"
 
@@ -133,6 +135,59 @@ power_attr(pm_test);
 
 #endif /* CONFIG_PM_SLEEP */
 
+#ifdef CONFIG_DEBUG_FS
+static int suspend_stats_show(struct seq_file *s, void *unused)
+{
+	int i, index, last_index;
+
+	last_index = suspend_stats.last_failed + REC_FAILED_DEV_NUM - 1;
+	last_index %= REC_FAILED_DEV_NUM;
+	seq_printf(s, "%s: %d\n%s: %d\n%s: %d\n%s: %d\n"
+			"%s: %d\n%s: %d\n%s: %d\n%s: %d\n",
+			"success", suspend_stats.success,
+			"fail", suspend_stats.fail,
+			"failed_freeze", suspend_stats.failed_freeze,
+			"failed_prepare", suspend_stats.failed_prepare,
+			"failed_suspend", suspend_stats.failed_suspend,
+			"failed_suspend_noirq",
+				suspend_stats.failed_suspend_noirq,
+			"failed_resume", suspend_stats.failed_resume,
+			"failed_resume_noirq",
+				suspend_stats.failed_resume_noirq);
+	seq_printf(s,	"failed_devs:\n  last_failed:\t%s\n",
+			suspend_stats.failed_devs[last_index]);
+	for (i = 1; i < REC_FAILED_DEV_NUM; i++) {
+		index = last_index + REC_FAILED_DEV_NUM - i;
+		index %= REC_FAILED_DEV_NUM;
+		seq_printf(s, "\t\t%s\n",
+			suspend_stats.failed_devs[index]);
+	}
+
+	return 0;
+}
+
+static int suspend_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, suspend_stats_show, NULL);
+}
+
+static const struct file_operations suspend_stats_operations = {
+	.open           = suspend_stats_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = single_release,
+};
+
+static int __init pm_debugfs_init(void)
+{
+	debugfs_create_file("suspend_stats", S_IFREG | S_IRUGO,
+			NULL, NULL, &suspend_stats_operations);
+	return 0;
+}
+
+late_initcall(pm_debugfs_init);
+#endif /* CONFIG_DEBUG_FS */
+
 struct kobject *power_kobj;
 
 /**
@@ -194,6 +249,10 @@ static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
 	}
 	if (state < PM_SUSPEND_MAX && *s)
 		error = enter_state(state);
+		if (error)
+			suspend_stats.fail++;
+		else
+			suspend_stats.success++;
 #endif
 
  Exit:
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index b6b71ad..9bb4281 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -106,6 +106,8 @@ static int suspend_prepare(void)
 	error = suspend_freeze_processes();
 	if (!error)
 		return 0;
+	else
+		suspend_stats.failed_freeze++;
 
 	suspend_thaw_processes();
 	usermodehelper_enable();
@@ -315,8 +317,15 @@ int enter_state(suspend_state_t state)
  */
 int pm_suspend(suspend_state_t state)
 {
-	if (state > PM_SUSPEND_ON && state <= PM_SUSPEND_MAX)
-		return enter_state(state);
+	int ret;
+	if (state > PM_SUSPEND_ON && state <= PM_SUSPEND_MAX) {
+		ret = enter_state(state);
+		if (ret)
+			suspend_stats.fail++;
+		else
+			suspend_stats.success++;
+		return ret;
+	}
 	return -EINVAL;
 }
 EXPORT_SYMBOL(pm_suspend);

^ permalink raw reply related

* Re: [PATCH] PM: add statistics sysfs file for suspend to ram
From: Yanmin Zhang @ 2011-08-08  3:48 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Liu, ShuoX, linux-pm@lists.linux-foundation.org, Greg KH,
	Brown, Len
In-Reply-To: <201108052120.14901.rjw@sisk.pl>

On Fri, 2011-08-05 at 21:20 +0200, Rafael J. Wysocki wrote:
> On Friday, August 05, 2011, Yanmin Zhang wrote:
> > On Fri, 2011-08-05 at 09:57 +0800, Liu, ShuoX wrote:
> > > > On Thursday, August 04, 2011, Greg KH wrote:
> > > > > On Thu, Aug 04, 2011 at 01:13:38PM +0800, Liu, ShuoX wrote:
> > > > > > >From a906b0b5b4ff3414ceb9fc7a69a3d7c9d66e46b1 Mon Sep 17
> > > > 00:00:00 2001
> > > > > > From: ShuoX Liu <shuox.liu@intel.com>
> > > > > > Date: Thu, 28 Jul 2011 10:54:22 +0800
> > > > > > Subject: [PATCH] PM: add statistics sysfs file for suspend to ram.
> > > > >
> > > > > What's this stuff here for?  That's not needed (hint, I would have to
> > > > > edit it out by hand to be able to apply this patch.)
> > > > >
> > > > > Thanks for resending a version that passes checkpatch.pl and could be
> > > > > applied, but all of my previous comments still stand.  This patch, as
> > > > > is, is totally unacceptable.
> > > > 
> > > > Agreed, plus I'd like to know the motivation behind it.  That is, we have
> > > > quite a few debug facilities in that code, so why are they insufficient?
> > Thanks Greg, Rafael. We are changing the patch based on your comments.
> > 
> > > 
> > > Some explanation from Yanmin,
> > > "We are enabling power features on Medfield. Some testers and developers
> > > complain they don't know if system tries suspend-2-ram, and what device 
> > > fails to suspend. They need such info for a quick check. If turning on 
> > > CONFIG_PM_DEBUG, we get too much info and testers need recompile 
> > > the system.
> > Comparing with PC/notebook, a mobile enters/exits suspend-2-ram (we call it s3 on
> > Medfield) far more frequently. If it can't enter suspend-2-ram in time, the power
> > might be used up soon.
> > 
> > We often find sometimes, a device suspend fails. Then, system retries s3 over
> > and over again. As display is off, testers/developers don't know what happens.
> > Teh system 
> > 
> > With the patch, we could know what the bad device is.
> > 
> > The patch doesn't hurt performance as it's just statistics collector.
> > 
> > CONFIG_PM_DEBUG is very useful for finer investigation about what happens behind. What
> > we provide by the patch is to analyze the issues quickly, even by an ordinary tester.
> 
> Well, what about using dynamic debug?
Thanks for the nice pointer. I checked dynamic debug. It's really a good debug tool.
With the dynamic debug:
1) user need write a user space parser to process the syslog output;
2) Our testing scenario is we leave the mobile for at least hours. Then, check its status.
No serial console available during the testing. One is because console would be suspended,
and the other is serial console connecting with spi or HSU devices would consume power. These
devices are powered off at suspend-2-ram.

Below is an example output of the statistics from my mobile (we are changing the output
from sysfs to debugfs now):
#adb shell cat /sys/power/suspend_stats
success: 5
fail: 1
failed_freeze: 0
failed_prepare: 0
failed_suspend: 1
failed_suspend_noirq: 0
failed_resume: 0
failed_resume_noirq: 0
failed_devs:
  last_failed:	alarm

^ permalink raw reply

* Re: LCDC problem after commit 794d78fea51504bad3880d14f354a9847f318f25
From: Magnus Damm @ 2011-08-08  1:46 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Linux PM mailing list, linux-sh
In-Reply-To: <201108080125.34329.rjw@sisk.pl>

Hi Rafael,

On Mon, Aug 8, 2011 at 8:25 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> Hi,
>
> Unfortunately, after commit 794d78fea51504bad3880d14f354a9847f318f25
> (drivers: sh: late disabling of clocks V2) the LCDC screen on my
> Mackerel board goes off (late) during boot and cannot be turned on
> by any way AFAICS (the backlight remains on, though).
>
> It looks like the late disabling of clocks interacts badly with the
> clock management done by PM domains.

This is most likely showing a so-far-not-handled clock dependency for
the LCDC hardware. I'll have a look.

Thanks,

/ magnus

^ permalink raw reply

* LCDC problem after commit 794d78fea51504bad3880d14f354a9847f318f25
From: Rafael J. Wysocki @ 2011-08-07 23:25 UTC (permalink / raw)
  To: Magnus Damm; +Cc: Linux PM mailing list, linux-sh

Hi,

Unfortunately, after commit 794d78fea51504bad3880d14f354a9847f318f25
(drivers: sh: late disabling of clocks V2) the LCDC screen on my
Mackerel board goes off (late) during boot and cannot be turned on
by any way AFAICS (the backlight remains on, though).

It looks like the late disabling of clocks interacts badly with the
clock management done by PM domains.

Thanks,
Rafael

^ permalink raw reply

* Re: [PATCH 02/11] PM: extend PM QoS with per-device wake-up constraints
From: Mark Brown @ 2011-08-07  2:47 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM mailing list, linux-omap, Jean Pihet, markgross
In-Reply-To: <201108062146.20423.rjw@sisk.pl>

On Sat, Aug 06, 2011 at 09:46:20PM +0200, Rafael J. Wysocki wrote:
> On Saturday, August 06, 2011, Mark Brown wrote:

> > I wouldn't say we've got to rely on userspace here - it seems like we
> > ought to be able to use DMI or other system data to identify the
> > affected systems and activate the workarounds.

> That's only practical on systems where the kernel can be rebuilt.
> Moreover, if that affects individual devices, using DMI-based blacklists
> is not really practical at all.

OK, so this does sound like there's probably a genuine issue on PCs due
to limitations in the environment.  Is it possible to expose these
things to userspace in a way that's limited to the affected platforms?

> > You're right that it doesn't stop the kernel doing anything, the concern
> > is that people just won't bother making the kernel work properly and
> > will just do their power management in userspace and not bother fixing
> > the kernel.

> I wouldn't call it "fixing the kernel".  I'd rather say "putting workarounds
> into the kernel", which I'm not sure is the right thing to at least in some
> cases.

That does sound like a fair characterization for PCs.  For embedded
systems where we have a *much* better knowledge of the hardware we're
running on you're just working with the basics of what the hardware
needs to run - the hardware needs whatever it needs and no matter what
you think of the quality of the hardware there's an expectation that the
kernel is ging to be able to work.

> > Punting to userspace seems like it is creating the
> > expectation that we can't make the kernel work and isn't great from a
> > usability perspective since users shouldn't really be worrying about bus
> > performance or so on, it's not something that's visible at the level
> > applications work at.

> However, platform builders may want to fine tuned things and I'm not sure
> we should require them to patch the kernel for this purpose.

As I've said it's not the fine tuning that I'm worried about, it's the
specific mechanism that's being suggested.  Being able to tune things in
a way that's relevant to the device being tuned seems entirely sensible.

> > I'm not sure I can see a lot of cases where you'd have root access and
> > not be able to do kernel updates if required?  Having stuff in debugfs
> > for diagnostics doesn't strike me as a problem if that's the issue.

> Root access doesn't necessarily mean you have all of the requisite
> tools (like compilers etc.) and installing them isn't always trivial
> (think of systems like phones etc.), let alone building the kernel from
> sources (where you don't necessarily know the original .config used for
> building your device's kernel).

Phones are exactly the sort of case I'm primarily concerned with here.

Realistically if you're in a position where you need to work at this
very low level on an embedded device you can replace the entire firmware
on the device.  We don't want to end up in the situation where we've got
kernel support for a device and the only way to get it to actually run
sensibly is to install the silicon vendor's proprietary userspace, and
we especially don't want to end up in the situation where that userspace
is using standard and supported kernel interfaces to do its thing.

> IOW, I don't buy the "you can always rebuild the kernel if necessary"
> argument.  It simply is not true in general.

You can push that argument to extremes, of course.

^ permalink raw reply

* Re: [PATCH] PM / Freezer: Freeze filesystems while freezing processes (v2)
From: Dave Chinner @ 2011-08-07  0:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christoph, Theodore Ts'o, LKML, xfs, Christoph Hellwig,
	linux-fsdevel, Linux PM mailing list, linux-ext4
In-Reply-To: <201108062317.19033.rjw@sisk.pl>

On Sat, Aug 06, 2011 at 11:17:18PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rjw@sisk.pl>
> 
> Freeze all filesystems during the freezing of tasks by calling
> freeze_bdev() for each of them and thaw them during the thawing
> of tasks with the help of thaw_bdev().
> 
> This is needed by hibernation, because some filesystems (e.g. XFS)
> deadlock with the preallocation of memory used by it if the memory
> pressure caused by it is too heavy.
> 
> The additional benefit of this change is that, if something goes
> wrong after filesystems have been frozen, they will stay in a
> consistent state and journal replays won't be necessary (e.g. after
> a failing suspend or resume).  In particular, this should help to
> solve a long-standing issue that in some cases during resume from
> hibernation the boot loader causes the journal to be replied for the
> filesystem containing the kernel image and initrd causing it to
> become inconsistent with the information stored in the hibernation
> image.
> 
> This change is based on earlier work by Nigel Cunningham.
> 
> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
> ---
> 
> OK, so nobody except for Pavel appears to have any comments, so I assume
> that everyone except for Pavel is fine with the approach, interestingly enough.
> 
> I've removed the MS_FROZEN Pavel complained about from freeze_filesystems()
> and added comments explaining why lockdep_off/on() are used.
> 
> Thanks,
> Rafael
> 
> ---
>  fs/block_dev.c         |   56 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h     |    6 +++++
>  kernel/power/process.c |    7 +++++-
>  3 files changed, 68 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -211,6 +211,7 @@ struct inodes_stat_t {
>  #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
>  #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
>  #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
> +#define MS_FROZEN	(1<<25) /* bdev has been frozen */
>  #define MS_NOSEC	(1<<28)
>  #define MS_BORN		(1<<29)
>  #define MS_ACTIVE	(1<<30)
> @@ -2047,6 +2048,8 @@ extern struct super_block *freeze_bdev(s
>  extern void emergency_thaw_all(void);
>  extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
>  extern int fsync_bdev(struct block_device *);
> +extern void freeze_filesystems(void);
> +extern void thaw_filesystems(void);
>  #else
>  static inline void bd_forget(struct inode *inode) {}
>  static inline int sync_blockdev(struct block_device *bdev) { return 0; }
> @@ -2061,6 +2064,9 @@ static inline int thaw_bdev(struct block
>  {
>  	return 0;
>  }
> +
> +static inline void freeze_filesystems(void) {}
> +static inline void thaw_filesystems(void) {}
>  #endif
>  extern int sync_filesystem(struct super_block *);
>  extern const struct file_operations def_blk_fops;
> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c
> +++ linux-2.6/fs/block_dev.c
> @@ -314,6 +314,62 @@ out:
>  }
>  EXPORT_SYMBOL(thaw_bdev);
>  
> +/**
> + * freeze_filesystems - Force all filesystems into a consistent state.
> + */
> +void freeze_filesystems(void)
> +{
> +	struct super_block *sb;
> +
> +	/*
> +	 * This is necessary, because some filesystems (e.g. ext3) lock
> +	 * mutexes in their .freeze_fs() callbacks and leave them locked for
> +	 * their .unfreeze_fs() callbacks to unlock.  This is done under
> +	 * bdev->bd_fsfreeze_mutex, which is then released, but it makes
> +	 * lockdep think something may be wrong when freeze_bdev() attempts
> +	 * to acquire bdev->bd_fsfreeze_mutex for the next filesystem.
> +	 */
> +	lockdep_off();

I thought those problems were fixed. If they aren't, then they most
certainly need to be because holding mutexes over system calls is a
bug.

Well, well:

[252182.603134] ================================================
[252182.604832] [ BUG: lock held when returning to user space! ]
[252182.606086] ------------------------------------------------
[252182.607400] xfs_io/4917 is leaving the kernel with locks still held!
[252182.608905] 1 lock held by xfs_io/4917:
[252182.609739]  #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff812a2aaf>] journal_lock_updates+0xef/0x100

<sigh>

Looks like the problem was fixed for ext4, but not ext3.  Please
report this to the ext3/4 list and get it fixed, don't work around
it here.

> +	/*
> +	 * Freeze in reverse order so filesystems depending on others are
> +	 * frozen in the right order (eg. loopback on ext3).
> +	 */
> +	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
> +		if (!sb->s_root || !sb->s_bdev ||
> +		    (sb->s_frozen == SB_FREEZE_TRANS) ||
> +		    (sb->s_flags & MS_RDONLY))
> +			continue;
> +
> +		freeze_bdev(sb->s_bdev);
> +		sb->s_flags |= MS_FROZEN;
> +	}

AFAIK, that won't work for btrfs - you have to call freeze_super()
directly for btrfs because it has a special relationship with
sb->s_bdev. And besides, all freeze_bdev does is get an active
reference on the superblock and call freeze_super().

Also, that's traversing the list of superblock with locking and
dereferencing the superblock without properly checking that the
superblock is not being torn down. You should probably use
iterate_supers (or at least copy the code), with a function that
drops the s_umount read lock befor calling freeze_super() and then
picks it back up afterwards.

> +
> +	lockdep_on();
> +}
> +
> +/**
> + * thaw_filesystems - Make all filesystems active again.
> + */
> +void thaw_filesystems(void)
> +{
> +	struct super_block *sb;
> +
> +	/*
> +	 * This is necessary for the same reason as in freeze_filesystems()
> +	 * above.
> +	 */
> +	lockdep_off();
> +
> +	list_for_each_entry(sb, &super_blocks, s_list)
> +		if (sb->s_flags & MS_FROZEN) {
> +			sb->s_flags &= ~MS_FROZEN;
> +			thaw_bdev(sb->s_bdev, sb);
> +		}

And once again, iterate_supers() is what you want here. And you
should only call thaw_bdev() as it needs to do checks other than
checking MS_FROZEN e.g. the above will unfreeze filesystems that
were already frozen at the time a suspend occurs, and that could
lead to corruption depending on why the filesystem was frozen...

Also, you still need to check for a valid sb->s_bdev here, otherwise
<splat>.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply

* [PATCH] PM / Freezer: Freeze filesystems while freezing processes (v2)
From: Rafael J. Wysocki @ 2011-08-06 21:17 UTC (permalink / raw)
  To: Linux PM mailing list
  Cc: Christoph, Theodore Ts'o, Dave Chinner, LKML, xfs,
	Christoph Hellwig, linux-fsdevel, linux-ext4
In-Reply-To: <201108050025.09792.rjw@sisk.pl>

From: Rafael J. Wysocki <rjw@sisk.pl>

Freeze all filesystems during the freezing of tasks by calling
freeze_bdev() for each of them and thaw them during the thawing
of tasks with the help of thaw_bdev().

This is needed by hibernation, because some filesystems (e.g. XFS)
deadlock with the preallocation of memory used by it if the memory
pressure caused by it is too heavy.

The additional benefit of this change is that, if something goes
wrong after filesystems have been frozen, they will stay in a
consistent state and journal replays won't be necessary (e.g. after
a failing suspend or resume).  In particular, this should help to
solve a long-standing issue that in some cases during resume from
hibernation the boot loader causes the journal to be replied for the
filesystem containing the kernel image and initrd causing it to
become inconsistent with the information stored in the hibernation
image.

This change is based on earlier work by Nigel Cunningham.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---

OK, so nobody except for Pavel appears to have any comments, so I assume
that everyone except for Pavel is fine with the approach, interestingly enough.

I've removed the MS_FROZEN Pavel complained about from freeze_filesystems()
and added comments explaining why lockdep_off/on() are used.

Thanks,
Rafael

---
 fs/block_dev.c         |   56 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h     |    6 +++++
 kernel/power/process.c |    7 +++++-
 3 files changed, 68 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -211,6 +211,7 @@ struct inodes_stat_t {
 #define MS_KERNMOUNT	(1<<22) /* this is a kern_mount call */
 #define MS_I_VERSION	(1<<23) /* Update inode I_version field */
 #define MS_STRICTATIME	(1<<24) /* Always perform atime updates */
+#define MS_FROZEN	(1<<25) /* bdev has been frozen */
 #define MS_NOSEC	(1<<28)
 #define MS_BORN		(1<<29)
 #define MS_ACTIVE	(1<<30)
@@ -2047,6 +2048,8 @@ extern struct super_block *freeze_bdev(s
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
+extern void freeze_filesystems(void);
+extern void thaw_filesystems(void);
 #else
 static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
@@ -2061,6 +2064,9 @@ static inline int thaw_bdev(struct block
 {
 	return 0;
 }
+
+static inline void freeze_filesystems(void) {}
+static inline void thaw_filesystems(void) {}
 #endif
 extern int sync_filesystem(struct super_block *);
 extern const struct file_operations def_blk_fops;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c
+++ linux-2.6/fs/block_dev.c
@@ -314,6 +314,62 @@ out:
 }
 EXPORT_SYMBOL(thaw_bdev);
 
+/**
+ * freeze_filesystems - Force all filesystems into a consistent state.
+ */
+void freeze_filesystems(void)
+{
+	struct super_block *sb;
+
+	/*
+	 * This is necessary, because some filesystems (e.g. ext3) lock
+	 * mutexes in their .freeze_fs() callbacks and leave them locked for
+	 * their .unfreeze_fs() callbacks to unlock.  This is done under
+	 * bdev->bd_fsfreeze_mutex, which is then released, but it makes
+	 * lockdep think something may be wrong when freeze_bdev() attempts
+	 * to acquire bdev->bd_fsfreeze_mutex for the next filesystem.
+	 */
+	lockdep_off();
+
+	/*
+	 * Freeze in reverse order so filesystems depending on others are
+	 * frozen in the right order (eg. loopback on ext3).
+	 */
+	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+		if (!sb->s_root || !sb->s_bdev ||
+		    (sb->s_frozen == SB_FREEZE_TRANS) ||
+		    (sb->s_flags & MS_RDONLY))
+			continue;
+
+		freeze_bdev(sb->s_bdev);
+		sb->s_flags |= MS_FROZEN;
+	}
+
+	lockdep_on();
+}
+
+/**
+ * thaw_filesystems - Make all filesystems active again.
+ */
+void thaw_filesystems(void)
+{
+	struct super_block *sb;
+
+	/*
+	 * This is necessary for the same reason as in freeze_filesystems()
+	 * above.
+	 */
+	lockdep_off();
+
+	list_for_each_entry(sb, &super_blocks, s_list)
+		if (sb->s_flags & MS_FROZEN) {
+			sb->s_flags &= ~MS_FROZEN;
+			thaw_bdev(sb->s_bdev, sb);
+		}
+
+	lockdep_on();
+}
+
 static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
 {
 	return block_write_full_page(page, blkdev_get_block, wbc);
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -12,10 +12,10 @@
 #include <linux/oom.h>
 #include <linux/suspend.h>
 #include <linux/module.h>
-#include <linux/syscalls.h>
 #include <linux/freezer.h>
 #include <linux/delay.h>
 #include <linux/workqueue.h>
+#include <linux/fs.h>
 
 /* 
  * Timeout for stopping processes
@@ -147,6 +147,10 @@ int freeze_processes(void)
 		goto Exit;
 	printk("done.\n");
 
+	pr_info("Freezing filesystems ... ");
+	freeze_filesystems();
+	pr_info("done.\n");
+
 	printk("Freezing remaining freezable tasks ... ");
 	error = try_to_freeze_tasks(false);
 	if (error)
@@ -188,6 +192,7 @@ void thaw_processes(void)
 	printk("Restarting tasks ... ");
 	thaw_workqueues();
 	thaw_tasks(true);
+	thaw_filesystems();
 	thaw_tasks(false);
 	schedule();
 	printk("done.\n");

^ permalink raw reply

* [GIT PULL] Two power management fixes for 3.1
From: Rafael J. Wysocki @ 2011-08-06 19:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux PM mailing list, LKML

Hi Linus,

Please pull two power management fixes for 3.1 from:

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6.git pm-fixes

The first one fixes a mistake in (recently merged) pm_genpd_poweron() and
the second one makes the runtime PM framework be a bit more friendly to
driver writers.

Thanks!


 Documentation/power/runtime_pm.txt |   10 +++++-----
 drivers/base/power/domain.c        |    3 +--
 drivers/base/power/runtime.c       |   10 ++++++++--
 3 files changed, 14 insertions(+), 9 deletions(-)

---------------

Kevin Hilman (1):
      PM / Runtime: Allow _put_sync() from interrupts-disabled context

Rafael J. Wysocki (1):
      PM / Domains: Fix pm_genpd_poweron()

^ permalink raw reply

* Re: [PATCH 02/11] PM: extend PM QoS with per-device wake-up constraints
From: Rafael J. Wysocki @ 2011-08-06 19:46 UTC (permalink / raw)
  To: Mark Brown; +Cc: Linux PM mailing list, linux-omap, Jean Pihet, markgross
In-Reply-To: <20110806033711.GA18987@opensource.wolfsonmicro.com>

On Saturday, August 06, 2011, Mark Brown wrote:
> On Fri, Aug 05, 2011 at 09:37:36PM +0200, Rafael J. Wysocki wrote:
> > On Friday, August 05, 2011, Mark Brown wrote:
> 
> > > Do you have any examples of this that aren't better expressed in device
> > > specific terms?
> 
> > I'm not sure what you mean exactly, but if you take two PC-like systems
> > with similar hardware configurations, but different BIOS-es and motherboard
> > layouts, it's very likely that on one of them PCI PME won't be routed
> > correctly, for example.  In that case the kernel has no way to figure out
> > that that system is broken, the problem can only be worked around from user
> > space by diabling runtime PM on the affected PCI devices.  I expect similar
> > problems to appear for the PM QoS settings.
> 
> I wouldn't say we've got to rely on userspace here - it seems like we
> ought to be able to use DMI or other system data to identify the
> affected systems and activate the workarounds.

That's only practical on systems where the kernel can be rebuilt.
Moreover, if that affects individual devices, using DMI-based blacklists
is not really practical at all.

> > > It's not that users don't know what they're doing, it's
> > > that working around system integration and stability issues in userspace
> > > isn't really progressing things well or helping with maintainability.
> 
> > No, it's not, but sometimes we simply don't have the choice.  Besides,
> > in the particular case of PM QoS, the constraints set by user space will
> > simply be added to the constraints set by kernel subsystems.  Thus they
> > won't prevent any kernel subsystem from specifying its own constraints,
> > but they will give user space the option to override the constraints
> > originating from the kernel.
> 
> You're right that it doesn't stop the kernel doing anything, the concern
> is that people just won't bother making the kernel work properly and
> will just do their power management in userspace and not bother fixing
> the kernel.

I wouldn't call it "fixing the kernel".  I'd rather say "putting workarounds
into the kernel", which I'm not sure is the right thing to at least in some
cases.

> Punting to userspace seems like it is creating the
> expectation that we can't make the kernel work and isn't great from a
> usability perspective since users shouldn't really be worrying about bus
> performance or so on, it's not something that's visible at the level
> applications work at.

However, platform builders may want to fine tuned things and I'm not sure
we should require them to patch the kernel for this purpose.

> > > Generally if the user has sufficient access to be able to do anything
> > > with this stuff they've got just as much access to the kernel as to
> > > userspace.
> 
> > Do you mean they may rebuild the kernel?  That isn't always possible.
> 
> I'm not sure I can see a lot of cases where you'd have root access and
> not be able to do kernel updates if required?  Having stuff in debugfs
> for diagnostics doesn't strike me as a problem if that's the issue.

Root access doesn't necessarily mean you have all of the requisite
tools (like compilers etc.) and installing them isn't always trivial
(think of systems like phones etc.), let alone building the kernel from
sources (where you don't necessarily know the original .config used for
building your device's kernel).

IOW, I don't buy the "you can always rebuild the kernel if necessary"
argument.  It simply is not true in general.

Thanks,
Rafael

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Rafael J. Wysocki @ 2011-08-06 19:36 UTC (permalink / raw)
  To: Alan Stern; +Cc: Len Brown, linux-pm, Jesper Juhl, linux-kernel
In-Reply-To: <Pine.LNX.4.44L0.1108051641040.2055-100000@iolanthe.rowland.org>

On Friday, August 05, 2011, Alan Stern wrote:
> On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:
> 
> > > You're right, sorry.  System suspend was not involved, so the problem
> > > was triggerend by runtime suspend alone, resulting from unplugging the
> > > Ethernet cable.
> > > 
> > > It looks like the crash started in pci_disable_msi().
> > 
> > OTOH, I'm not sure how pci_legacy_suspend_late() was called.
> > In theory it is only called by the system suspend code and it surely
> > is not called for e1000e.
> 
> The ? indicates that pci_legacy_suspend_late() wasn't necessarily on
> the task's call stack; it may simply have been a leftover pointer from
> some other operation, sitting somewhere in the middle of the stack
> page.

However, Jesper said the crash happend during suspend, so I wonder how
that might be present on the stack while executing runtime PM routines?

Rafael

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Jesper Juhl @ 2011-08-06  8:56 UTC (permalink / raw)
  To: Alan Stern; +Cc: Len Brown, linux-pm, linux-kernel
In-Reply-To: <Pine.LNX.4.44L0.1108051641040.2055-100000@iolanthe.rowland.org>

On Fri, 5 Aug 2011, Alan Stern wrote:

> On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:
> 
> > > You're right, sorry.  System suspend was not involved, so the problem
> > > was triggerend by runtime suspend alone, resulting from unplugging the
> > > Ethernet cable.
> > > 
> > > It looks like the crash started in pci_disable_msi().
> > 
> > OTOH, I'm not sure how pci_legacy_suspend_late() was called.
> > In theory it is only called by the system suspend code and it surely
> > is not called for e1000e.
> 
> The ? indicates that pci_legacy_suspend_late() wasn't necessarily on
> the task's call stack; it may simply have been a leftover pointer from
> some other operation, sitting somewhere in the middle of the stack
> page.
> 

Not sure if it helps, but I've been trying to reproduce the crash and 
although I've not succeeded doing that yet, I've got something else.

While trying to reproduce by unplugging/replugging ethernet and 
resuming/suspending a lot I found the following in dmesg after one of the 
resumes - the system seems to be working fine though and network works 
as well, so it's not the same complete crash but it looks 
potentially related :

[ 6564.921192] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100103, writing 0x100107)
[ 6564.921222] ehci_hcd 0000:00:1a.0: restoring config space at offset 0xf (was 0x400, writing 0x40b)
[ 6564.921240] ehci_hcd 0000:00:1a.0: restoring config space at offset 0x4 (was 0x0, writing 0xf2828000)
[ 6564.921247] ehci_hcd 0000:00:1a.0: restoring config space at offset 0x1 (was 0x2900000, writing 0x2900102)
[ 6564.937545] ehci_hcd 0000:00:1a.0: power state changed by ACPI to D0
[ 6564.937554] ehci_hcd 0000:00:1a.0: power state changed by ACPI to D0
[ 6564.937708] pcieport 0000:00:1c.1: restoring config space at offset 0x7 (was 0xf0, writing 0x200000f0)
[ 6564.937805] pcieport 0000:00:1c.4: restoring config space at offset 0xf (was 0x100, writing 0x4010b)
[ 6564.937815] pcieport 0000:00:1c.4: restoring config space at offset 0x9 (was 0x10001, writing 0x1fff1)
[ 6564.937820] pcieport 0000:00:1c.4: restoring config space at offset 0x8 (was 0x0, writing 0xf250f250)
[ 6564.937824] pcieport 0000:00:1c.4: restoring config space at offset 0x7 (was 0x20000000, writing 0x200000f0)
[ 6564.937832] pcieport 0000:00:1c.4: restoring config space at offset 0x3 (was 0x810000, writing 0x810010)
[ 6564.937838] pcieport 0000:00:1c.4: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[ 6564.937872] ehci_hcd 0000:00:1d.0: restoring config space at offset 0xf (was 0x400, writing 0x40b)
[ 6564.937890] ehci_hcd 0000:00:1d.0: restoring config space at offset 0x4 (was 0x0, writing 0xf2828400)
[ 6564.937897] ehci_hcd 0000:00:1d.0: restoring config space at offset 0x1 (was 0x2900000, writing 0x2900102)
[ 6564.937915] ehci_hcd 0000:00:1d.0: power state changed by ACPI to D0
[ 6564.937918] ehci_hcd 0000:00:1d.0: power state changed by ACPI to D0
[ 6564.938019] ahci 0000:00:1f.2: restoring config space at offset 0x1 (was 0x2b00007, writing 0x2b00407)
[ 6564.938085] intel ips 0000:00:1f.6: restoring config space at offset 0xf (was 0x400, writing 0x40b)
[ 6564.938107] intel ips 0000:00:1f.6: restoring config space at offset 0x1 (was 0x100000, writing 0x100002)
[ 6564.938674] PM: early resume of devices complete after 17.784 msecs[ 6564.939048] sdhci-pci 0000:0d:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 6564.939053] iwlagn 0000:03:00.0: RF_KILL bit toggled to disable radio.
[ 6564.939118] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6564.942219] ehci_hcd 0000:00:1a.0: power state changed by ACPI to D0
[ 6564.942229] pci 0000:00:1e.0: setting latency timer to 64
[ 6564.942236] ehci_hcd 0000:00:1d.0: power state changed by ACPI to D0
[ 6564.942246] HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
[ 6564.942253] HDA Intel 0000:00:1b.0: setting latency timer to 64
[ 6564.942261] ehci_hcd 0000:00:1a.0: power state changed by ACPI to D0
[ 6564.942265] ahci 0000:00:1f.2: setting latency timer to 64
[ 6564.942276] ehci_hcd 0000:00:1a.0: PCI INT D -> GSI 23 (level, low) -> IRQ 23
[ 6564.942344] ehci_hcd 0000:00:1a.0: setting latency timer to 64
[ 6564.942360] HDA Intel 0000:00:1b.0: irq 44 for MSI/MSI-X
[ 6564.942436] ehci_hcd 0000:00:1d.0: power state changed by ACPI to D0
[ 6564.942444] ehci_hcd 0000:00:1d.0: PCI INT D -> GSI 19 (level, low) -> IRQ 19
[ 6564.942452] ehci_hcd 0000:00:1d.0: setting latency timer to 64
[ 6564.942456] sd 0:0:0:0: [sda] Starting disk
[ 6564.942458] i915 0000:00:02.0: power state changed by ACPI to D0
[ 6564.942463] i915 0000:00:02.0: power state changed by ACPI to D0
[ 6564.942466] i915 0000:00:02.0: setting latency timer to 64
[ 6564.997449] firewire_ohci 0000:0d:00.3: irq 45 for MSI/MSI-X
[ 6564.997785] firewire_core: skipped bus generations, destroying all nodes
[ 6565.123881] usb 2-1.1: reset full speed USB device number 3 using ehci_hcd
[ 6565.256738] ata6: SATA link down (SStatus 0 SControl 300)
[ 6565.256782] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 6565.259348] ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 (IDLE) succeeded
[ 6565.259563] ata2.00: ACPI cmd e3/00:02:00:00:00:a0 (IDLE) succeeded
[ 6565.263747] ata2.00: ACPI cmd e3/00:1f:00:00:00:a0 (IDLE) succeeded
[ 6565.263990] ata2.00: ACPI cmd e3/00:02:00:00:00:a0 (IDLE) succeeded
[ 6565.265544] ata2.00: configured for UDMA/66
[ 6565.276723] ata5: SATA link down (SStatus 0 SControl 300)
[ 6565.439619] sdhci-pci 0000:0d:00.0: Will use DMA mode even though HW doesn't fully claim to support it.
[ 6565.439631] sdhci-pci 0000:0d:00.0: setting latency timer to 64
[ 6565.463009] usb 1-1.2: reset low speed USB device number 3 using ehci_hcd
[ 6565.496163] firewire_core: rediscovered device fw0
[ 6565.802321] usb 1-1.3: reset full speed USB device number 4 using ehci_hcd
[ 6565.951782] usb 1-1.6: reset high speed USB device number 6 using ehci_hcd
[ 6566.014871] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 6566.021871] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 6566.021877] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 6566.022178] ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 (SET FEATURES) succeeded
[ 6566.022183] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 6566.024813] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 6566.024819] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 6566.025080] ata1.00: ACPI cmd ef/5f:00:00:00:00:a0 (SET FEATURES) succeeded
[ 6566.025085] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[ 6566.026327] ata1.00: configured for UDMA/100
[ 6566.104911] usb 2-1.1.4: reset full speed USB device number 4 using ehci_hcd
[ 6566.190354] PM: resume of devices complete after 1254.565 msecs
[ 6566.191321] PM: Finishing wakeup.
[ 6566.191323] Restarting tasks ... done.
[ 6566.199706] video LNXVIDEO:00: Restoring backlight state
[ 6566.661832] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx
[ 6566.661839] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[ 6566.662160] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 6568.525731] e1000e 0000:00:19.0: PME# enabled
[ 6568.551863] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf2600000-0xf261ffff] (PCI address [0xf2600000-0xf261ffff])
[ 6568.551876] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf2625000-0xf2625fff] (PCI address [0xf2625000-0xf2625fff])
[ 6568.551886] e1000e 0000:00:19.0: BAR 2: set to [io  0x1820-0x183f] (PCI address [0x1820-0x183f])
[ 6568.551909] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[ 6568.551936] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[ 6568.551999] e1000e 0000:00:19.0: PME# disabled
[ 6568.645183] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6568.698271] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6568.699163] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 6568.814811] e1000e 0000:00:19.0: PME# enabled
[ 6568.844535] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf2600000-0xf261ffff] (PCI address [0xf2600000-0xf261ffff])
[ 6568.844548] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf2625000-0xf2625fff] (PCI address [0xf2625000-0xf2625fff])
[ 6568.844558] e1000e 0000:00:19.0: BAR 2: set to [io  0x1820-0x183f] (PCI address [0x1820-0x183f])
[ 6568.844580] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[ 6568.844607] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100507)
[ 6568.844668] e1000e 0000:00:19.0: PME# disabled
[ 6568.937809] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6568.990871] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6568.991756] ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 6569.991674] ------------[ cut here ]------------
[ 6569.991694] WARNING: at drivers/net/e1000e/netdev.c:5221 __e1000_shutdown+0x43f/0x4c0 [e1000e]()
[ 6569.991698] Hardware name: 4384GJG
[ 6569.991699] Modules linked in: snd_seq snd_seq_device aesni_intel cryptd aes_x86_64 aes_generic ipv6 uvcvideo videodev media v4l2_compat_ioctl32 usbhid hid snd_hda_codec_hdmi btusb bluetooth snd_hda_codec_conexant arc4 ecb snd_hda_intel snd_hda_codec snd_hwdep e1000e i915 snd_pcm iwlagn ehci_hcd snd_timer drm_kms_helper usbcore snd_page_alloc sdhci_pci sdhci pcspkr mmc_core sg drm vboxnetadp vboxnetflt evdev tpm_tis i2c_algo_bit mac80211 serio_raw psmouse iTCO_wdt vboxdrv firewire_ohci firewire_core i2c_i801 crc_itu_t i2c_core iTCO_vendor_support cfg80211 battery thermal intel_ips wmi tpm intel_agp video ac tpm_bios button intel_gtt cpufreq_ondemand msr acpi_cpufreq freq_table processor mperf thinkpad_acpi rfkill snd soundcore nvram ext4 mbcache jbd2 crc16 sr_mod sd_mod cdrom ahci libah
 ci libata scsi_mod
[ 6569.991778] Pid: 7847, comm: kworker/1:0 Not tainted 2.6.39-ARCH #1
[ 6569.991781] Call Trace:
[ 6569.991792]  [<ffffffff8105b90f>] warn_slowpath_common+0x7f/0xc0
[ 6569.991797]  [<ffffffff8105b96a>] warn_slowpath_null+0x1a/0x20
[ 6569.991804]  [<ffffffffa064df7f>] __e1000_shutdown+0x43f/0x4c0 [e1000e]
[ 6569.991816]  [<ffffffff81012fc9>] ? sched_clock+0x9/0x10
[ 6569.991819]  [<ffffffffa064e039>] e1000_runtime_suspend+0x39/0x50 [e1000e]
[ 6569.991823]  [<ffffffff8100a6cf>] ? __switch_to+0xbf/0x2f0
[ 6569.991828]  [<ffffffff812392dd>] pci_pm_runtime_suspend+0x4d/0x100
[ 6569.991830]  [<ffffffff81239290>] ? pci_legacy_suspend_late+0x100/0x100
[ 6569.991834]  [<ffffffff812e7d52>] rpm_callback+0x42/0x80
[ 6569.991837]  [<ffffffff813e4faf>] ? schedule+0x33f/0xad0
[ 6569.991839]  [<ffffffff812e82fa>] rpm_suspend+0x1da/0x3f0
[ 6569.991843]  [<ffffffff8107701c>] ? queue_delayed_work_on+0xbc/0x150
[ 6569.991846]  [<ffffffff812e9380>] ? pm_schedule_suspend+0xe0/0xe0
[ 6569.991848]  [<ffffffff812e941a>] pm_runtime_work+0x9a/0xc0
[ 6569.991850]  [<ffffffff81077f6e>] process_one_work+0x11e/0x4c0
[ 6569.991852]  [<ffffffff810788ff>] worker_thread+0x15f/0x350
[ 6569.991854]  [<ffffffff810787a0>] ? manage_workers.isra.29+0x230/0x230
[ 6569.991857]  [<ffffffff8107d6ec>] kthread+0x8c/0xa0
[ 6569.991860]  [<ffffffff813e9fe4>] kernel_thread_helper+0x4/0x10
[ 6569.991862]  [<ffffffff8107d660>] ? kthread_worker_fn+0x190/0x190
[ 6569.991864]  [<ffffffff813e9fe0>] ? gs_change+0x13/0x13
[ 6569.991865] ---[ end trace c54917cde5791882 ]---
[ 6570.221032] e1000e 0000:00:19.0: PME# enabled
[ 6570.464118] EXT4-fs (sda3): re-mounted. Opts: commit=0
[ 6570.466279] EXT4-fs (sda1): re-mounted. Opts: commit=0
[ 6570.644945] EXT4-fs (sda4): re-mounted. Opts: commit=0
[ 6571.823780] e1000e 0000:00:19.0: BAR 0: set to [mem 0xf2600000-0xf261ffff] (PCI address [0xf2600000-0xf261ffff])
[ 6571.823791] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf2625000-0xf2625fff] (PCI address [0xf2625000-0xf2625fff])
[ 6571.823800] e1000e 0000:00:19.0: BAR 2: set to [io  0x1820-0x183f] (PCI address [0x1820-0x183f])
[ 6571.823821] e1000e 0000:00:19.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[ 6571.823847] e1000e 0000:00:19.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[ 6571.823885] e1000e 0000:00:19.0: PME# disabled
[ 6571.823965] e1000e 0000:00:19.0: irq 41 for MSI/MSI-X
[ 6571.826445] e1000e 0000:00:19.0: eth0: PHY Wakeup cause - Link Status  Change
[ 6573.521494] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx
[ 6573.521500] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[ 6573.521814] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

The kernel is still the same 2.6.39.3

Hope that helps.

I'll keep trying to reproduce the crash - will probably build a 3.0.1 with 
lots of debug options enabled and try to reproduce with that.

-- 
Jesper Juhl <jj@chaosbits.net>       http://www.chaosbits.net/
Don't top-post http://www.catb.org/jargon/html/T/top-post.html
Plain text mails only, please.

^ permalink raw reply

* Re: [PATCH 02/11] PM: extend PM QoS with per-device wake-up constraints
From: Mark Brown @ 2011-08-06  3:37 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux PM mailing list, linux-omap, Jean Pihet, markgross
In-Reply-To: <201108052137.37011.rjw@sisk.pl>

On Fri, Aug 05, 2011 at 09:37:36PM +0200, Rafael J. Wysocki wrote:
> On Friday, August 05, 2011, Mark Brown wrote:

> > Do you have any examples of this that aren't better expressed in device
> > specific terms?

> I'm not sure what you mean exactly, but if you take two PC-like systems
> with similar hardware configurations, but different BIOS-es and motherboard
> layouts, it's very likely that on one of them PCI PME won't be routed
> correctly, for example.  In that case the kernel has no way to figure out
> that that system is broken, the problem can only be worked around from user
> space by diabling runtime PM on the affected PCI devices.  I expect similar
> problems to appear for the PM QoS settings.

I wouldn't say we've got to rely on userspace here - it seems like we
ought to be able to use DMI or other system data to identify the
affected systems and activate the workarounds.

> > It's not that users don't know what they're doing, it's
> > that working around system integration and stability issues in userspace
> > isn't really progressing things well or helping with maintainability.

> No, it's not, but sometimes we simply don't have the choice.  Besides,
> in the particular case of PM QoS, the constraints set by user space will
> simply be added to the constraints set by kernel subsystems.  Thus they
> won't prevent any kernel subsystem from specifying its own constraints,
> but they will give user space the option to override the constraints
> originating from the kernel.

You're right that it doesn't stop the kernel doing anything, the concern
is that people just won't bother making the kernel work properly and
will just do their power management in userspace and not bother fixing
the kernel.  Punting to userspace seems like it is creating the
expectation that we can't make the kernel work and isn't great from a
usability perspective since users shouldn't really be worrying about bus
performance or so on, it's not something that's visible at the level
applications work at.

> > Generally if the user has sufficient access to be able to do anything
> > with this stuff they've got just as much access to the kernel as to
> > userspace.

> Do you mean they may rebuild the kernel?  That isn't always possible.

I'm not sure I can see a lot of cases where you'd have root access and
not be able to do kernel updates if required?  Having stuff in debugfs
for diagnostics doesn't strike me as a problem if that's the issue.

^ permalink raw reply

* Re: [RFC/PATCH v2] PM / Runtime: allow _put_sync() from interrupts-disabled context
From: Kevin Hilman @ 2011-08-05 23:40 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm, linux-omap, linux-arm-kernel, Colin Cross
In-Reply-To: <201108052122.06949.rjw@sisk.pl>

"Rafael J. Wysocki" <rjw@sisk.pl> writes:

> On Friday, August 05, 2011, Kevin Hilman wrote:
>> "Rafael J. Wysocki" <rjw@sisk.pl> writes:
>> 
>> > On Friday, July 22, 2011, Kevin Hilman wrote:
>> >> Currently the use of pm_runtime_put_sync() is not safe from
>> >> interrupts-disabled context because rpm_idle() will release the
>> >> spinlock and enable interrupts for the idle callbacks.  This enables
>> >> interrupts during a time where interrupts were expected to be
>> >> disabled, and can have strange side effects on drivers that expected
>> >> interrupts to be disabled.
>> >> 
>> >> This is not a bug since the documentation clearly states that only
>> >> _put_sync_suspend() is safe in IRQ-safe mode.
>> >> 
>> >> However, pm_runtime_put_sync() could be made safe when in IRQ-safe
>> >> mode by releasing the spinlock but not re-enabling interrupts, which
>> >> is what this patch aims to do.
>> >> 
>> >> Problem was found when using some buggy drivers that set
>> >> pm_runtime_irq_safe() and used _put_sync() in interrupts-disabled
>> >> context.
>> >> 
>> >> The offending drivers have been fixed to use _put_sync_suspend(),
>> >> But this patch is an RFC to see if it might make sense to allow
>> >> using _put_sync() from interrupts-disabled context.
>> >
>> > OK, I'm going to take this for 3.2.
>> 
>> Rafael, 
>> 
>> Since you're planning to merge this, maybe we should consider merging
>> this as a fix for v3.1, and possibly even for v3.0 stable.  That way,
>> any current drivers using irq_safe and the normal _put_sync() will not
>> have this problem.
>
> I think I can push it for 3.1, but I don't think it's stable material.
>

OK, fair enough.

Kevin

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Alan Stern @ 2011-08-05 20:44 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Len Brown, linux-pm, Jesper Juhl, linux-kernel
In-Reply-To: <201108052236.12404.rjw@sisk.pl>

On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:

> > You're right, sorry.  System suspend was not involved, so the problem
> > was triggerend by runtime suspend alone, resulting from unplugging the
> > Ethernet cable.
> > 
> > It looks like the crash started in pci_disable_msi().
> 
> OTOH, I'm not sure how pci_legacy_suspend_late() was called.
> In theory it is only called by the system suspend code and it surely
> is not called for e1000e.

The ? indicates that pci_legacy_suspend_late() wasn't necessarily on
the task's call stack; it may simply have been a leftover pointer from
some other operation, sitting somewhere in the middle of the stack
page.

Alan Stern

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Rafael J. Wysocki @ 2011-08-05 20:36 UTC (permalink / raw)
  To: Alan Stern; +Cc: Len Brown, linux-pm, Jesper Juhl, linux-kernel
In-Reply-To: <201108052230.48862.rjw@sisk.pl>

On Friday, August 05, 2011, Rafael J. Wysocki wrote:
> On Friday, August 05, 2011, Alan Stern wrote:
> > On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:
> > 
> > > > > I don't know how to fix it yet, but I think I know what the problem is.
> > > > > Namely, a runtime suspend of the Ethernet adapter as occured in parallel with
> > > > > the system-wide suspend and they clashed.  The runtime suspend has probably
> > > > > been provoked by detaching the Ethernet cable from the box.
> > > > 
> > > > For them to clash in that way would mean that the PM workqueue didn't
> > > > get frozen.
> > > 
> > > Not necessarily, I think.  It's sufficient that system suspend is started
> > > while the runtime PM operation is in progress.
> > 
> > How so?  The system suspend doesn't call down to the subsystems or 
> > drivers until after all the threads are frozen.  In this case the stack 
> > dump shows that the runtime PM operation was asynchronous, running in a 
> > workqueue thread (pm_runtime_work is one of the routines near the end 
> > of the stack dump).  The thread wouldn't have frozen until the runtime 
> > PM operation was complete.
> 
> You're right, sorry.  System suspend was not involved, so the problem
> was triggerend by runtime suspend alone, resulting from unplugging the
> Ethernet cable.
> 
> It looks like the crash started in pci_disable_msi().

OTOH, I'm not sure how pci_legacy_suspend_late() was called.
In theory it is only called by the system suspend code and it surely
is not called for e1000e.

Thanks,
Rafael

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Rafael J. Wysocki @ 2011-08-05 20:30 UTC (permalink / raw)
  To: Alan Stern; +Cc: Len Brown, linux-pm, Jesper Juhl, linux-kernel
In-Reply-To: <Pine.LNX.4.44L0.1108051617430.2055-100000@iolanthe.rowland.org>

On Friday, August 05, 2011, Alan Stern wrote:
> On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:
> 
> > > > I don't know how to fix it yet, but I think I know what the problem is.
> > > > Namely, a runtime suspend of the Ethernet adapter as occured in parallel with
> > > > the system-wide suspend and they clashed.  The runtime suspend has probably
> > > > been provoked by detaching the Ethernet cable from the box.
> > > 
> > > For them to clash in that way would mean that the PM workqueue didn't
> > > get frozen.
> > 
> > Not necessarily, I think.  It's sufficient that system suspend is started
> > while the runtime PM operation is in progress.
> 
> How so?  The system suspend doesn't call down to the subsystems or 
> drivers until after all the threads are frozen.  In this case the stack 
> dump shows that the runtime PM operation was asynchronous, running in a 
> workqueue thread (pm_runtime_work is one of the routines near the end 
> of the stack dump).  The thread wouldn't have frozen until the runtime 
> PM operation was complete.

You're right, sorry.  System suspend was not involved, so the problem
was triggerend by runtime suspend alone, resulting from unplugging the
Ethernet cable.

It looks like the crash started in pci_disable_msi().

Thanks,
Rafael

^ permalink raw reply

* Re: Crash when suspending Lenovo T510 laptop (2.6.39.3)
From: Alan Stern @ 2011-08-05 20:21 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Len Brown, linux-pm, Jesper Juhl, linux-kernel
In-Reply-To: <201108052213.40969.rjw@sisk.pl>

On Fri, 5 Aug 2011, Rafael J. Wysocki wrote:

> > > I don't know how to fix it yet, but I think I know what the problem is.
> > > Namely, a runtime suspend of the Ethernet adapter as occured in parallel with
> > > the system-wide suspend and they clashed.  The runtime suspend has probably
> > > been provoked by detaching the Ethernet cable from the box.
> > 
> > For them to clash in that way would mean that the PM workqueue didn't
> > get frozen.
> 
> Not necessarily, I think.  It's sufficient that system suspend is started
> while the runtime PM operation is in progress.

How so?  The system suspend doesn't call down to the subsystems or 
drivers until after all the threads are frozen.  In this case the stack 
dump shows that the runtime PM operation was asynchronous, running in a 
workqueue thread (pm_runtime_work is one of the routines near the end 
of the stack dump).  The thread wouldn't have frozen until the runtime 
PM operation was complete.

Alan Stern

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox