Linux Power Management development

Linux Power Management development
 help / color / mirror / Atom feed

* Re: [PATCH] PM / Qos: Ensure device not in PRM_SUSPENDED when pm qos flags request functions are invoked in the pm core.
From: Lan Tianyu @ 2012-11-11 12:08 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: stern, linux-pm, linux-acpi
In-Reply-To: <5096852C.3000707@intel.com>

On 2012/11/4 23:09, Lan Tianyu wrote:
> On 2012/11/3 4:11, Rafael J. Wysocki wrote:
>>>>>    }
>>>>> > >>   EXPORT_SYMBOL_GPL(dev_pm_qos_expose_flags);
>>>>> > >>@@ -645,7 +649,9 @@ void dev_pm_qos_hide_flags(struct device *dev)
>>>>> > >>   {
>>>>> > >>       if (dev->power.qos && dev->power.qos->flags_req) {
>>>>> > >>           pm_qos_sysfs_remove_flags(dev);
>>>>> > >>+        pm_runtime_get_sync(dev);
>>>>> > >>           __dev_pm_qos_drop_user_request(dev, DEV_PM_QOS_FLAGS);
>>>>> > >>+        pm_runtime_put(dev);
>>>> > >
>>>> > >I'm not sure if these two are necessary.  If we remove a request,
>>>> > >then what happens worst case is that some flags will be cleared
>>>> > >effectively which means fewer restrictions on the next sleep state.
>>>> > >Then, it shouldn't hurt that the current sleep state is more
>>>> > >restricted.
>>> >
>>> >But this mean the device can be put into lower power state(power off).
>>> >So why not do that? that can save more power, right?
>> Correct.  On the other hand, though, if the device already is in the
>> deepest low-power state available, we will resume it unnecessarily this
>> way.  Which may not be a big deal, however, and since we do that in other
>> cases, we may as well do it here.
> Yeah. This seems not very reasonable. But we can optimize this
> later.From my previous opinion, add notifier for flags and let device
> driver or bus driver to determine when the device should be resumed.
> Since you said at another email you would remove all notifiers in the pm
> qos to make some functions able to be invoked in interrupt context. I
> have a thought that check the context before call notifiers chain. If it
> was in interrupt, not call notifier chain and trigger a work queue or
> other choices to do that. If not, call the chain. Does this make sense? :)
>
Hi Rafael:
	Do you have some opinions?

>>
>> Thanks,
>> Rafael
>
>


-- 
Best Regards
Tianyu Lan
linux kernel enabling team

^ permalink raw reply

* [PATCH v5 0/6] solve deadlock caused by memory allocation with I/O
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm

This patchset try to solve one deadlock problem which might be caused
by memory allocation with block I/O during runtime PM and block device
error handling path. Traditionly, the problem is addressed by passing
GFP_NOIO statically to mm, but that is not a effective solution, see
detailed description in patch 1's commit log.

This patch set introduces one process flag and trys to fix the deadlock
problem on block device/network device during runtime PM or usb bus reset.

The 1st one is the change on include/sched.h and mm.

The 2nd patch introduces the flag of memalloc_noio on 'dev_pm_info',
and pm_runtime_set_memalloc_noio(), so that PM Core can teach mm to not
allocate mm with GFP_IO during the runtime_resume callback only on
device with the flag set.

The following 2 patches apply the introduced pm_runtime_set_memalloc_noio()
to mark all devices as memalloc_noio_resume in the path from the block or
network device to the root device in device tree.

The last 2 patches are applied again PM and USB subsystem to demonstrate
how to use the introduced mechanism to fix the deadlock problem.

Change logs:
V5:
	- don't clear GFP_FS
	- coding style fix
	- add comments
	- see details in individual change logs

V4:
        - patches from the 2nd to the 6th changed
        - call pm_runtime_set_memalloc_noio() after device_add() as pointed
        by Alan
        - set PF_MEMALLOC_NOIO during runtime_suspend()

V3:
        - patch 2/6 and 5/6 changed, see their commit log
        - remove RFC from title since several guys have expressed that
        it is a reasonable solution
V2:
        - remove changes on 'may_writepage' and 'may_swap'(1/6)
        - unset GFP_IOFS in try_to_free_pages() path(1/6)
        - introduce pm_runtime_set_memalloc_noio()
        - only apply the meachnism on block/network device and its ancestors
        for runtime resume context
V1:
        - take Minchan's change to avoid the check in alloc_page hot path
        - change the helpers' style into save/restore as suggested by Alan
        - memory allocation with no io in usb bus reset path for all devices
        as suggested by Greg and Oliver

 block/genhd.c                |   10 +++++
 drivers/base/power/runtime.c |   92 +++++++++++++++++++++++++++++++++++++++++-
 drivers/usb/core/hub.c       |   13 ++++++
 include/linux/pm.h           |    1 +
 include/linux/pm_runtime.h   |    3 ++
 include/linux/sched.h        |   22 ++++++++++
 mm/page_alloc.c              |    9 ++++-
 mm/vmscan.c                  |    4 +-
 net/core/net-sysfs.c         |    5 +++
 9 files changed, 154 insertions(+), 5 deletions(-)

Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v5 1/6] mm: teach mm by current context info to not do I/O during memory allocation
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei, Jiri Kosina,
	Mel Gorman, KAMEZAWA Hiroyuki, Michal Hocko, Ingo Molnar,
	Peter Zijlstra
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task
to avoid doing I/O inside memory allocation in the task's context.

The patch trys to solve one deadlock problem caused by block device,
and the problem may happen at least in the below situations:

- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one
of its ancestors(or the block device itself), the deadlock may be
triggered inside the memory allocation since it might not complete
until the block device becomes active and the involed page I/O finishes.
The situation is pointed out first by Alan Stern. It is not a good
approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because
several subsystems may be involved(for example, PCI, USB and SCSI may
be involved for usb mass stoarage device, network devices involved too
in the iSCSI case)

- during block device runtime suspend, because runtime resume need
to wait for completion of concurrent runtime suspend.

- during error handling of usb mass storage deivce, USB bus reset
will be put on the device, so there shouldn't have any
memory allocation with GFP_KERNEL during USB bus reset, otherwise
the deadlock similar with above may be triggered. Unfortunately, any
usb device may include one mass storage interface in theory, so it
requires all usb interface drivers to handle the situation. In fact,
most usb drivers don't know how to handle bus reset on the device
and don't provide .pre_set() and .post_reset() callback at all, so
USB core has to unbind and bind driver for these devices. So it
is still not practical to resort to GFP_NOIO for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the
affected path into GFP_NOIO because these functions doing that may be
implemented as library and will be called in many other contexts.

In fact, memalloc_noio_flags() can convert some of current static GFP_NOIO
allocation into GFP_KERNEL back in other non-affected contexts, at least
almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL
after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.

[1], several GFP_KERNEL allocation examples in runtime resume path

- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found.  Unfortunately, this allocation can
only be found by human being now, and there should be many not found
since any function in the resume path(call tree) may allocate memory
with GFP_KERNEL.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
	- use inline instead of macro to define memalloc_noio_*
	- replace memalloc_noio() with memalloc_noio_flags() to
	make code neater
	- don't clear GFP_FS because no GFP_IO means
	that allocation won't enter device driver as pointed by
	Andrew Morton

v4:
	- fix comment
v3:
	- no change
v2:
        - remove changes on 'may_writepage' and 'may_swap' because that
          isn't related with the patchset, and can't introduce I/O in
          allocation path if GFP_IOFS is unset, so handing 'may_swap'
          and may_writepage on GFP_NOIO or GFP_NOFS  should be a
          mm internal thing, and let mm guys deal with that, :-).

          Looks clearing the two may_XXX flag only excludes dirty pages
	  	  and anon pages for relaiming, and the behaviour should be decided
          by GFP FLAG, IMO.

        - unset GFP_IOFS in try_to_free_pages() path since
          alloc_page_buffers()
          and dma_alloc_from_contiguous may drop into the path, as
          pointed by KAMEZAWA Hiroyuki
v1:
        - take Minchan's change to avoid the check in alloc_page hot
          path

        - change the helpers' style into save/restore as suggested by
          Alan Stern
---
 include/linux/sched.h |   22 ++++++++++++++++++++++
 mm/page_alloc.c       |    9 ++++++++-
 mm/vmscan.c           |    4 ++--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2ece18..527f2a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -51,6 +51,7 @@ struct sched_param {
 #include <linux/cred.h>
 #include <linux/llist.h>
 #include <linux/uidgid.h>
+#include <linux/gfp.h>
 
 #include <asm/processor.h>
 
@@ -1807,6 +1808,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -1844,6 +1846,26 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+/* GFP_NOIO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
+static inline gfp_t memalloc_noio_flags(gfp_t flags)
+{
+	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
+		flags &= ~GFP_NOIO;
+	return flags;
+}
+
+static inline gfp_t memalloc_noio_save(void)
+{
+	gfp_t flags = current->flags & PF_MEMALLOC_NOIO;
+	current->flags |= PF_MEMALLOC_NOIO;
+	return flags;
+}
+
+static inline void memalloc_noio_restore(gfp_t flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
+}
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b990cb..b56f763 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2644,10 +2644,17 @@ retry_cpuset:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, migratetype);
-	if (unlikely(!page))
+	if (unlikely(!page)) {
+		/*
+		 * Runtime PM, block IO and its error handling path
+		 * can deadlock because I/O on the device might not
+		 * complete.
+		 */
+		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 370244c..f28919a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2301,7 +2301,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
-		.gfp_mask = gfp_mask,
+		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.may_writepage = !laptop_mode,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
@@ -3308,7 +3308,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
 				       SWAP_CLUSTER_MAX),
-		.gfp_mask = gfp_mask,
+		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 	};
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 2/6] PM / Runtime: introduce pm_runtime_set_memalloc_noio()
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

The patch introduces the flag of memalloc_noio in 'struct dev_pm_info'
to help PM core to teach mm not allocating memory with GFP_KERNEL
flag for avoiding probable deadlock.

As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any one of device in
the path from one block or network device to the root device
in the device tree may cause deadlock, the introduced
pm_runtime_set_memalloc_noio() sets or clears the flag on
device in the path recursively.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
	- fix code style error
	- add comment on clear the device memalloc_noio flag
v4:
	- rename memalloc_noio_resume as memalloc_noio
	- remove pm_runtime_get_memalloc_noio()
	- add comments on pm_runtime_set_memalloc_noio
v3:
	- introduce pm_runtime_get_memalloc_noio()
	- hold one global lock on pm_runtime_set_memalloc_noio
	- hold device power lock when accessing memalloc_noio_resume
	  flag suggested by Alan Stern
	- implement pm_runtime_set_memalloc_noio without recursion
	  suggested by Alan Stern
v2:
	- introduce pm_runtime_set_memalloc_noio()
---
 drivers/base/power/runtime.c |   60 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/pm.h           |    1 +
 include/linux/pm_runtime.h   |    3 +++
 3 files changed, 64 insertions(+)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3148b10..3e198a0 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -124,6 +124,66 @@ unsigned long pm_runtime_autosuspend_expiration(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(pm_runtime_autosuspend_expiration);
 
+static int dev_memalloc_noio(struct device *dev, void *data)
+{
+	return dev->power.memalloc_noio;
+}
+
+/*
+ * pm_runtime_set_memalloc_noio - Set a device's memalloc_noio flag.
+ * @dev: Device to handle.
+ * @enable: True for setting the flag and False for clearing the flag.
+ *
+ * Set the flag for all devices in the path from the device to the
+ * root device in the device tree if @enable is true, otherwise clear
+ * the flag for devices in the path whose siblings don't set the flag.
+ *
+ * The function should only be called by block device, or network
+ * device driver for solving the deadlock problem during runtime
+ * resume/suspend:
+ *
+ *     If memory allocation with GFP_KERNEL is called inside runtime
+ *     resume/suspend callback of any one of its ancestors(or the
+ *     block device itself), the deadlock may be triggered inside the
+ *     memory allocation since it might not complete until the block
+ *     device becomes active and the involed page I/O finishes. The
+ *     situation is pointed out first by Alan Stern. Network device
+ *     are involved in iSCSI kind of situation.
+ *
+ * The lock of dev_hotplug_mutex is held in the function for handling
+ * hotplug race because pm_runtime_set_memalloc_noio() may be called
+ * in async probe().
+ *
+ * The function should be called between device_add() and device_del()
+ * on the affected device(block/network device).
+ */
+void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
+{
+	static DEFINE_MUTEX(dev_hotplug_mutex);
+
+	mutex_lock(&dev_hotplug_mutex);
+	for (;;) {
+		/* hold power lock since bitfield is not SMP-safe. */
+		spin_lock_irq(&dev->power.lock);
+		dev->power.memalloc_noio = enable;
+		spin_unlock_irq(&dev->power.lock);
+
+		dev = dev->parent;
+
+		/*
+		 * clear flag of the parent device only if all the
+		 * children don't set the flag because ancestor's
+		 * flag was set by any one of the descendants.
+		 */
+		if (!dev || (!enable &&
+			     device_for_each_child(dev, NULL,
+						   dev_memalloc_noio)))
+			break;
+	}
+	mutex_unlock(&dev_hotplug_mutex);
+}
+EXPORT_SYMBOL_GPL(pm_runtime_set_memalloc_noio);
+
 /**
  * rpm_check_suspend_allowed - Test whether a device may be suspended.
  * @dev: Device to test.
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 03d7bb1..1a8a69d 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -538,6 +538,7 @@ struct dev_pm_info {
 	unsigned int		irq_safe:1;
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
+	unsigned int		memalloc_noio:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;
diff --git a/include/linux/pm_runtime.h b/include/linux/pm_runtime.h
index f271860..775e063 100644
--- a/include/linux/pm_runtime.h
+++ b/include/linux/pm_runtime.h
@@ -47,6 +47,7 @@ extern void pm_runtime_set_autosuspend_delay(struct device *dev, int delay);
 extern unsigned long pm_runtime_autosuspend_expiration(struct device *dev);
 extern void pm_runtime_update_max_time_suspended(struct device *dev,
 						 s64 delta_ns);
+extern void pm_runtime_set_memalloc_noio(struct device *dev, bool enable);
 
 static inline bool pm_children_suspended(struct device *dev)
 {
@@ -149,6 +150,8 @@ static inline void pm_runtime_set_autosuspend_delay(struct device *dev,
 						int delay) {}
 static inline unsigned long pm_runtime_autosuspend_expiration(
 				struct device *dev) { return 0; }
+static inline void pm_runtime_set_memalloc_noio(struct device *dev,
+						bool enable){}
 
 #endif /* !CONFIG_PM_RUNTIME */
 
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 3/6] block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

This patch applyes the introduced pm_runtime_set_memalloc_noio on
block device so that PM core will teach mm to not allocate memory with
GFP_IOFS when calling the runtime_resume and runtime_suspend callback
for block devices and its ancestors.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
	- fix code style and one typo
v4:
	- call pm_runtime_set_memalloc_noio(ddev, true) after device_add
---
 block/genhd.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 9e02cd6..085cce4 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/idr.h>
 #include <linux/log2.h>
+#include <linux/pm_runtime.h>
 
 #include "blk.h"
 
@@ -532,6 +533,14 @@ static void register_disk(struct gendisk *disk)
 			return;
 		}
 	}
+
+	/*
+	 * avoid probable deadlock caused by allocating memory with
+	 * GFP_KERNEL in runtime_resume callback of its all ancestor
+	 * devices
+	 */
+	pm_runtime_set_memalloc_noio(ddev, true);
+
 	disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
 	disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
 
@@ -661,6 +670,7 @@ void del_gendisk(struct gendisk *disk)
 	disk->driverfs_dev = NULL;
 	if (!sysfs_deprecated)
 		sysfs_remove_link(block_depr, dev_name(disk_to_dev(disk)));
+	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
 EXPORT_SYMBOL(del_gendisk);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 4/6] net/core: apply pm_runtime_set_memalloc_noio on network devices
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei, Eric Dumazet,
	David Decotigny, Tom Herbert, Ingo Molnar
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

Deadlock might be caused by allocating memory with GFP_KERNEL in
runtime_resume and runtime_suspend callback of network devices in
iSCSI situation, so mark network devices and its ancestor as
'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio().

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v4:
	 - call pm_runtime_set_memalloc_noio(ddev, true) after
	   device_add
---
 net/core/net-sysfs.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bcf02f6..a55d255 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -22,6 +22,7 @@
 #include <linux/vmalloc.h>
 #include <linux/export.h>
 #include <linux/jiffies.h>
+#include <linux/pm_runtime.h>
 #include <net/wext.h>
 
 #include "net-sysfs.h"
@@ -1386,6 +1387,8 @@ void netdev_unregister_kobject(struct net_device * net)
 
 	remove_queue_kobjects(net);
 
+	pm_runtime_set_memalloc_noio(dev, false);
+
 	device_del(dev);
 }
 
@@ -1421,6 +1424,8 @@ int netdev_register_kobject(struct net_device *net)
 		return error;
 	}
 
+	pm_runtime_set_memalloc_noio(dev, true);
+
 	return error;
 }
 
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 5/6] PM / Runtime: force memory allocation with no I/O during Runtime PM callbcack
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

This patch applies the introduced memalloc_noio_save() and
memalloc_noio_restore() to force memory allocation with no I/O
during runtime_resume/runtime_suspend callback on device with
the flag of 'memalloc_noio' set.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
	- use inline memalloc_noio_save()
v4:
	- runtime_suspend need this too because rpm_resume may wait for
	completion of concurrent runtime_suspend, so deadlock still may
	be triggered in runtime_suspend path.
---
 drivers/base/power/runtime.c |   32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index 3e198a0..96d99ea 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -371,6 +371,7 @@ static int rpm_suspend(struct device *dev, int rpmflags)
 	int (*callback)(struct device *);
 	struct device *parent = NULL;
 	int retval;
+	unsigned int noio_flag;
 
 	trace_rpm_suspend(dev, rpmflags);
 
@@ -480,7 +481,20 @@ static int rpm_suspend(struct device *dev, int rpmflags)
 	if (!callback && dev->driver && dev->driver->pm)
 		callback = dev->driver->pm->runtime_suspend;
 
-	retval = rpm_callback(callback, dev);
+	/*
+	 * Deadlock might be caused if memory allocation with GFP_KERNEL
+	 * happens inside runtime_suspend callback of one block device's
+	 * ancestor or the block device itself. Network device might be
+	 * thought as part of iSCSI block device, so network device and
+	 * its ancestor should be marked as memalloc_noio.
+	 */
+	if (dev->power.memalloc_noio) {
+		noio_flag = memalloc_noio_save();
+		retval = rpm_callback(callback, dev);
+		memalloc_noio_restore(noio_flag);
+	} else {
+		retval = rpm_callback(callback, dev);
+	}
 	if (retval)
 		goto fail;
 
@@ -563,6 +577,7 @@ static int rpm_resume(struct device *dev, int rpmflags)
 	int (*callback)(struct device *);
 	struct device *parent = NULL;
 	int retval = 0;
+	unsigned int noio_flag;
 
 	trace_rpm_resume(dev, rpmflags);
 
@@ -712,7 +727,20 @@ static int rpm_resume(struct device *dev, int rpmflags)
 	if (!callback && dev->driver && dev->driver->pm)
 		callback = dev->driver->pm->runtime_resume;
 
-	retval = rpm_callback(callback, dev);
+	/*
+	 * Deadlock might be caused if memory allocation with GFP_KERNEL
+	 * happens inside runtime_resume callback of one block device's
+	 * ancestor or the block device itself. Network device might be
+	 * thought as part of iSCSI block device, so network device and
+	 * its ancestor should be marked as memalloc_noio.
+	 */
+	if (dev->power.memalloc_noio) {
+		noio_flag = memalloc_noio_save();
+		retval = rpm_callback(callback, dev);
+		memalloc_noio_restore(noio_flag);
+	} else {
+		retval = rpm_callback(callback, dev);
+	}
 	if (retval) {
 		__update_runtime_status(dev, RPM_SUSPENDED);
 		pm_runtime_cancel_pending(dev);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v5 6/6] USB: forbid memory allocation with I/O during bus reset
From: Ming Lei @ 2012-11-11 12:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei
In-Reply-To: <1352637278-19968-1-git-send-email-ming.lei@canonical.com>

If one storage interface or usb network interface(iSCSI case)
exists in current configuration, memory allocation with
GFP_KERNEL during usb_device_reset() might trigger I/O transfer
on the storage interface itself and cause deadlock because
the 'us->dev_mutex' is held in .pre_reset() and the storage
interface can't do I/O transfer when the reset is triggered
by other interface, or the error handling can't be completed
if the reset is triggered by the storage itself(error handling path).

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
        - use inline memalloc_noio_save()
v4:
	- mark current memalloc_noio for every usb device reset
---
 drivers/usb/core/hub.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c
index 90accde..2d5cc1c 100644
--- a/drivers/usb/core/hub.c
+++ b/drivers/usb/core/hub.c
@@ -5040,6 +5040,7 @@ int usb_reset_device(struct usb_device *udev)
 {
 	int ret;
 	int i;
+	unsigned int noio_flag;
 	struct usb_host_config *config = udev->actconfig;
 
 	if (udev->state == USB_STATE_NOTATTACHED ||
@@ -5049,6 +5050,17 @@ int usb_reset_device(struct usb_device *udev)
 		return -EINVAL;
 	}
 
+	/*
+	 * Don't allocate memory with GFP_KERNEL in current
+	 * context to avoid possible deadlock if usb mass
+	 * storage interface or usbnet interface(iSCSI case)
+	 * is included in current configuration. The easist
+	 * approach is to do it for every device reset,
+	 * because the device 'memalloc_noio' flag may have
+	 * not been set before reseting the usb device.
+	 */
+	noio_flag = memalloc_noio_save();
+
 	/* Prevent autosuspend during the reset */
 	usb_autoresume_device(udev);
 
@@ -5093,6 +5105,7 @@ int usb_reset_device(struct usb_device *udev)
 	}
 
 	usb_autosuspend_device(udev);
+	memalloc_noio_restore(noio_flag);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(usb_reset_device);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH v5 1/6] mm: teach mm by current context info to not do I/O during memory allocation
From: Ming Lei @ 2012-11-11 12:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei, Jiri Kosina,
	Mel Gorman, KAMEZAWA Hiroyuki, Michal Hocko, Ingo Molnar,
	Peter Zijlstra
In-Reply-To: <1352637278-19968-2-git-send-email-ming.lei@canonical.com>

On Sun, Nov 11, 2012 at 8:34 PM, Ming Lei <ming.lei@canonical.com> wrote:
> +/* GFP_NOIO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
> +static inline gfp_t memalloc_noio_flags(gfp_t flags)
> +{
> +       if (unlikely(current->flags & PF_MEMALLOC_NOIO))
> +               flags &= ~GFP_NOIO;
> +       return flags;

Sorry, the above is wrong, and GFP_IO should be cleared, and I will
resend this one.


Thanks,
--
Ming Lei

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v5 1/6] mm: teach mm by current context info to not do I/O during memory allocation
From: Ming Lei @ 2012-11-11 12:42 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alan Stern, Oliver Neukum, Minchan Kim, Greg Kroah-Hartman,
	Rafael J. Wysocki, Jens Axboe, David S. Miller, Andrew Morton,
	netdev, linux-usb, linux-pm, linux-mm, Ming Lei, Jiri Kosina,
	Mel Gorman, KAMEZAWA Hiroyuki, Michal Hocko, Ingo Molnar,
	Peter Zijlstra

This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task
to avoid doing I/O inside memory allocation in the task's context.

The patch trys to solve one deadlock problem caused by block device,
and the problem may happen at least in the below situations:

- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one
of its ancestors(or the block device itself), the deadlock may be
triggered inside the memory allocation since it might not complete
until the block device becomes active and the involed page I/O finishes.
The situation is pointed out first by Alan Stern. It is not a good
approach to convert all GFP_KERNEL[1] in the path into GFP_NOIO because
several subsystems may be involved(for example, PCI, USB and SCSI may
be involved for usb mass stoarage device, network devices involved too
in the iSCSI case)

- during block device runtime suspend, because runtime resume need
to wait for completion of concurrent runtime suspend.

- during error handling of usb mass storage deivce, USB bus reset
will be put on the device, so there shouldn't have any
memory allocation with GFP_KERNEL during USB bus reset, otherwise
the deadlock similar with above may be triggered. Unfortunately, any
usb device may include one mass storage interface in theory, so it
requires all usb interface drivers to handle the situation. In fact,
most usb drivers don't know how to handle bus reset on the device
and don't provide .pre_set() and .post_reset() callback at all, so
USB core has to unbind and bind driver for these devices. So it
is still not practical to resort to GFP_NOIO for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the
affected path into GFP_NOIO because these functions doing that may be
implemented as library and will be called in many other contexts.

In fact, memalloc_noio_flags() can convert some of current static GFP_NOIO
allocation into GFP_KERNEL back in other non-affected contexts, at least
almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL
after applying the approach and make allocation with GFP_NOIO
only happen in runtime resume/bus reset/block I/O transfer contexts
generally.

[1], several GFP_KERNEL allocation examples in runtime resume path

- pci subsystem
acpi_os_allocate
	<-acpi_ut_allocate
		<-ACPI_ALLOCATE_ZEROED
			<-acpi_evaluate_object
				<-__acpi_bus_set_power
					<-acpi_bus_set_power
						<-acpi_pci_set_power_state
							<-platform_pci_set_power_state
								<-pci_platform_power_transition
									<-__pci_complete_power_transition
										<-pci_set_power_state
											<-pci_restore_standard_config
												<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
	<-finish_port_resume
		<-usb_port_resume
			<-generic_resume
				<-usb_resume_device
					<-usb_resume_both
						<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found.  Unfortunately, this allocation can
only be found by human being now, and there should be many not found
since any function in the resume path(call tree) may allocate memory
with GFP_KERNEL.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
v5:
	- use inline instead of macro to define memalloc_noio_*
	- replace memalloc_noio() with memalloc_noio_flags() to
	make code neater
	- don't clear GFP_FS because no GFP_IO means
	that allocation won't enter device driver as pointed by
	Andrew Morton

v4:
	- fix comment
v3:
	- no change
v2:
        - remove changes on 'may_writepage' and 'may_swap' because that
          isn't related with the patchset, and can't introduce I/O in
          allocation path if GFP_IOFS is unset, so handing 'may_swap'
          and may_writepage on GFP_NOIO or GFP_NOFS  should be a
          mm internal thing, and let mm guys deal with that, :-).

          Looks clearing the two may_XXX flag only excludes dirty pages
	  	  and anon pages for relaiming, and the behaviour should be decided
          by GFP FLAG, IMO.

        - unset GFP_IOFS in try_to_free_pages() path since
          alloc_page_buffers()
          and dma_alloc_from_contiguous may drop into the path, as
          pointed by KAMEZAWA Hiroyuki
v1:
        - take Minchan's change to avoid the check in alloc_page hot
          path

        - change the helpers' style into save/restore as suggested by
          Alan Stern
---
 include/linux/sched.h |   22 ++++++++++++++++++++++
 mm/page_alloc.c       |    9 ++++++++-
 mm/vmscan.c           |    4 ++--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2ece18..527f2a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -51,6 +51,7 @@ struct sched_param {
 #include <linux/cred.h>
 #include <linux/llist.h>
 #include <linux/uidgid.h>
+#include <linux/gfp.h>
 
 #include <asm/processor.h>
 
@@ -1807,6 +1808,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
 #define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
 #define PF_KSWAPD	0x00040000	/* I am kswapd */
+#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -1844,6 +1846,26 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+/* GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags */
+static inline gfp_t memalloc_noio_flags(gfp_t flags)
+{
+	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
+		flags &= ~GFP_IO;
+	return flags;
+}
+
+static inline gfp_t memalloc_noio_save(void)
+{
+	gfp_t flags = current->flags & PF_MEMALLOC_NOIO;
+	current->flags |= PF_MEMALLOC_NOIO;
+	return flags;
+}
+
+static inline void memalloc_noio_restore(gfp_t flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
+}
+
 /*
  * task->jobctl flags
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6b990cb..b56f763 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2644,10 +2644,17 @@ retry_cpuset:
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
 			preferred_zone, migratetype);
-	if (unlikely(!page))
+	if (unlikely(!page)) {
+		/*
+		 * Runtime PM, block IO and its error handling path
+		 * can deadlock because I/O on the device might not
+		 * complete.
+		 */
+		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
 				preferred_zone, migratetype);
+	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 370244c..f28919a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2301,7 +2301,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 {
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
-		.gfp_mask = gfp_mask,
+		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.may_writepage = !laptop_mode,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
@@ -3308,7 +3308,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
 				       SWAP_CLUSTER_MAX),
-		.gfp_mask = gfp_mask,
+		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
 		.order = order,
 		.priority = ZONE_RECLAIM_PRIORITY,
 	};
-- 
1.7.9.5


^ permalink raw reply related

* Re: [PATCH] PM / Qos: Ensure device not in PRM_SUSPENDED when pm qos flags request functions are invoked in the pm core.
From: Rafael J. Wysocki @ 2012-11-11 14:43 UTC (permalink / raw)
  To: Lan Tianyu; +Cc: stern, linux-pm, linux-acpi
In-Reply-To: <509F9550.70508@intel.com>

On Sunday, November 11, 2012 08:08:48 PM Lan Tianyu wrote:
> On 2012/11/4 23:09, Lan Tianyu wrote:
> > On 2012/11/3 4:11, Rafael J. Wysocki wrote:
> >>>>>    }
> >>>>> > >>   EXPORT_SYMBOL_GPL(dev_pm_qos_expose_flags);
> >>>>> > >>@@ -645,7 +649,9 @@ void dev_pm_qos_hide_flags(struct device *dev)
> >>>>> > >>   {
> >>>>> > >>       if (dev->power.qos && dev->power.qos->flags_req) {
> >>>>> > >>           pm_qos_sysfs_remove_flags(dev);
> >>>>> > >>+        pm_runtime_get_sync(dev);
> >>>>> > >>           __dev_pm_qos_drop_user_request(dev, DEV_PM_QOS_FLAGS);
> >>>>> > >>+        pm_runtime_put(dev);
> >>>> > >
> >>>> > >I'm not sure if these two are necessary.  If we remove a request,
> >>>> > >then what happens worst case is that some flags will be cleared
> >>>> > >effectively which means fewer restrictions on the next sleep state.
> >>>> > >Then, it shouldn't hurt that the current sleep state is more
> >>>> > >restricted.
> >>> >
> >>> >But this mean the device can be put into lower power state(power off).
> >>> >So why not do that? that can save more power, right?
> >> Correct.  On the other hand, though, if the device already is in the
> >> deepest low-power state available, we will resume it unnecessarily this
> >> way.  Which may not be a big deal, however, and since we do that in other
> >> cases, we may as well do it here.
> > Yeah. This seems not very reasonable. But we can optimize this
> > later.From my previous opinion, add notifier for flags and let device
> > driver or bus driver to determine when the device should be resumed.
> > Since you said at another email you would remove all notifiers in the pm
> > qos to make some functions able to be invoked in interrupt context. I
> > have a thought that check the context before call notifiers chain. If it
> > was in interrupt, not call notifier chain and trigger a work queue or
> > other choices to do that. If not, call the chain. Does this make sense? :)
> >
> Hi Rafael:
> 	Do you have some opinions?

First off, I've applied the last version of this patch. :-)

Second, I don't think we need notifiers at all in the case of device PM QoS
and, moreover, we'd actually be better off without them.

There generally are two reasons for the notifiers in that case.  One reason
is to prevent devices from sleeping too long if they were suspended before
a new PM QoS request has been added (or an existing one has been updated).
The second reason is to allow things like PM domains to recompute their
numbers taking the new request into account.

Now, if whoever modifies/adds PM QoS requests for certain device ensures
that the device is not RPM_SUSPENDED while the PM QoS constraints are
changing, that makes the first reason go away.  The second reason may be
taken care of by changing the PM core to set a (new) flag whenever PM QoS
constraints for a device change, so that whoever cares can take that into
account while the device is suspended next time.  This way all of the
additional computations will only need to happen when devices are suspended
and the code flow will be much easier to follow.

Thanks,
Rafael

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [PATCH 3/6 v4] cpufreq: tolerate inexact values when collecting stats
From: Borislav Petkov @ 2012-11-11 16:38 UTC (permalink / raw)
  To: Mark Langsdorf; +Cc: linux-kernel, cpufreq, linux-pm, MyungJoo Ham
In-Reply-To: <1352313166-28980-4-git-send-email-mark.langsdorf@calxeda.com>

On Wed, Nov 07, 2012 at 12:32:43PM -0600, Mark Langsdorf wrote:
> When collecting stats, if a frequency doesn't match the table, go through
> the table again with both the search frequency and table values shifted
> left by 10 bits.

Why would that second pass succeed?

And why is this in generic code (I'm assuming this is a Calxeda-specific
case)?

-- 
Regards/Gruss,
Boris.

^ permalink raw reply

* Re: Auto reboot when CPU at full load with X86_ACPI_CPUFREQ
From: Pavel Machek @ 2012-11-11 21:00 UTC (permalink / raw)
  To: Drunkard Zhang; +Cc: Rafael J. Wysocki, cpufreq, linux-pm, linux-kernel
In-Reply-To: <CAJiejCT8F18fH9yrtTDfHydFi6TDNdDWf8Ny7OATokf7EM-X1g@mail.gmail.com>

Hi!

> I'm using Intel Xeon X5570 x2 with Asus Z8PE-D18, and experiencing
> auto reboot when CPU full loaded for minutes, like building kernel
> with "make -j17". After a lot of bisecting of config file, I found the
> option leads to the reboot: X86_ACPI_CPUFREQ, both configed
> X86_ACPI_CPUFREQ as a module or built in will lead to reboot.
> 
> Config file finally bisected appended, config-3.7.0-rc3+-bad is the
> one leads to reboot, config-3.7.0-rc3+-ok works OK. Hardware info also
> appended.
> 
> I think it is a bug, anything I can do? When the bug triggered, the
> screen blanked immediately, any advice for me to debug? Happy to match
> to code :-)
> 
> This bug is CPU specific, with Xeon E5606 or E5620 it's all fine, just
> triggered with Xeon X5570, or maybe all Xeon X serial.
> 
> Tested kernel version: 3.1.x, 3.3.x, 3.5.x, 3.6.x, 3.7*, they are all affected.

What does temperature do during those runs? 
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-12  0:37 UTC (permalink / raw)
  To: Alan Stern; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211091131190.1702-100000@iolanthe.rowland.org>

On Fri, 2012-11-09 at 11:41 -0500, Alan Stern wrote:
> On Fri, 9 Nov 2012, Huang Ying wrote:
> 
> > On Thu, 2012-11-08 at 12:07 -0500, Alan Stern wrote:
> > > On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:
> > > 
> > > > > > > is it a good idea to allow to set device state to SUSPENDED if the device
> > > > > > > is disabled?
> > > > > > 
> > > > > > No, it is not.  The status should always be ACTIVE as long as usage_count > 0.
> > > 
> > > That isn't strictly true, because pm_runtime_get_noresume violates this
> > > rule.  What the PM core actually does is prevent a transition from the
> > > ACTIVE state to the SUSPENDING/SUSPENDED state if usage_count > 0,
> > > _provided_ runtime PM is enabled.  There's no such restriction when it
> > > is disabled.
> > 
> > Usage count may be not a issue for the end user.  But "on" in "control"
> > sysfs file + SUSPENDED can be confusing for the end user.  Maybe we need
> > to check dev->power.runtime_auto in pm_runtime_set_suspended().
> 
> You are confusing the issue by raising two separate (though related)
> questions.

Thanks for clarify.

> The first question: How should the PCI subsystem prevent the parents of 
> driverless VGA devices from being runtime suspended while userspace is 
> accessing them?

I think Rafael's patch is good for that.

> The second question: Should the PM core allow devices that are disabled
> for runtime PM to be in the SUSPENDED state when
> dev->power.runtime_auto is clear?

I think that should not be allowed.

> Assuming we don't want to allow this, there's a third question: Should
> pm_runtime_allow call pm_runtime_set_suspended if the device is
> disabled?

Is it absolute necessary to call pm_runtime_set_suspended?  If the
device is disabled, the transition to SUSPENDED state will not be
triggered even if the device is ACTIVE.

Best Regards,
Huang Ying



^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-12  2:36 UTC (permalink / raw)
  To: Huang Ying; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <1352680649.7176.140.camel@yhuang-dev>

On Mon, 12 Nov 2012, Huang Ying wrote:

> > The first question: How should the PCI subsystem prevent the parents of 
> > driverless VGA devices from being runtime suspended while userspace is 
> > accessing them?
> 
> I think Rafael's patch is good for that.

But his patch isn't needed if we make these other changes.

> > The second question: Should the PM core allow devices that are disabled
> > for runtime PM to be in the SUSPENDED state when
> > dev->power.runtime_auto is clear?
> 
> I think that should not be allowed.

Disallowing it is okay with me too.  But it will require several 
changes to the code, more than what your patch did.

> > Assuming we don't want to allow this, there's a third question: Should
> > pm_runtime_allow call pm_runtime_set_suspended if the device is
> > disabled?
> 
> Is it absolute necessary to call pm_runtime_set_suspended?  If the
> device is disabled, the transition to SUSPENDED state will not be
> triggered even if the device is ACTIVE.

It's not absolutely necessary to do this, but we ought to because it 
will allow the device's parent to be suspended.  If we leave the device 
in the ACTIVE state then the parent can't be suspended, even when the 
device is disabled.

Alan Stern

^ permalink raw reply

* Re: Auto reboot when CPU at full load with X86_ACPI_CPUFREQ
From: Drunkard Zhang @ 2012-11-12  2:41 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rafael J. Wysocki, cpufreq, linux-pm, linux-kernel
In-Reply-To: <20121111210053.GA6436@elf.ucw.cz>

2012/11/12 Pavel Machek <pavel@ucw.cz>:
> Hi!
>
>> I'm using Intel Xeon X5570 x2 with Asus Z8PE-D18, and experiencing
>> auto reboot when CPU full loaded for minutes, like building kernel
>> with "make -j17". After a lot of bisecting of config file, I found the
>> option leads to the reboot: X86_ACPI_CPUFREQ, both configed
>> X86_ACPI_CPUFREQ as a module or built in will lead to reboot.
>>
>> Config file finally bisected appended, config-3.7.0-rc3+-bad is the
>> one leads to reboot, config-3.7.0-rc3+-ok works OK. Hardware info also
>> appended.
>>
>> I think it is a bug, anything I can do? When the bug triggered, the
>> screen blanked immediately, any advice for me to debug? Happy to match
>> to code :-)
>>
>> This bug is CPU specific, with Xeon E5606 or E5620 it's all fine, just
>> triggered with Xeon X5570, or maybe all Xeon X serial.
>>
>> Tested kernel version: 3.1.x, 3.3.x, 3.5.x, 3.6.x, 3.7*, they are all affected.
>
> What does temperature do during those runs?
>                                                                 Pavel

CPU temperature didn't over 60 by degrees. I'm sure it's not hardware
faults, because without X86_ACPI_CPUFREQ it works fine, built in or
load as module it will reboot. I also configured some debug options to
catch something, but the reboot is too fast, dam. Any sugests?

^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-12  5:55 UTC (permalink / raw)
  To: Alan Stern; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211112131500.16611-100000@netrider.rowland.org>

On Sun, 2012-11-11 at 21:36 -0500, Alan Stern wrote:
> On Mon, 12 Nov 2012, Huang Ying wrote:
> 
> > > The first question: How should the PCI subsystem prevent the parents of 
> > > driverless VGA devices from being runtime suspended while userspace is 
> > > accessing them?
> > 
> > I think Rafael's patch is good for that.
> 
> But his patch isn't needed if we make these other changes.

Yes.

> > > The second question: Should the PM core allow devices that are disabled
> > > for runtime PM to be in the SUSPENDED state when
> > > dev->power.runtime_auto is clear?
> > 
> > I think that should not be allowed.
> 
> Disallowing it is okay with me too.  But it will require several 
> changes to the code, more than what your patch did.

Yes.  I think so too.

> > > Assuming we don't want to allow this, there's a third question: Should
> > > pm_runtime_allow call pm_runtime_set_suspended if the device is
> > > disabled?
> > 
> > Is it absolute necessary to call pm_runtime_set_suspended?  If the
> > device is disabled, the transition to SUSPENDED state will not be
> > triggered even if the device is ACTIVE.
> 
> It's not absolutely necessary to do this, but we ought to because it 
> will allow the device's parent to be suspended.  If we leave the device 
> in the ACTIVE state then the parent can't be suspended, even when the 
> device is disabled.

I think this is the hard part of the issue.  Now "disabled" and
SUSPENDED state is managed by hand.  IMHO, if we changed
pm_runtime_allow() as you said, we need to change the rule too.  Maybe
something as follow:

- remove pm_runtime_set_suspended/pm_runtime_set_active
- in pm_runtime_disable/pm_runtime_allow, put device into SUSPENDED
state if runtime PM is not forbidden.
- in pm_runtime_forbid/pm_runtime_enable, put device into ACTIVE state.

Best Regards,
Huang Ying



^ permalink raw reply

* Re: [PATCH V5 2/2] Thermal: Add ST-Ericsson DB8500 thermal properties and platform data.
From: Zhang Rui @ 2012-11-12  6:07 UTC (permalink / raw)
  To: hongbo.zhang
  Cc: linux-acpi, linux-kernel, linux-pm, amit.kachhap, patches,
	linaro-dev, linaro-kernel, STEricsson_nomadik_linux, kernel,
	hongbo.zhang
In-Reply-To: <1352460548-3494-3-git-send-email-hongbo.zhang@linaro.com>

On Fri, 2012-11-09 at 19:29 +0800, hongbo.zhang wrote:
> From: "hongbo.zhang" <hongbo.zhang@linaro.com>
> 
> This patch adds device tree properties for ST-Ericsson DB8500 thermal driver,
> also adds the platform data to support the old fashion.
> 
> Signed-off-by: hongbo.zhang <hongbo.zhang@linaro.com>
> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>

hmmm,

who should take this patch?
I'd like to see ACK from the maintainer of these code before applying
them to thermal tree.

thanks,
rui
> ---
>  arch/arm/boot/dts/dbx5x0.dtsi      | 14 +++++++++
>  arch/arm/boot/dts/snowball.dts     | 31 ++++++++++++++++++
>  arch/arm/configs/u8500_defconfig   |  2 ++
>  arch/arm/mach-ux500/board-mop500.c | 64 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 111 insertions(+)
> 
> diff --git a/arch/arm/boot/dts/dbx5x0.dtsi b/arch/arm/boot/dts/dbx5x0.dtsi
> index 4b0e0ca..731086b 100644
> --- a/arch/arm/boot/dts/dbx5x0.dtsi
> +++ b/arch/arm/boot/dts/dbx5x0.dtsi
> @@ -203,6 +203,14 @@
>  				reg = <0x80157450 0xC>;
>  			};
>  
> +			thermal@801573c0 {
> +				compatible = "stericsson,db8500-thermal";
> +				reg = <0x801573c0 0x40>;
> +				interrupts = <21 0x4>, <22 0x4>;
> +				interrupt-names = "IRQ_HOTMON_LOW", "IRQ_HOTMON_HIGH";
> +				status = "disabled";
> +			 };
> +
>  			db8500-prcmu-regulators {
>  				compatible = "stericsson,db8500-prcmu-regulator";
>  
> @@ -660,5 +668,11 @@
>  			ranges = <0 0x50000000 0x4000000>;
>  			status = "disabled";
>  		};
> +
> +		cpufreq-cooling {
> +			compatible = "stericsson,db8500-cpufreq-cooling";
> +			status = "disabled";
> +		 };
> +
>  	};
>  };
> diff --git a/arch/arm/boot/dts/snowball.dts b/arch/arm/boot/dts/snowball.dts
> index 702c0ba..c6f85f0 100644
> --- a/arch/arm/boot/dts/snowball.dts
> +++ b/arch/arm/boot/dts/snowball.dts
> @@ -99,6 +99,33 @@
>  			status = "okay";
>  		};
>  
> +		prcmu@80157000 {
> +			thermal@801573c0 {
> +				num-trips = <4>;
> +
> +				trip0-temp = <70000>;
> +				trip0-type = "active";
> +				trip0-cdev-num = <1>;
> +				trip0-cdev-name0 = "thermal-cpufreq-0";
> +
> +				trip1-temp = <75000>;
> +				trip1-type = "active";
> +				trip1-cdev-num = <1>;
> +				trip1-cdev-name0 = "thermal-cpufreq-0";
> +
> +				trip2-temp = <80000>;
> +				trip2-type = "active";
> +				trip2-cdev-num = <1>;
> +				trip2-cdev-name0 = "thermal-cpufreq-0";
> +
> +				trip3-temp = <85000>;
> +				trip3-type = "critical";
> +				trip3-cdev-num = <0>;
> +
> +				status = "okay";
> +			 };
> +		};
> +
>  		external-bus@50000000 {
>  			status = "okay";
>  
> @@ -183,5 +210,9 @@
>  				reg = <0x33>;
>  			};
>  		};
> +
> +		cpufreq-cooling {
> +			status = "okay";
> +		};
>  	};
>  };
> diff --git a/arch/arm/configs/u8500_defconfig b/arch/arm/configs/u8500_defconfig
> index da68454..250625d 100644
> --- a/arch/arm/configs/u8500_defconfig
> +++ b/arch/arm/configs/u8500_defconfig
> @@ -69,6 +69,8 @@ CONFIG_GPIO_TC3589X=y
>  CONFIG_POWER_SUPPLY=y
>  CONFIG_AB8500_BM=y
>  CONFIG_AB8500_BATTERY_THERM_ON_BATCTRL=y
> +CONFIG_THERMAL=y
> +CONFIG_CPU_THERMAL=y
>  CONFIG_MFD_STMPE=y
>  CONFIG_MFD_TC3589X=y
>  CONFIG_AB5500_CORE=y
> diff --git a/arch/arm/mach-ux500/board-mop500.c b/arch/arm/mach-ux500/board-mop500.c
> index 416d436..b03216b 100644
> --- a/arch/arm/mach-ux500/board-mop500.c
> +++ b/arch/arm/mach-ux500/board-mop500.c
> @@ -16,6 +16,7 @@
>  #include <linux/io.h>
>  #include <linux/i2c.h>
>  #include <linux/platform_data/i2c-nomadik.h>
> +#include <linux/platform_data/db8500_thermal.h>
>  #include <linux/gpio.h>
>  #include <linux/amba/bus.h>
>  #include <linux/amba/pl022.h>
> @@ -229,6 +230,67 @@ static struct ab8500_platform_data ab8500_platdata = {
>  };
>  
>  /*
> + * Thermal Sensor
> + */
> +
> +static struct resource db8500_thsens_resources[] = {
> +	{
> +		.name = "IRQ_HOTMON_LOW",
> +		.start  = IRQ_PRCMU_HOTMON_LOW,
> +		.end    = IRQ_PRCMU_HOTMON_LOW,
> +		.flags  = IORESOURCE_IRQ,
> +	},
> +	{
> +		.name = "IRQ_HOTMON_HIGH",
> +		.start  = IRQ_PRCMU_HOTMON_HIGH,
> +		.end    = IRQ_PRCMU_HOTMON_HIGH,
> +		.flags  = IORESOURCE_IRQ,
> +	},
> +};
> +
> +static struct db8500_thsens_platform_data db8500_thsens_data = {
> +	.trip_points[0] = {
> +		.temp = 70000,
> +		.type = THERMAL_TRIP_ACTIVE,
> +		.cdev_name = {
> +			[0] = "thermal-cpufreq-0",
> +		},
> +	},
> +	.trip_points[1] = {
> +		.temp = 75000,
> +		.type = THERMAL_TRIP_ACTIVE,
> +		.cdev_name = {
> +			[0] = "thermal-cpufreq-0",
> +		},
> +	},
> +	.trip_points[2] = {
> +		.temp = 80000,
> +		.type = THERMAL_TRIP_ACTIVE,
> +		.cdev_name = {
> +			[0] = "thermal-cpufreq-0",
> +		},
> +	},
> +	.trip_points[3] = {
> +		.temp = 85000,
> +		.type = THERMAL_TRIP_CRITICAL,
> +	},
> +	.num_trips = 4,
> +};
> +
> +static struct platform_device u8500_thsens_device = {
> +	.name           = "db8500-thermal",
> +	.resource       = db8500_thsens_resources,
> +	.num_resources  = ARRAY_SIZE(db8500_thsens_resources),
> +	.dev	= {
> +		.platform_data	= &db8500_thsens_data,
> +	},
> +};
> +
> +static struct platform_device u8500_cpufreq_cooling_device = {
> +	.name           = "db8500-cpufreq-cooling",
> +};
> +
> +/*
>   * TPS61052
>   */
>  
> @@ -583,6 +645,8 @@ static struct platform_device *snowball_platform_devs[] __initdata = {
>  	&snowball_key_dev,
>  	&snowball_sbnet_dev,
>  	&snowball_gpio_en_3v3_regulator_dev,
> +	&u8500_thsens_device,
> +	&u8500_cpufreq_cooling_device,
>  };
>  
>  static void __init mop500_init_machine(void)

^ permalink raw reply

* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Daniel Wagner @ 2012-11-12 13:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan, mhocko, rjw, containers, cgroups, linux-kernel, linux-pm,
	fweisbec, Glauber Costa
In-Reply-To: <20121109172211.GB2711@htj.dyndns.org>

Hi Tejun,

On 09.11.2012 18:22, Tejun Heo wrote:
> Hey, Daniel.
>
> On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
>> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
>> cgroup_subsys->post_create()
>>>
>>> Currently, there's no way for a controller to find out whether a new
>>> cgroup finished all ->create() allocatinos successfully and is
>>> considered "live" by cgroup.
>>
>> I'd like add hierarchy support to net_prio and the first thing to
>> do is to get rid of get_prioidx(). It looks like it would be nice to
>
> Ooh, I'm already working on it.  I *think* I should be able to post
> the patches later today or early next week.

No problem. I didn't spend too much time in writing the patch. Getting
rid of get_prioidx() was simple. I just wanted to help out, but then
I might to slow to keep up with you :)

Though I have a question on the patch you are writing. When you disable
broken_hierarchy for the networking controllers, are you also 
considering to disable them on the root cgroup? Currently both
net_prio and net_cls will do work on the root cgroup. I think for
harmonizing the behavior or all controllers this should also be
disabled.

>> be able to use use_id and post_create() for this but as I read the
>> code this might not work because the netdev might access resources
>> allocated between create() and post_create(). So my question is
>> would it make sense to move
>>
>> cgroup_create():
>>
>> 		if (ss->use_id) {
>> 			err = alloc_css_id(ss, parent, cgrp);
>> 			if (err)
>> 				goto err_destroy;
>> 		}
>>
>> part before create() or add some protection between create() and
>> post_create() callback in net_prio. I have a patch but I see
>> I could drop it completely if post_create() is there.
>
> Glauber had about similar question about css_id and I need to think
> more about it but currently I think I want to phase out css IDs.  It's
> an id of the wrong thing (CSSes don't need IDs, cgroups do) and
> unnecessarily duplicates its own hierarchy when the hierarchy of
> cgroups already exists.  Once memcontrol moves away from walking using
> css_ids, I *think* I'll try to kill it.
>
> I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
> number) so that it can be used for cgroup indexing.  Glauber, that
> should solve your problem too, right?

net_prio doesn't have any particular requirements on the indexing, 
unless there is one. So a global ida would work for net_prio.

cheers,
daniel


^ permalink raw reply

* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Srivatsa S. Bhat @ 2012-11-12 16:14 UTC (permalink / raw)
  To: SrinivasPandruvada, akpm@linux-foundation.org, Mel Gorman, mjg59,
	Paul E. McKenney, Dave Hansen, maxime.coquelin, loic.pallardy,
	Arjan van de Ven, kmpark, kamezawa.hiroyu, Len Brown,
	Rafael J. Wysocki
  Cc: linux-pm, Ankita Garg, amit.kachhap, Vaidyanathan Srinivasan,
	thomas.abraham, Santosh Shilimkar, Srivatsa S. Bhat, linux-mm,
	linux-kernel@vger.kernel.org, andi
In-Reply-To: <loom.20121109T172910-394@post.gmane.org>

Hi Srinivas,

It looks like your email did not get delivered to the mailing
lists (and the people in the CC list) properly. So quoting your
entire mail as-it-is here. And thanks a lot for taking a look
at this patchset!

Regards,
Srivatsa S. Bhat

On 11/09/2012 10:18 PM, SrinivasPandruvada wrote:
> I did like this implementation and think it is valuable.
> I am experimenting with one of our HW. This type of partition does help in
> saving power. We believe we can save up-to 1W power per DIM with the help
> of some HW/BIOS changes. We are only talking about content preserving memory,
> so we don't have to be 100% correct.
> In my experiments, I tried two methods:
> - Similar to approach suggested by Mel Gorman. I have a special sticky
> migrate type like CMA.
> - Buddy buckets: Buddies are organized into memory region aware buckets.
> During allocation it prefers higher order buckets. I made sure that there is
> no affect of my change if there are no power saving memory DIMs. The advantage
> of this bucket is that I can keep the memory in close proximity for a related
> task groups by direct hashing to a bucket. The free list if organized as two
> dimensional array with bucket and migrate type for each order.
> 
> In both methods, currently reclaim is targetted to be done by a sysfs interface
> similar to memory compaction for a node allowing user space to initiate reclaim. 
> 
> Thanks,
> Srinivas Pandruvada
> Open Source Technology Center,
> Intel Corp.
> 


^ permalink raw reply

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-12 16:32 UTC (permalink / raw)
  To: Huang Ying; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <1352699744.7176.169.camel@yhuang-dev>

On Mon, 12 Nov 2012, Huang Ying wrote:

> > > Is it absolute necessary to call pm_runtime_set_suspended?  If the
> > > device is disabled, the transition to SUSPENDED state will not be
> > > triggered even if the device is ACTIVE.
> > 
> > It's not absolutely necessary to do this, but we ought to because it 
> > will allow the device's parent to be suspended.  If we leave the device 
> > in the ACTIVE state then the parent can't be suspended, even when the 
> > device is disabled.
> 
> I think this is the hard part of the issue.  Now "disabled" and
> SUSPENDED state is managed by hand.  IMHO, if we changed
> pm_runtime_allow() as you said, we need to change the rule too.  Maybe
> something as follow:
> 
> - remove pm_runtime_set_suspended/pm_runtime_set_active

We can't remove them entirely.  They are needed for situations where 
the device's physical state is different from what the PM core thinks; 
they tell the PM core what the actual state is.

> - in pm_runtime_disable/pm_runtime_allow, put device into SUSPENDED
> state if runtime PM is not forbidden.

That doesn't make sense.  Runtime PM is never forbidden after 
pm_runtime_allow is called; that's its purpose.

> - in pm_runtime_forbid/pm_runtime_enable, put device into ACTIVE state.

Why should pm_runtime_enable put the device into the ACTIVE state?

No, I think a better approach is simply to say that the device will
never be allowed to be in the SUSPENDED state if runtime PM is
forbidden.  We already enforce this when the device is enabled for 
runtime PM, so we would have to start enforcing it when the device is 
disabled.

This means:

	pm_runtime_set_suspended should fail if dev->power.runtime_auto
	is clear.

	pm_runtime_forbid should call pm_runtime_set_active if
	dev->power.disable_depth > 0.  (This would run into a problem
	if the parent is suspended and disabled.  Maybe 
	pm_runtime_forbid should fail when this happens.)

Finally, we probably should make a third change even though it isn't
strictly necessary:

	pm_runtime_allow should call pm_runtime_set_suspended if
	dev->power.disable_depth > 0.

Rafael, what do you think?

Alan Stern

^ permalink raw reply

* Re: [PATCH 3/6 v4] cpufreq: tolerate inexact values when collecting stats
From: Mark Langsdorf @ 2012-11-12 16:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, cpufreq@vger.kernel.org,
	linux-pm@vger.kernel.org, MyungJoo Ham
In-Reply-To: <20121111163821.GA21635@x1.osrc.amd.com>

On 11/11/2012 10:38 AM, Borislav Petkov wrote:
> On Wed, Nov 07, 2012 at 12:32:43PM -0600, Mark Langsdorf wrote:
>> When collecting stats, if a frequency doesn't match the table, go through
>> the table again with both the search frequency and table values shifted
>> left by 10 bits.
> 
> Why would that second pass succeed?

It's effectively a divide by 1024 and minimizes any jitter in the
measured frequency value.

> And why is this in generic code (I'm assuming this is a Calxeda-specific
> case)?

The function is buried pretty deep in the cpufreq_stat code. It didn't
seem appropriate to make it a function pointer as part of struct
cpufreq_driver.

--Mark Langsdorf
Calxeda, Inc.

^ permalink raw reply

* Re: [PATCH v9 03/10] ata: zpodd: identify and init ZPODD devices
From: Tejun Heo @ 2012-11-12 18:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi
In-Reply-To: <1352443922-13734-4-git-send-email-aaron.lu@intel.com>

Hello,

On Fri, Nov 09, 2012 at 02:51:55PM +0800, Aaron Lu wrote:
>  void ata_acpi_unbind(struct ata_device *dev)
>  {
> +	if (zpodd_dev_enabled(dev))
> +		zpodd_deinit(dev);

Maybe zpodd_exit() instead?

> +void zpodd_init(struct ata_device *dev)
> +{
> +	int ret;
> +	struct zpodd *zpodd;
> +
> +	if (dev->private_data)
> +		return;
> +
> +	if (!device_can_poweroff(dev))
> +		return;
> +
> +	if ((ret = check_loading_mechanism(dev)) == -ENODEV)
> +		return;
> +
> +	zpodd = kzalloc(sizeof(struct zpodd), GFP_KERNEL);
> +	if (!zpodd)
> +		return;
> +
> +	if (ret)
> +		zpodd->drawer = true;
> +	else
> +		zpodd->slot = true;
> +
> +	zpodd->dev = dev;
> +	dev->private_data = zpodd;

I don't think you're supposed to use dev->private_data from libata
core layer.  Just add a new field.  Nobody cares about adding 8 more
bytes to struct ata_device and spending 8 more bytes is way better
than muddying the ownership of ->private_data.

> +/* libata-zpodd.c */
> +#ifdef CONFIG_SATA_ZPODD
> +void zpodd_init(struct ata_device *dev);
> +void zpodd_deinit(struct ata_device *dev);
> +static inline bool zpodd_dev_enabled(struct ata_device *dev)
> +{
> +	if (dev->flags & ATA_DFLAG_DA && dev->private_data)
> +		return true;
> +	else
> +		return false;
> +}

And this gets completely wrong.  What if the device supports DA and
low level driver makes use of ->private_data?

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v9 04/10] libata: acpi: move acpi notification code to zpodd
From: Tejun Heo @ 2012-11-12 18:55 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi
In-Reply-To: <1352443922-13734-5-git-send-email-aaron.lu@intel.com>

On Fri, Nov 09, 2012 at 02:51:56PM +0800, Aaron Lu wrote:
> Since the ata acpi notification code introduced in commit
> 3bd46600a7a7e938c54df8cdbac9910668c7dfb0 is solely for ZPODD, and we
> now have a dedicated place for it, move these code there.
> 
> And the add/remove_pm_notifier code is simplified a little bit that it
> does not check things like if the handle is NULL and if a corresponding
> acpi_device is there for the handle as they are guaranteed by the
> device_can_poweroff already.

Please don't mix code movement with actual changes.  It makes it
difficult to track what's going on.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH v9 05/10] libata: separate ATAPI code
From: Tejun Heo @ 2012-11-12 18:57 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi
In-Reply-To: <1352443922-13734-6-git-send-email-aaron.lu@intel.com>

On Fri, Nov 09, 2012 at 02:51:57PM +0800, Aaron Lu wrote:
> The atapi_eh_tur and atapi_eh_request_sense can be reused by ZPODD
> code, so separate them out to a file named libata-atapi.c, and the
> Makefile is modified accordingly. No functional changes should result
> from this commit.

Why is this change necessary?  This doesn't seem to help anything to
me.

Thanks.

-- 
tejun

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox