Linux Power Management development


* Re: [PATCH 1/9 v3] cgroup: add cgroup_subsys->post_create()
From: Li Zefan @ 2012-11-09  9:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mhocko-AlSwsSmVLrQ, rjw-KKrjLPT3xs0,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-pm-u79uwXL29TY76Z2rM5mHXA, fweisbec-Re5JQEeQqe8AvxtiuMwx3w,
	Glauber Costa
In-Reply-To: <20121108190715.GD9672-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>

On 2012/11/9 3:07, Tejun Heo wrote:
> Subject: cgroup: add cgroup_subsys->post_create()
> 
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
> 
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where they can synchronize against new cgroups
> appearing in such walks.
> 
> This patch adds ->post_create().  It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
> 
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance.  It will be explained with the descendant iterators.
> 
> v2: Added a paragraph about its future use w/ descendant iterators per
>     Michal.
> 
> v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
>     Fixed.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

Acked-by: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
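
The ordering described in the changelog can be modelled in user space. The
following is only a minimal sketch, not the kernel implementation: struct
group, struct subsys, group_create() and the helper callbacks are
hypothetical names made up for illustration. It shows post_create() being
invoked only after every create() has succeeded and the group has been
marked live.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's cgroup types. */
struct group {
	int live;	/* set once the group is "linked into the hierarchy" */
	int notified;	/* number of post_create() calls seen */
};

struct subsys {
	int  (*create)(struct group *g);	/* may fail; 0 or negative errno */
	void (*post_create)(struct group *g);	/* optional; runs only on success */
	void (*destroy)(struct group *g);	/* optional; undoes create() */
};

/*
 * Create a group across n subsystems. post_create() is called only after
 * all create() callbacks succeeded and the group has been marked live.
 */
static int group_create(struct group *g, struct subsys *ss, size_t n)
{
	size_t i;
	int err;

	assert(g && ss);
	for (i = 0; i < n; i++) {
		err = ss[i].create(g);
		if (err) {
			/* Roll back the subsystems that already succeeded. */
			while (i-- > 0)
				if (ss[i].destroy)
					ss[i].destroy(g);
			return err;
		}
	}

	g->live = 1;

	for (i = 0; i < n; i++)
		if (ss[i].post_create)
			ss[i].post_create(g);
	return 0;
}

/* Illustrative callbacks (assumptions, not kernel functions). */
static int  create_ok(struct group *g)   { (void)g; return 0; }
static int  create_fail(struct group *g) { (void)g; return -12; /* -ENOMEM */ }
static void count_post(struct group *g)  { assert(g->live); g->notified++; }
```

With two subsystems that both use create_ok(), group_create() returns 0 and
count_post() runs twice; if either uses create_fail(), no post_create() runs
and the group is never marked live.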


* Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
From: Srivatsa S. Bhat @ 2012-11-09  9:03 UTC (permalink / raw)
  To: Ankita Garg
  Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm,
	linux-mm, linux-kernel@vger.kernel.org, andi
In-Reply-To: <CAKD8Uxd=BguLj=4VvRRfKBDdqrz+p_6Sj6JF2UNEjLd-HNmHMw@mail.gmail.com>

Hi Ankita,

On 11/09/2012 11:31 AM, Ankita Garg wrote:
> Hi Srivatsa,
> 
> I understand that you are maintaining the page blocks in region sorted
> order. So that way, when the memory requests come in, you can hand out
> memory from the regions in that order.

Yes, that's right.

> However, do you take this
> scenario into account - in some bucket of the buddy allocator, there
> might not be any pages belonging to, let's say, region 0, while the next
> higher bucket has them. So, instead of handing out memory from whichever
> region is present there, would it be better to go to the next bucket, split
> that region 0 pageblock there and allocate from it? (Here, region 0 is
> just an example). Been a while since I looked at kernel code, so I might
> be missing something!
> 

This patchset doesn't attempt to do that because that can hurt the fast
path performance of page allocation (i.e., because we could end up trying
to split pageblocks even when we already have pageblocks of the required
order ready at hand... and not to mention the searching involved in finding
out whether any higher order free lists really contain pageblocks belonging
to this region 0). In this patchset, I have consciously tried to keep the
overhead from memory regions as low as possible, and have moved most of
the overhead to the page free path.

But the scenario that you brought out is very relevant, because that would
help achieve more aggressive power-savings. I will try to implement
something to that end with least overhead in the next version and measure
whether its cost vs benefit really works out or not. Thank you very much
for pointing it out!

Regards,
Srivatsa S. Bhat

> 
> 
> On Tue, Nov 6, 2012 at 1:53 PM, Srivatsa S. Bhat
> <srivatsa.bhat@linux.vnet.ibm.com
> <mailto:srivatsa.bhat@linux.vnet.ibm.com>> wrote:
> 
>     The zones' freelists need to be made region-aware, in order to influence
>     page allocation and freeing algorithms. So in every free list in the
>     zone, we would like to demarcate the pageblocks belonging to different
>     memory regions (we can do this using a set of pointers, and thus avoid
>     splitting up the freelists).
> 
>     Also, we would like to keep the pageblocks in the freelists sorted in
>     region-order. That is, pageblocks belonging to region-0 would come
>     first, followed by pageblocks belonging to region-1 and so on, within
>     a given freelist. Of course, a set of pageblocks belonging to the same
>     region need not be sorted; it is sufficient if we maintain the
>     pageblocks in region-sorted-order, rather than a full
>     address-sorted-order.
> 
>     For each freelist within the zone, we maintain a set of pointers to
>     pageblocks belonging to the various memory regions in that zone.
> 
>     Eg:
> 
>         |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
>          ____      ____      ____      ____      ____      ____      ____
>     --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->
> 
>                      ^                  ^                              ^
>                      |                  |                              |
>                     Reg0               Reg1                          Reg2
> 
> 
>     Page allocation will proceed as usual - pick the first item on the
>     free list. But we don't want to keep updating these region pointers
>     every time we allocate a pageblock from the freelist. So, instead of
>     pointing to the *first* pageblock of that region, we maintain the
>     region pointers such that they point to the *last* pageblock in that
>     region, as shown in the figure above. That way, as long as there are > 1
>     pageblocks in that region in that freelist, that region pointer doesn't
>     need to be updated.
> 
> 
>     Page allocation algorithm:
>     -------------------------
> 
>     The heart of the page allocation algorithm remains as it is - pick the
>     first item on the appropriate freelist and return it.
> 
> 
>     Pageblock order in the zone freelists:
>     -------------------------------------
> 
>     This is the main change - we keep the pageblocks in region-sorted order,
>     where pageblocks belonging to region-0 come first, followed by those
>     belonging to region-1 and so on. But the pageblocks within a given
>     region need *not* be sorted, since we need them to be only
>     region-sorted and not fully address-sorted.
> 
>     This sorting is performed when adding pages back to the freelists, thus
>     avoiding any region-related overhead in the critical page allocation
>     paths.
> 
>     Page reclaim [Todo]:
>     --------------------
> 
>     Page allocation happens in the order of increasing region number. We
>     would like to do page reclaim in the reverse order, to keep allocated
>     pages within a minimal number of regions (approximately).
> 
>     ---------------------------- Increasing region number---------------------->
> 
>     Direction of allocation--->                         <---Direction of reclaim
> 
>     Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
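
The region-sorted freelist with "last pageblock" pointers described in the
quoted changelog can be illustrated with a small user-space model. This is
only a sketch under simplifying assumptions, not the code from the patch:
struct block, struct freelist, MAX_REGIONS, freelist_add() and
freelist_pop() are hypothetical names, a singly linked list stands in for
the kernel's list_head, and new blocks are always placed at the end of
their region.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 4	/* hypothetical limit for this sketch */

struct block {
	int region;
	struct block *next;
};

struct freelist {
	struct block *head;
	struct block *last[MAX_REGIONS]; /* last block of each region, or NULL */
};

/* Free path: insert so that the list stays sorted by region number. */
static void freelist_add(struct freelist *fl, struct block *b)
{
	struct block *prev = NULL;
	int r;

	assert(b->region >= 0 && b->region < MAX_REGIONS);

	/* Last block of this region, else of the nearest lower populated one. */
	for (r = b->region; r >= 0 && !prev; r--)
		prev = fl->last[r];

	if (prev) {
		b->next = prev->next;
		prev->next = b;
	} else {
		/* No block of this or any lower region: goes to the front. */
		b->next = fl->head;
		fl->head = b;
	}
	fl->last[b->region] = b;
}

/* Allocation path: unchanged in spirit, just take the first item. */
static struct block *freelist_pop(struct freelist *fl)
{
	struct block *b = fl->head;

	if (!b)
		return NULL;
	fl->head = b->next;
	/* The region pointer changes only when the region's last block leaves. */
	if (fl->last[b->region] == b)
		fl->last[b->region] = NULL;
	b->next = NULL;
	return b;
}
```

All of the searching happens in freelist_add(), i.e. on the free path,
while freelist_pop() only compares one pointer, which mirrors the intent of
keeping region-related overhead out of the allocation path.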


* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Mel Gorman @ 2012-11-09  9:00 UTC (permalink / raw)
  To: Vaidyanathan Srinivasan
  Cc: Srivatsa S. Bhat, akpm, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar,
	linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121109051247.GA499@dirshya.in.ibm.com>

On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
> * Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]:
> 
> > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > > ------------------------------------------------------------
> 
> Hi Mel,
> 
> Thanks for detailed review and comments.  The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
> 

I see.

> The first approach that we had last year (hierarchy) has more runtime
> overhead.  This approach of sorted-buddy was one of the alternative
> discussed earlier and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
> 
> Smart reclaim based on this approach is a key piece we still need to
> design.  Ideas from compaction will certainly help.
> 
> > > Today memory subsystems offer a wide range of capabilities for managing
> > > memory power consumption. As a quick example, if a block of memory is not
> > > referenced for a threshold amount of time, the memory controller can decide to
> > > put that chunk into a low-power content-preserving state. And the next
> > > reference to that memory chunk would bring it back to full power for read/write.
> > > With this capability in place, it becomes important for the OS to understand
> > > the boundaries of such power-manageable chunks of memory and to ensure that
> > > references are consolidated to a minimum number of such memory power management
> > > domains.
> > > 
> > 
> > How much power is saved?
> 
> On embedded platform the savings could be around 5% as discussed in
> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
> 
> On larger servers with large amounts of memory the savings could be
> more.  We do not yet have all the pieces together to evaluate.
> 

Ok, it's something to keep an eye on because if memory power savings
require large amounts of CPU (for smart placement or migration) or more
disk accesses (due to reclaim) then the savings will be offset by
increased power usage elsewhere.

> > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > > the firmware can expose information regarding the boundaries of such memory
> > > power management domains to the OS in a standard way.
> > > 
> > 
> > I'm not familiar with the ACPI spec but is there support for parsing of
> > MPST and interpreting the associated ACPI events? For example, if ACPI
> > fires an event indicating that a memory power node is to enter a low
> > state then presumably the OS should actively migrate pages away -- even
> > if it's going into a state where the contents are still refreshed
> > as exiting that state could take a long time.
> > 
> > I did not look closely at the patchset at all because it looked like the
> > actual support to use it and measure the benefit is missing.
> 
> Correct.  The platform interface part is not included in this patch
> set mainly because there is not much design required there.  Each
> platform can have code to collect the memory region boundaries from
> BIOS/firmware and load it into the Linux VM.  The goal of this patch
> is to brainstorm on the idea of how core VM should use the region
> information.
>  

Ok. It does mean that the patches should not be merged until there is
some platform support that can take advantage of them.

> > > How can Linux VM help memory power savings?
> > > 
> > > o Consolidate memory allocations and/or references such that they are
> > > not spread across the entire memory address space.  Basically area of memory
> > > that is not being referenced, can reside in low power state.
> > > 
> > 
> > Which the series does not appear to do.
> 
> Correct.  We need to design the correct reclaim strategy for this to
> work.  However having buddy list sorted by region address could get us
> one step closer to shaping the allocations.
> 

If you reclaim, it means that the information is going to disk and will
have to be refaulted in sooner rather than later. If you concentrate on
reclaiming low memory regions and memory is almost full, it will lead to
a situation where you almost always reclaim newer pages and increase
faulting. You will save a few milliwatts on memory and lose way more
than that on increased disk traffic and CPU usage.

> > > o Support targeted memory reclaim, where certain areas of memory that can be
> > > easily freed can be offlined, allowing those areas of memory to be put into
> > > lower power states.
> > > 
> > 
> > Which the series does not appear to do judging from this;
> > 
> >   include/linux/mm.h     |   38 +++++++
> >   include/linux/mmzone.h |   52 +++++++++
> >   mm/compaction.c        |    8 +
> >   mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
> >   mm/vmstat.c            |   59 ++++++++++-
> > 
> > This does not appear to be doing anything with reclaim and not enough with
> > compaction to indicate that the series actively manages memory placement
> > in response to ACPI events.
> 
> Correct.  Evaluating different ideas for reclaim will be next step
> before getting into the platform interface parts.
> 
> > Further in section 5.2.21.4 the spec says that power node regions can
> > overlap (but are not hierarchical for some reason) but have no gaps yet the
> > structure you use to represent it assumes there can be gaps and there are
> > no overlaps. Again, this is just glancing at the spec and a quick skim of
> > the patches so maybe I missed something that explains why this structure
> > is suitable.
> 
> This patch is roughly based on the idea that ACPI MPST will give us
> memory region boundaries.  It is not designed to implement all options
> defined in the spec. 

Ok, but as it is the only potential consumer of this interface that you
mentioned then it should at least be able to handle it. The spec talks about
overlapping memory regions where the regions potentially have different
power states. This is pretty damn remarkable and hard to see how it could
be interpreted in a sensible way but it forces your implementation to take
it into account.

> We have taken the general case where regions do not
> overlap while the memory addresses themselves can be discontinuous.
> 

Why is that the general case? You referred to the ACPI spec where it is not
the case and no other examples.

> > It seems to me that superficially the VM implementation for the support
> > would have
> > 
> > a) Involved a tree that managed the overlapping regions (even if it's
> >    not hierarchical it feels more sensible) and picked the highest-power-state
> >    common denominator in the tree. This would only be allocated if support
> >    for MPST is available.
> > b) Leave memory allocations and reclaim as they are in the active state.
> > c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
> >    power but still usable with a latency penalty. This might be a single
> >    migrate type but could also be a parallel set of free_area called
> >    free_area_lowpower that is only used when free_area is depleted and in
> >    the very slow path of the allocator.
> > d) Use memory hot-remove for power states where the refresh rates were
> >    not constant
> > 
> > and only did anything expensive in response to an ACPI event -- none of
> > the fast paths should be touched.
> > 
> > When transitioning to the low power state, memory should be migrated in
> > a vaguely similar fashion to what CMA does. For low-power, migration
> > failure is acceptable. If contents are not preserved, ACPI needs to know
> > if the migration failed because it cannot enter that power state.
> > 
> > For any of this to be worthwhile, low power states would need to be achieved
> > for long periods of time because that migration is not free.
> 
> In this patch series we are assuming the simple case of hardware
> managing the actual power states and OS facilitates them by keeping
> the allocations in fewer memory regions.  As we keep
> allocations and references to a region low, it becomes case (c)
> above. We are addressing only a small subset of the above list.
> 
> > > Memory Regions:
> > > ---------------
> > > 
> > > "Memory Regions" is a way of capturing the boundaries of power-manageable
> > > chunks of memory, within the MM subsystem.
> > > 
> > > Short description of the "Sorted-buddy" design:
> > > -----------------------------------------------
> > > 
> > > In this design, the memory region boundaries are captured in a parallel
> > > data-structure instead of fitting regions between nodes and zones in the
> > > hierarchy. Further, the buddy allocator is altered, such that we maintain the
> > > zones' freelists in region-sorted-order and thus do page allocation in the
> > > order of increasing memory regions.
> > 
> > Implying that this sorting has to happen in either the alloc or free
> > fast path.
> 
> Yes, in the free path. This optimization can actually be delayed in
> the free fast path and completely avoided if our memory is full and we
> are doing direct reclaim during allocations.
> 

Hurting the free fast path is a bad idea as there are workloads that depend
on it (buffer allocation and free) even though many workloads do *not*
notice it because the bulk of the cost is incurred at exit time. As
memory low power usage has many caveats (may be impossible if a page
table is allocated in the region for example) but CPU usage has fewer
restrictions it is more important that the CPU usage be kept low.

That means, little or no modification to the fastpath. Sorting or linear
searches should be minimised or avoided.

> > > <SNIPPED where I pointed out that compaction will bust sorting>
> > 
> > Compile-time exclusion is pointless because it'll be always activated by
> > distribution configs. Support for MPST should be detected at runtime and
> > 
> > 3. ACPI support to actually use this thing and validate the design is
> >    compatible with the spec and actually works in hardware
> 
> This is required to actually evaluate power saving benefit once we
> have candidate implementations in the VM.
> 
> At this point we want to look at overheads of having region
> infrastructure in VM and how does that trade off in terms of
> requirements that we can meet.
> 
> The first goal is to have memory allocations fill as few regions as
> possible when system's memory usage is significantly lower. 

While it's a reasonable starting objective, the fast path overhead is very
unfortunate and such a strategy can be easily defeated by running something
metadata intensive (like find over the entire system) while a large memory
user starts at the same time to spread kernel and user space allocations
throughout the address space. This will spread the allocations throughout
the address space and persist even after the two processes exit due to
the page cache usage from the metadata intensive workload.

Basically, it'll only work as long as the system is idle or never uses
much memory during the lifetime of the system.

-- 
Mel Gorman
SUSE Labs


* [PATCH v4 1/5] sd: put to stopped power state when runtime suspend
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu
In-Reply-To: <1352446075-1814-1-git-send-email-aaron.lu@intel.com>

When device is runtime suspended, put it to stopped power state to save
some power.

This will also make the behaviour consistent with what the scsi_pm.c
thinks about sd as the comment says:
sd treats runtime suspend, system suspend and system hibernate identical.
With this patch, it is now identical.
And sd_shutdown will also do nothing when it finds the device has been
runtime suspended, if we do not spin down the disk in runtime suspend
by putting it into stopped power state, the disk will be shut down
incorrectly.
And the same problem can be solved for runtime power off after
runtime suspended case by this change.

With the current runtime scheme for disk, it will only be runtime
suspended when no process opens the disk, so this shouldn't happen a
lot, which makes it acceptable to spin down the disk when runtime
suspended. If some day a more aggressive runtime scheme is used, like
the 'request based runtime pm for disk' that Alan Stern and Lin Ming
have been working on, we can introduce some policy to control this. But for
now, make it simple and correct by spinning down the disk.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/scsi/sd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 12f6fdf..8b6e004 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2911,7 +2911,8 @@ static int sd_suspend(struct device *dev, pm_message_t mesg)
 			goto done;
 	}

-	if ((mesg.event & PM_EVENT_SLEEP) && sdkp->device->manage_start_stop) {
+	if (((mesg.event & PM_EVENT_SLEEP) || PMSG_IS_AUTO(mesg)) &&
+			sdkp->device->manage_start_stop) {
 		sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
 		ret = sd_start_stop_device(sdkp, 0);
 	}
-- 
1.7.12.21.g871e293


* [PATCH RESEND v4 0/5] Migrate SCSI drivers to use dev_pm_ops
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu

This patchset has been quiet for a while, so I am resending it.

v4:
Only Patch 4 is modified:
Fixed a line over 80 characters warning by checkpatch.pl;
Update the changelog so that it is no more a try :-)

v3:
Only patch 4 is modified:
Remove the special case for system freeze in scsi_bus_suspend_common
as pointed out by Alan Stern;
Updated some comments;
Removed the use of typedef (*pm_callback_t)(struct device *).

v2:
Change the runtime suspend behaviour of sd driver by putting the device
into stopped power state.
Revert 2 patches which are no longer needed as pointed out by Alan Stern.
Find out device callbacks in bus callbacks as suggested by Alan Stern.

Due to these changes, patch number grows from 2 -> 5.

v1:
The 2 patches will migrate SCSI drivers to use the pm callbacks defined
in dev_pm_ops as pm_message is deprecated and should not be used by driver.
Bus level callback is changed to use callbacks defined in dev_pm_ops when
needed and sd's pm callback is updated to use what are defined in dev_pm_ops.

Aaron Lu (5):
  sd: put to stopped power state when runtime suspend
  Revert "[SCSI] scsi_pm: set device runtime state before parent
    suspended"
  Revert "[SCSI] runtime resume parent for child's system-resume"
  pm: use callbacks from dev_pm_ops for scsi devices
  sd: update sd to use the new pm callbacks

 drivers/scsi/scsi_pm.c | 98 +++++++++++++++++++++++++++-----------------------
 drivers/scsi/sd.c      | 18 +++++++---
 2 files changed, 67 insertions(+), 49 deletions(-)

-- 
1.7.12.21.g871e293


* [PATCH v4 5/5] sd: update sd to use the new pm callbacks
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu
In-Reply-To: <1352446075-1814-1-git-send-email-aaron.lu@intel.com>

Update sd driver to use the callbacks defined in dev_pm_ops.

sd_freeze is NULL: the bus level callback has taken care of quiescing
the device, so nothing should need to be done here.
Consequently, sd_thaw is not needed here either.

suspend, poweroff and runtime suspend share the same routine sd_suspend,
which will sync flush and then stop the drive, this is the same as before.

resume, restore and runtime resume share the same routine sd_resume,
which will start the drive by putting it into active power state, this
is also the same as before.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/scsi/sd.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8b6e004..6564305 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -104,7 +104,7 @@ static void sd_unlock_native_capacity(struct gendisk *disk);
 static int  sd_probe(struct device *);
 static int  sd_remove(struct device *);
 static void sd_shutdown(struct device *);
-static int sd_suspend(struct device *, pm_message_t state);
+static int sd_suspend(struct device *);
 static int sd_resume(struct device *);
 static void sd_rescan(struct device *);
 static int sd_done(struct scsi_cmnd *);
@@ -423,15 +423,23 @@ static struct class sd_disk_class = {
 	.dev_attrs	= sd_disk_attrs,
 };
 
+static const struct dev_pm_ops sd_pm_ops = {
+	.suspend		= sd_suspend,
+	.resume			= sd_resume,
+	.poweroff		= sd_suspend,
+	.restore		= sd_resume,
+	.runtime_suspend	= sd_suspend,
+	.runtime_resume		= sd_resume,
+};
+
 static struct scsi_driver sd_template = {
 	.owner			= THIS_MODULE,
 	.gendrv = {
 		.name		= "sd",
 		.probe		= sd_probe,
 		.remove		= sd_remove,
-		.suspend	= sd_suspend,
-		.resume		= sd_resume,
 		.shutdown	= sd_shutdown,
+		.pm		= &sd_pm_ops,
 	},
 	.rescan			= sd_rescan,
 	.done			= sd_done,
@@ -2896,7 +2904,7 @@ exit:
 	scsi_disk_put(sdkp);
 }
 
-static int sd_suspend(struct device *dev, pm_message_t mesg)
+static int sd_suspend(struct device *dev)
 {
 	struct scsi_disk *sdkp = scsi_disk_get_from_dev(dev);
 	int ret = 0;
@@ -2911,8 +2919,7 @@ static int sd_suspend(struct device *dev, pm_message_t mesg)
 			goto done;
 	}
 
-	if (((mesg.event & PM_EVENT_SLEEP) || PMSG_IS_AUTO(mesg)) &&
-			sdkp->device->manage_start_stop) {
+	if (sdkp->device->manage_start_stop) {
 		sd_printk(KERN_NOTICE, sdkp, "Stopping disk\n");
 		ret = sd_start_stop_device(sdkp, 0);
 	}
-- 
1.7.12.21.g871e293



* [PATCH v4 4/5] pm: use callbacks from dev_pm_ops for scsi devices
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu
In-Reply-To: <1352446075-1814-1-git-send-email-aaron.lu@intel.com>

Use of pm_message_t is deprecated and device drivers are not supposed
to use it. This patch migrates the SCSI bus level pm callbacks
to call device's pm callbacks defined in its driver's dev_pm_ops.

This is achieved by finding out which device pm callback should be used
in bus callback function, and then pass that callback function pointer
as a param to the scsi_bus_{suspend,resume}_common routine, which will
further pass that callback to scsi_dev_type_{suspend,resume} after
proper handling.

The special case for freeze in scsi_bus_suspend_common is not necessary
since no high level SCSI driver has implemented freeze, so there is no
need to runtime resume the device if it is in runtime suspended state
for system freeze; just return, like the system suspend/hibernate case.

Since only sd has implemented drv->suspend/drv->resume, and I'll update
sd driver to use the new callbacks in the following patch, there is no
need to fallback to call drv->suspend/drv->resume if dev_pm_ops is NULL.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/scsi/scsi_pm.c | 86 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 53 insertions(+), 33 deletions(-)

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index 9923b26..8f6b12c 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -16,16 +16,14 @@
 
 #include "scsi_priv.h"
 
-static int scsi_dev_type_suspend(struct device *dev, pm_message_t msg)
+static int scsi_dev_type_suspend(struct device *dev, int (*cb)(struct device *))
 {
-	struct device_driver *drv;
 	int err;
 
 	err = scsi_device_quiesce(to_scsi_device(dev));
 	if (err == 0) {
-		drv = dev->driver;
-		if (drv && drv->suspend) {
-			err = drv->suspend(dev, msg);
+		if (cb) {
+			err = cb(dev);
 			if (err)
 				scsi_device_resume(to_scsi_device(dev));
 		}
@@ -34,14 +32,12 @@ static int scsi_dev_type_suspend(struct device *dev, pm_message_t msg)
 	return err;
 }
 
-static int scsi_dev_type_resume(struct device *dev)
+static int scsi_dev_type_resume(struct device *dev, int (*cb)(struct device *))
 {
-	struct device_driver *drv;
 	int err = 0;
 
-	drv = dev->driver;
-	if (drv && drv->resume)
-		err = drv->resume(dev);
+	if (cb)
+		err = cb(dev);
 	scsi_device_resume(to_scsi_device(dev));
 	dev_dbg(dev, "scsi resume: %d\n", err);
 	return err;
@@ -49,35 +45,33 @@ static int scsi_dev_type_resume(struct device *dev)
 
 #ifdef CONFIG_PM_SLEEP
 
-static int scsi_bus_suspend_common(struct device *dev, pm_message_t msg)
+static int
+scsi_bus_suspend_common(struct device *dev, int (*cb)(struct device *))
 {
 	int err = 0;
 
 	if (scsi_is_sdev_device(dev)) {
 		/*
-		 * sd is the only high-level SCSI driver to implement runtime
-		 * PM, and sd treats runtime suspend, system suspend, and
-		 * system hibernate identically (but not system freeze).
+		 * All the high-level SCSI drivers that implement runtime
+		 * PM treat runtime suspend, system suspend, and system
+		 * hibernate identically.
 		 */
-		if (pm_runtime_suspended(dev)) {
-			if (msg.event == PM_EVENT_SUSPEND ||
-			    msg.event == PM_EVENT_HIBERNATE)
-				return 0;	/* already suspended */
+		if (pm_runtime_suspended(dev))
+			return 0;
 
-			/* wake up device so that FREEZE will succeed */
-			pm_runtime_resume(dev);
-		}
-		err = scsi_dev_type_suspend(dev, msg);
+		err = scsi_dev_type_suspend(dev, cb);
 	}
+
 	return err;
 }
 
-static int scsi_bus_resume_common(struct device *dev)
+static int
+scsi_bus_resume_common(struct device *dev, int (*cb)(struct device *))
 {
 	int err = 0;
 
 	if (scsi_is_sdev_device(dev))
-		err = scsi_dev_type_resume(dev);
+		err = scsi_dev_type_resume(dev, cb);
 
 	if (err == 0) {
 		pm_runtime_disable(dev);
@@ -102,26 +96,49 @@ static int scsi_bus_prepare(struct device *dev)
 
 static int scsi_bus_suspend(struct device *dev)
 {
-	return scsi_bus_suspend_common(dev, PMSG_SUSPEND);
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_suspend_common(dev, pm ? pm->suspend : NULL);
+}
+
+static int scsi_bus_resume(struct device *dev)
+{
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_resume_common(dev, pm ? pm->resume : NULL);
 }
 
 static int scsi_bus_freeze(struct device *dev)
 {
-	return scsi_bus_suspend_common(dev, PMSG_FREEZE);
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_suspend_common(dev, pm ? pm->freeze : NULL);
+}
+
+static int scsi_bus_thaw(struct device *dev)
+{
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_resume_common(dev, pm ? pm->thaw : NULL);
 }
 
 static int scsi_bus_poweroff(struct device *dev)
 {
-	return scsi_bus_suspend_common(dev, PMSG_HIBERNATE);
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_suspend_common(dev, pm ? pm->poweroff : NULL);
+}
+
+static int scsi_bus_restore(struct device *dev)
+{
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	return scsi_bus_resume_common(dev, pm ? pm->restore : NULL);
 }
 
 #else /* CONFIG_PM_SLEEP */
 
-#define scsi_bus_resume_common		NULL
 #define scsi_bus_prepare		NULL
 #define scsi_bus_suspend		NULL
+#define scsi_bus_resume			NULL
 #define scsi_bus_freeze			NULL
+#define scsi_bus_thaw			NULL
 #define scsi_bus_poweroff		NULL
+#define scsi_bus_restore		NULL
 
 #endif /* CONFIG_PM_SLEEP */
 
@@ -130,10 +147,12 @@ static int scsi_bus_poweroff(struct device *dev)
 static int scsi_runtime_suspend(struct device *dev)
 {
 	int err = 0;
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 
 	dev_dbg(dev, "scsi_runtime_suspend\n");
 	if (scsi_is_sdev_device(dev)) {
-		err = scsi_dev_type_suspend(dev, PMSG_AUTO_SUSPEND);
+		err = scsi_dev_type_suspend(dev,
+				pm ? pm->runtime_suspend : NULL);
 		if (err == -EAGAIN)
 			pm_schedule_suspend(dev, jiffies_to_msecs(
 				round_jiffies_up_relative(HZ/10)));
@@ -147,10 +166,11 @@ static int scsi_runtime_suspend(struct device *dev)
 static int scsi_runtime_resume(struct device *dev)
 {
 	int err = 0;
+	const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 
 	dev_dbg(dev, "scsi_runtime_resume\n");
 	if (scsi_is_sdev_device(dev))
-		err = scsi_dev_type_resume(dev);
+		err = scsi_dev_type_resume(dev, pm ? pm->runtime_resume : NULL);
 
 	/* Insert hooks here for targets, hosts, and transport classes */
 
@@ -229,11 +249,11 @@ void scsi_autopm_put_host(struct Scsi_Host *shost)
 const struct dev_pm_ops scsi_bus_pm_ops = {
 	.prepare =		scsi_bus_prepare,
 	.suspend =		scsi_bus_suspend,
-	.resume =		scsi_bus_resume_common,
+	.resume =		scsi_bus_resume,
 	.freeze =		scsi_bus_freeze,
-	.thaw =			scsi_bus_resume_common,
+	.thaw =			scsi_bus_thaw,
 	.poweroff =		scsi_bus_poweroff,
-	.restore =		scsi_bus_resume_common,
+	.restore =		scsi_bus_restore,
 	.runtime_suspend =	scsi_runtime_suspend,
 	.runtime_resume =	scsi_runtime_resume,
 	.runtime_idle =		scsi_runtime_idle,
-- 
1.7.12.21.g871e293



* [PATCH v4 3/5] Revert "[SCSI] runtime resume parent for child's system-resume"
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu
In-Reply-To: <1352446075-1814-1-git-send-email-aaron.lu@intel.com>

This reverts commit 28fd00d42cca178638f51c08efa986a777c24a4b.

With commit 88d26136a256576e444db312179e17af6dd0ea87 (PM: Prevent
runtime suspend during system resume), this patch is no longer needed.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/scsi/scsi_pm.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index d4201de..9923b26 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -76,17 +76,8 @@ static int scsi_bus_resume_common(struct device *dev)
 {
 	int err = 0;
 
-	if (scsi_is_sdev_device(dev)) {
-		/*
-		 * Parent device may have runtime suspended as soon as
-		 * it is woken up during the system resume.
-		 *
-		 * Resume it on behalf of child.
-		 */
-		pm_runtime_get_sync(dev->parent);
+	if (scsi_is_sdev_device(dev))
 		err = scsi_dev_type_resume(dev);
-		pm_runtime_put_sync(dev->parent);
-	}
 
 	if (err == 0) {
 		pm_runtime_disable(dev);
-- 
1.7.12.21.g871e293



* [PATCH v4 2/5] Revert "[SCSI] scsi_pm: set device runtime state before parent suspended"
From: Aaron Lu @ 2012-11-09  7:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Alan Stern, Rafael J. Wysocki, linux-pm, linux-scsi, Aaron Lu,
	Aaron Lu
In-Reply-To: <1352446075-1814-1-git-send-email-aaron.lu@intel.com>

This reverts commit 33a2285d96b5e7b9500612ec623bf4313397bb53.

With commit 88d26136a256576e444db312179e17af6dd0ea87 (PM: Prevent
runtime suspend during system resume), this patch is no longer needed.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
---
 drivers/scsi/scsi_pm.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index dc0ad85..d4201de 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -76,24 +76,23 @@ static int scsi_bus_resume_common(struct device *dev)
 {
 	int err = 0;
 
-	/*
-	 * Parent device may have runtime suspended as soon as
-	 * it is woken up during the system resume.
-	 *
-	 * Resume it on behalf of child.
-	 */
-	pm_runtime_get_sync(dev->parent);
-
-	if (scsi_is_sdev_device(dev))
+	if (scsi_is_sdev_device(dev)) {
+		/*
+		 * Parent device may have runtime suspended as soon as
+		 * it is woken up during the system resume.
+		 *
+		 * Resume it on behalf of child.
+		 */
+		pm_runtime_get_sync(dev->parent);
 		err = scsi_dev_type_resume(dev);
+		pm_runtime_put_sync(dev->parent);
+	}
+
 	if (err == 0) {
 		pm_runtime_disable(dev);
 		pm_runtime_set_active(dev);
 		pm_runtime_enable(dev);
 	}
-
-	pm_runtime_put_sync(dev->parent);
-
 	return err;
 }
 
-- 
1.7.12.21.g871e293



* [PATCH v9 10/10] ata: expose pm qos flags to user space for ata device
From: Aaron Lu @ 2012-11-09  6:52 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

Expose the pm qos flags to user space so that the user can disable pm
features such as power off, either because the platform or device is
broken or simply because the feature is not wanted.

The flags are exposed to user space only for ATA devices and for ATAPI
devices that are zero power capable. A normal ATAPI device will never be
powered off.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/libata-acpi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
index 91f3405..3984290 100644
--- a/drivers/ata/libata-acpi.c
+++ b/drivers/ata/libata-acpi.c
@@ -1031,6 +1031,8 @@ void ata_acpi_bind(struct ata_device *dev)
 	ata_acpi_register_power_resource(dev);
 	dev_pm_qos_add_request(&dev->sdev->sdev_gendev, &dev->poweroff_req,
 				DEV_PM_QOS_FLAGS, value);
+	if (dev->class == ATA_DEV_ATA || zpodd_dev_enabled(dev))
+		dev_pm_qos_expose_flags(&dev->sdev->sdev_gendev, 0);
 }
 
 void ata_acpi_unbind(struct ata_device *dev)
-- 
1.7.12.4



* [PATCH v9 08/10] scsi: sr: support (un)block events
From: Aaron Lu @ 2012-11-09  6:52 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

Two interfaces are added to block/unblock events for the disk that sr
manages. They are used by SATA ZPODD: when the ODD is runtime powered
off, events polling is no longer needed and is better blocked. Once the
ODD is powered on again, events polling is unblocked.

These two interfaces are needed here because the SATA layer does not
have access to the gendisk structure that sr manages.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/scsi/Makefile   |  1 +
 drivers/scsi/sr_zpodd.c | 21 +++++++++++++++++++++
 drivers/scsi/sr_zpodd.h |  9 +++++++++
 3 files changed, 31 insertions(+)
 create mode 100644 drivers/scsi/sr_zpodd.c
 create mode 100644 drivers/scsi/sr_zpodd.h

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 888f73a..474efe2 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -177,6 +177,7 @@ sd_mod-objs	:= sd.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
 
 sr_mod-objs	:= sr.o sr_ioctl.o sr_vendor.o
+sr_mod-$(CONFIG_SATA_ZPODD) += sr_zpodd.o
 ncr53c8xx-flags-$(CONFIG_SCSI_ZALON) \
 		:= -DCONFIG_NCR53C8XX_PREFETCH -DSCSI_NCR_BIG_ENDIAN \
 			-DCONFIG_SCSI_NCR53C8XX_NO_WORD_TRANSFERS
diff --git a/drivers/scsi/sr_zpodd.c b/drivers/scsi/sr_zpodd.c
new file mode 100644
index 0000000..0686e8c
--- /dev/null
+++ b/drivers/scsi/sr_zpodd.c
@@ -0,0 +1,21 @@
+#include <linux/module.h>
+#include <linux/genhd.h>
+#include <linux/cdrom.h>
+#include "sr.h"
+
+bool sr_block_events(struct device *dev)
+{
+	struct scsi_cd *cd = dev_get_drvdata(dev);
+	if (cd)
+		return disk_try_block_events(cd->disk);
+	return false;
+}
+EXPORT_SYMBOL(sr_block_events);
+
+void sr_unblock_events(struct device *dev)
+{
+	struct scsi_cd *cd = dev_get_drvdata(dev);
+	if (cd)
+		disk_unblock_events(cd->disk);
+}
+EXPORT_SYMBOL(sr_unblock_events);
diff --git a/drivers/scsi/sr_zpodd.h b/drivers/scsi/sr_zpodd.h
new file mode 100644
index 0000000..49c6a55
--- /dev/null
+++ b/drivers/scsi/sr_zpodd.h
@@ -0,0 +1,9 @@
+#ifndef __SR_ZPODD_H__
+#define __SR_ZPODD_H__
+
+#include <linux/device.h>
+
+extern bool sr_block_events(struct device *dev);
+extern void sr_unblock_events(struct device *dev);
+
+#endif
-- 
1.7.12.4



* [PATCH v9 07/10] block: add a new interface to block events
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

A new interface to block disk events is added. It is meant to eliminate
a race between the runtime PM callback and disk events checking.

Suppose the following device tree:
device_sata_port  (the parent)
  device_ODD      (the child)

When the ODD is runtime suspended, the sata port gets a chance to enter
the runtime suspended state. In the sata port's runtime suspend
callback, it checks whether it is OK to remove power from the ODD. If
so, the periodically running events checking work is stopped: that check
would otherwise wake the ODD up, and cancelling it lets the ODD stay in
the zero power state much longer (there is no need to worry about how
the ODD gets a media change event in the ZPODD case).

I used disk_block_events to do the events blocking, but there is a race
that can lead to a deadlock. When I call disk_block_events in the sata
port's runtime suspend callback, the events checking work may already be
running. It will resume the ODD synchronously, and the PM core will try
to resume its parent, the sata port, first. The PM core makes sure that
the runtime resume callback does not run concurrently with the runtime
suspend callback: if the runtime suspend callback is running, the
runtime resume callback waits for it to finish. So the events checking
work waits for the sata port's runtime suspend callback, while the sata
port's runtime suspend callback waits for the disk events work to
finish. Deadlock.

ODD_suspend                        disk_events_workfn
  ata_port_suspend                   check_events
    disk_block_events                  resume ODD
      cancel_delayed_work_sync           resume parent
      (waiting for disk_events_workfn)   (waiting for suspend callback)

So a new function, disk_try_block_events, is added. It tries to cancel
the delayed work if it is pending. If that succeeds, disk_block_events
is called and we are done; if it fails, false is returned without
doing anything. In this way, the race is avoided.
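
The try-then-block idea can be sketched in user space as follows. This
is only an illustration, not the kernel code: toy_events, the work_state
enum and all toy_* helpers are hypothetical stand-ins, and the model is
single threaded, so it shows the decision logic rather than the locking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state of the delayed events-check work. */
enum work_state { WORK_IDLE, WORK_PENDING, WORK_RUNNING };

struct toy_events {
	enum work_state work;	/* state of the periodic check */
	int block_depth;	/* > 0 means events are blocked */
};

/*
 * Models a non-waiting cancel: it succeeds only if the work is queued
 * but has not started yet. It never waits for a running work item.
 */
static bool toy_cancel_pending(struct toy_events *ev)
{
	if (ev->work != WORK_PENDING)
		return false;
	ev->work = WORK_IDLE;
	return true;
}

/* Models the blocking step; only safe when no check is running. */
static void toy_block(struct toy_events *ev)
{
	assert(ev->work != WORK_RUNNING);
	ev->block_depth++;
}

/* Block events only if the pending check could be cancelled. */
static bool toy_try_block(struct toy_events *ev)
{
	if (!ev)
		return false;

	if (toy_cancel_pending(ev)) {
		toy_block(ev);
		return true;
	}

	/* Work is idle or already running: do nothing, let the caller retry. */
	return false;
}
```

A caller in a suspend path would treat a false return as "not now" and
leave the device powered, instead of waiting for the running check.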

The newly added interface and the disk_unblock_events are exported, as
sr driver will need to use them to block/unblock disk events.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 block/genhd.c         | 26 ++++++++++++++++++++++++++
 include/linux/genhd.h |  1 +
 2 files changed, 27 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 6cace66..8632fd3 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1469,6 +1469,31 @@ void disk_block_events(struct gendisk *disk)
 	mutex_unlock(&ev->block_mutex);
 }

+/*
+ * Under some circumstances, there is a race between the calling thread
+ * of disk_block_events and the events checking function. To avoid such a race,
+ * this function will check if the delayed work is pending. If not, it means
+ * the work is either not queued or is already running, false is returned.
+ * And if yes, try to cancel the delayed work. If succeeded, disk_block_events
+ * will be called and there is no worry that cancel_delayed_work_sync will
+ * deadlock the events checking function. And if failed, false is returned.
+ */
+bool disk_try_block_events(struct gendisk *disk)
+{
+	struct disk_events *ev = disk->ev;
+
+	if (!ev)
+		return false;
+
+	if (cancel_delayed_work(&disk->ev->dwork)) {
+		disk_block_events(disk);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(disk_try_block_events);
+
 static void __disk_unblock_events(struct gendisk *disk, bool check_now)
 {
 	struct disk_events *ev = disk->ev;
@@ -1512,6 +1537,7 @@ void disk_unblock_events(struct gendisk *disk)
 	if (disk->ev)
 		__disk_unblock_events(disk, false);
 }
+EXPORT_SYMBOL(disk_unblock_events);

 /**
  * disk_flush_events - schedule immediate event checking and flushing
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 4f440b3..b67247f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -420,6 +420,7 @@ static inline int get_disk_ro(struct gendisk *disk)
 }

 extern void disk_block_events(struct gendisk *disk);
+extern bool disk_try_block_events(struct gendisk *disk);
 extern void disk_unblock_events(struct gendisk *disk);
 extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
 extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);
-- 
1.7.12.4


* [PATCH v9 05/10] libata: separate ATAPI code
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

The atapi_eh_tur and atapi_eh_request_sense can be reused by ZPODD
code, so separate them out to a file named libata-atapi.c, and the
Makefile is modified accordingly. No functional changes should result
from this commit.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/Makefile       |  2 +-
 drivers/ata/libata-atapi.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata-eh.c    | 85 --------------------------------------------
 drivers/ata/libata.h       |  4 +++
 4 files changed, 93 insertions(+), 86 deletions(-)
 create mode 100644 drivers/ata/libata-atapi.c

diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index 85e3de4..36377f9 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -103,7 +103,7 @@ obj-$(CONFIG_ATA_GENERIC)	+= ata_generic.o
 # Should be last libata driver
 obj-$(CONFIG_PATA_LEGACY)	+= pata_legacy.o
 
-libata-y	:= libata-core.o libata-scsi.o libata-eh.o libata-transport.o
+libata-y	:= libata-core.o libata-scsi.o libata-eh.o libata-atapi.o libata-transport.o
 libata-$(CONFIG_ATA_SFF)	+= libata-sff.o
 libata-$(CONFIG_SATA_PMP)	+= libata-pmp.o
 libata-$(CONFIG_ATA_ACPI)	+= libata-acpi.o
diff --git a/drivers/ata/libata-atapi.c b/drivers/ata/libata-atapi.c
new file mode 100644
index 0000000..28684ae
--- /dev/null
+++ b/drivers/ata/libata-atapi.c
@@ -0,0 +1,88 @@
+#include <linux/libata.h>
+#include <scsi/scsi_cmnd.h>
+#include "libata.h"
+
+/**
+ *	atapi_eh_tur - perform ATAPI TEST_UNIT_READY
+ *	@dev: target ATAPI device
+ *	@r_sense_key: out parameter for sense_key
+ *
+ *	Perform ATAPI TEST_UNIT_READY.
+ *
+ *	LOCKING:
+ *	EH context (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, AC_ERR_* mask on failure.
+ */
+unsigned int atapi_eh_tur(struct ata_device *dev, u8 *r_sense_key)
+{
+	u8 cdb[ATAPI_CDB_LEN] = { TEST_UNIT_READY, 0, 0, 0, 0, 0 };
+	struct ata_taskfile tf;
+	unsigned int err_mask;
+
+	ata_tf_init(dev, &tf);
+
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.command = ATA_CMD_PACKET;
+	tf.protocol = ATAPI_PROT_NODATA;
+
+	err_mask = ata_exec_internal(dev, &tf, cdb, DMA_NONE, NULL, 0, 0);
+	if (err_mask == AC_ERR_DEV)
+		*r_sense_key = tf.feature >> 4;
+	return err_mask;
+}
+
+/**
+ *	atapi_eh_request_sense - perform ATAPI REQUEST_SENSE
+ *	@dev: device to perform REQUEST_SENSE to
+ *	@sense_buf: result sense data buffer (SCSI_SENSE_BUFFERSIZE bytes long)
+ *	@dfl_sense_key: default sense key to use
+ *
+ *	Perform ATAPI REQUEST_SENSE after the device reported CHECK
+ *	SENSE.  This function is EH helper.
+ *
+ *	LOCKING:
+ *	Kernel thread context (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, AC_ERR_* mask on failure
+ */
+unsigned int atapi_eh_request_sense(struct ata_device *dev,
+				    u8 *sense_buf, u8 dfl_sense_key)
+{
+	u8 cdb[ATAPI_CDB_LEN] =	{
+		REQUEST_SENSE, 0, 0, 0, SCSI_SENSE_BUFFERSIZE, 0 };
+	struct ata_port *ap = dev->link->ap;
+	struct ata_taskfile tf;
+
+	DPRINTK("ATAPI request sense\n");
+
+	/* FIXME: is this needed? */
+	memset(sense_buf, 0, SCSI_SENSE_BUFFERSIZE);
+
+	/* initialize sense_buf with the error register,
+	 * for the case where they are -not- overwritten
+	 */
+	sense_buf[0] = 0x70;
+	sense_buf[2] = dfl_sense_key;
+
+	/* some devices time out if garbage left in tf */
+	ata_tf_init(dev, &tf);
+
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.command = ATA_CMD_PACKET;
+
+	/* is it pointless to prefer PIO for "safety reasons"? */
+	if (ap->flags & ATA_FLAG_PIO_DMA) {
+		tf.protocol = ATAPI_PROT_DMA;
+		tf.feature |= ATAPI_PKT_DMA;
+	} else {
+		tf.protocol = ATAPI_PROT_PIO;
+		tf.lbam = SCSI_SENSE_BUFFERSIZE;
+		tf.lbah = 0;
+	}
+
+	return ata_exec_internal(dev, &tf, cdb, DMA_FROM_DEVICE,
+				 sense_buf, SCSI_SENSE_BUFFERSIZE, 0);
+}
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index e60437c..6487b88 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1579,91 +1579,6 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
 }
 
 /**
- *	atapi_eh_tur - perform ATAPI TEST_UNIT_READY
- *	@dev: target ATAPI device
- *	@r_sense_key: out parameter for sense_key
- *
- *	Perform ATAPI TEST_UNIT_READY.
- *
- *	LOCKING:
- *	EH context (may sleep).
- *
- *	RETURNS:
- *	0 on success, AC_ERR_* mask on failure.
- */
-static unsigned int atapi_eh_tur(struct ata_device *dev, u8 *r_sense_key)
-{
-	u8 cdb[ATAPI_CDB_LEN] = { TEST_UNIT_READY, 0, 0, 0, 0, 0 };
-	struct ata_taskfile tf;
-	unsigned int err_mask;
-
-	ata_tf_init(dev, &tf);
-
-	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
-	tf.command = ATA_CMD_PACKET;
-	tf.protocol = ATAPI_PROT_NODATA;
-
-	err_mask = ata_exec_internal(dev, &tf, cdb, DMA_NONE, NULL, 0, 0);
-	if (err_mask == AC_ERR_DEV)
-		*r_sense_key = tf.feature >> 4;
-	return err_mask;
-}
-
-/**
- *	atapi_eh_request_sense - perform ATAPI REQUEST_SENSE
- *	@dev: device to perform REQUEST_SENSE to
- *	@sense_buf: result sense data buffer (SCSI_SENSE_BUFFERSIZE bytes long)
- *	@dfl_sense_key: default sense key to use
- *
- *	Perform ATAPI REQUEST_SENSE after the device reported CHECK
- *	SENSE.  This function is EH helper.
- *
- *	LOCKING:
- *	Kernel thread context (may sleep).
- *
- *	RETURNS:
- *	0 on success, AC_ERR_* mask on failure
- */
-static unsigned int atapi_eh_request_sense(struct ata_device *dev,
-					   u8 *sense_buf, u8 dfl_sense_key)
-{
-	u8 cdb[ATAPI_CDB_LEN] =
-		{ REQUEST_SENSE, 0, 0, 0, SCSI_SENSE_BUFFERSIZE, 0 };
-	struct ata_port *ap = dev->link->ap;
-	struct ata_taskfile tf;
-
-	DPRINTK("ATAPI request sense\n");
-
-	/* FIXME: is this needed? */
-	memset(sense_buf, 0, SCSI_SENSE_BUFFERSIZE);
-
-	/* initialize sense_buf with the error register,
-	 * for the case where they are -not- overwritten
-	 */
-	sense_buf[0] = 0x70;
-	sense_buf[2] = dfl_sense_key;
-
-	/* some devices time out if garbage left in tf */
-	ata_tf_init(dev, &tf);
-
-	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
-	tf.command = ATA_CMD_PACKET;
-
-	/* is it pointless to prefer PIO for "safety reasons"? */
-	if (ap->flags & ATA_FLAG_PIO_DMA) {
-		tf.protocol = ATAPI_PROT_DMA;
-		tf.feature |= ATAPI_PKT_DMA;
-	} else {
-		tf.protocol = ATAPI_PROT_PIO;
-		tf.lbam = SCSI_SENSE_BUFFERSIZE;
-		tf.lbah = 0;
-	}
-
-	return ata_exec_internal(dev, &tf, cdb, DMA_FROM_DEVICE,
-				 sense_buf, SCSI_SENSE_BUFFERSIZE, 0);
-}
-
-/**
  *	ata_eh_analyze_serror - analyze SError for a failed port
  *	@link: ATA link to analyze SError for
  *
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 55ad37e..5d68210 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -247,4 +247,8 @@ static inline void zpodd_deinit(struct ata_device *dev) {}
 static inline bool zpodd_dev_enabled(struct ata_device *dev) { return false; }
 #endif /* CONFIG_SATA_ZPODD */
 
+/* libata-atapi.c */
+unsigned int atapi_eh_tur(struct ata_device *dev, u8 *r_sense_key);
+unsigned int atapi_eh_request_sense(struct ata_device *dev, u8 *sense_buf, u8 dfl_sense_key);
+
 #endif /* __LIBATA_H__ */
-- 
1.7.12.4



* [PATCH v9 00/10] ZPODD Patches
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu

v9:
Build ZPODD as part of libata instead of another standalone module
as it is tightly related to other libata files.
Identify and init ZPODD during probe time instead of after SCSI
device is created as suggested by Tejun Heo.
Make use of pm qos flag to give ACPI hint when choosing ACPI state.
Expose qos flag to give user control of whether power off is allowed.

This patchset used Rafael's pm-qos work:
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git pm-qos

v8:
This version is a redesign, it doesn't have much to do with previous
versions. The ZPODD implementation is done almost entirely in ATA layer
now, except 2 helper functions from SCSI sr driver to block disk events.

The basic idea is that, when the ata port is runtime suspended, it
checks whether the ODD is ready to be powered off. If yes, events are
blocked and power is removed; if not, the ODD's power supply remains
unchanged by keeping the ACPI state at D0.

Some background knowledge about ZPODD is added below v1 history log.

v7:
Rework of runtime pm of the sr driver, based on ideas of Alan Stern and
Oliver Neukum.

Jeff, due to the ready_to_power_off flag added, there is a small
change in [PATCH v7 6/6] libata: acpi: respect may_power_off flag,
please check if I can still get your ack, thanks.

v6:
When the user changes the may_power_off flag through its sysfs entry and
the device is already runtime suspended, resume it so that it can
respect this flag the next time it is runtime suspended, as suggested by
Alan Stern.
Call scsi_autopm_get/put_device once in sr_check_events as suggested by
Alan Stern.

v5:
Add may_power_off flag to scsi device.
Alan Stern suggested that I should not mix runtime suspend with
runtime power off, but the current zpodd implementation made them not
easy to separate. So I rewrote the zpodd implementation; the end
result is that normal ODDs can also enter the runtime suspended state,
but their power won't be removed.

v4:
Rebase on top of Linus' tree, due to this, the problem of a missing
flag in v3 is gone;
Add a new function scsi_autopm_put_device_autosuspend to first mark
the device last busy and then do a put with autosuspend, as suggested
by Oliver Neukum.
Typo fix as pointed out by Sergei Shtylyov.
Check can_power_off flag before any runtime pm operations in sr.

v3:
Rebase on top of scsi-misc tree;
Add the sr related patches previously in Jeff's libata tree;
Re-organize the sr patches.
A problem for now: for patch
scsi: sr: support zero power ODD(ZPODD)
I can't set a flag in libata-acpi.c since a related function is
missing in scsi-misc tree. Will fix this when 3.6-rc1 released.

v2:
Bug fix for v1;
Use scsi_autopm_* in sr driver instead of pm_runtime_*;

v1:
Here are some patches to make ZPODD easier to use for end users and
a fix for using ZPODD with system suspend.

Some background knowledge about ZPODD:
ODD means Optical Disc Drive.
ZPODD means Zero Power ODD. It is a mechanism to place the ODD into the
zero power state while the system is running in the S0 system state,
without the user's awareness.
It achieves this with ACPI and the SATA device attention pin. For power
off, the normal ACPI control method is used to place the device into the
D3 cold ACPI device state, i.e. the device's power supply is removed.
For power on, when the user presses the eject button of a drawer type
ODD or inserts a disc into a slot type ODD, the device attention pin
will trigger. In the current x86 implementation, this pin connects to a
GPE, and the GPE triggers an ACPI interrupt. With our pre-registered
ACPI notification code, the device can be runtime resumed, and we place
the device back to full power state by setting its ACPI state to D0. The
whole process is transparent to the end user.
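
As a rough illustration of the wake path described above, here is a
user-space sketch. It is not the driver code: toy_zpodd and the toy_*
functions are hypothetical names, the attention pin, GPE and ACPI
notification are collapsed into a single function call, and runtime
resume is reduced to a state change. The tray eject for drawer type
drives follows what patch 09 of this series does after resume.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_mech { TOY_DRAWER, TOY_SLOT };	/* loading mechanism */
enum toy_dstate { TOY_D0, TOY_D3_COLD };	/* modelled ACPI device state */

struct toy_zpodd {
	enum toy_mech mech;
	enum toy_dstate state;
	bool from_notify;	/* wake came from the attention pin */
	bool tray_open;
};

/* Power off: the device is placed into D3 cold (power removed). */
static void toy_power_off(struct toy_zpodd *z)
{
	assert(z);
	z->state = TOY_D3_COLD;
}

/*
 * Power on: back to D0. If the wake was caused by the user pressing
 * the eject button of a drawer type drive, finish that request by
 * opening the tray.
 */
static void toy_power_on(struct toy_zpodd *z)
{
	assert(z);
	z->state = TOY_D0;
	if (z->from_notify) {
		z->from_notify = false;
		if (z->mech == TOY_DRAWER)
			z->tray_open = true;
	}
}

/* Attention pin -> GPE -> ACPI notification, collapsed into one call. */
static void toy_attention_notify(struct toy_zpodd *z)
{
	assert(z);
	if (z->state != TOY_D3_COLD)
		return;		/* only acts as a wake source while powered off */
	z->from_notify = true;
	toy_power_on(z);	/* stands in for the runtime resume */
}
```

In this model a slot type drive simply returns to D0 on notification,
while a drawer type drive additionally ends up with its tray open.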

Aaron Lu (10):
  scsi: sr: support runtime pm
  ata: zpodd: Add CONFIG_SATA_ZPODD
  ata: zpodd: identify and init ZPODD devices
  libata: acpi: move acpi notification code to zpodd
  libata: separate ATAPI code
  ata: zpodd: check zero power ready status
  block: add a new interface to block events
  scsi: sr: support (un)block events
  ata: zpodd: handle power transition of ODD
  ata: expose pm qos flags to user space for ata device

 block/genhd.c              |  26 ++++
 drivers/ata/Kconfig        |  12 ++
 drivers/ata/Makefile       |   3 +-
 drivers/ata/libata-acpi.c  | 123 ++++++-----------
 drivers/ata/libata-atapi.c |  88 ++++++++++++
 drivers/ata/libata-core.c  |   4 +-
 drivers/ata/libata-eh.c    |  87 +-----------
 drivers/ata/libata-scsi.c  |   4 +
 drivers/ata/libata-zpodd.c | 333 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata.h       |  33 +++++
 drivers/scsi/Makefile      |   1 +
 drivers/scsi/sr.c          |  30 +++-
 drivers/scsi/sr_zpodd.c    |  21 +++
 drivers/scsi/sr_zpodd.h    |   9 ++
 include/linux/genhd.h      |   1 +
 include/linux/libata.h     |   2 +
 include/uapi/linux/cdrom.h |  34 +++++
 17 files changed, 638 insertions(+), 173 deletions(-)
 create mode 100644 drivers/ata/libata-atapi.c
 create mode 100644 drivers/ata/libata-zpodd.c
 create mode 100644 drivers/scsi/sr_zpodd.c
 create mode 100644 drivers/scsi/sr_zpodd.h

-- 
1.7.12.4


* [PATCH v9 09/10] ata: zpodd: handle power transition of ODD
From: Aaron Lu @ 2012-11-09  6:52 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

When the ata port is runtime suspended, it checks whether the ODD
attached to it is in the zero power ready state by checking the zp_ready
field. If this field indicates it is not ready to be powered off, the
NO_POWEROFF qos flag is set to avoid choosing ACPI_STATE_D3_COLD for it.

Once powered off, disk events are blocked to avoid waking it up
every two seconds.

On resume, it regains power and goes through the recovery process. When
the reset of the ata port is done, the ODD is considered functional, and
post processing, such as ejecting the tray if the ODD is drawer type, is
done there. Disk events are also unblocked there.

For normal ODDs that do not support the zero power state, the
NO_POWEROFF qos flag will always be set.
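
The suspend-time decision and the resume-time cleanup can be sketched
in user space like this. It is a simplified model under stated
assumptions, not the libata code: toy_odd and the toy_* functions are
hypothetical, the qos flag is a plain bool, and falling back to D3 hot
when power off is forbidden is an assumption of the sketch (in the
patch the actual state is picked by acpi_pm_device_sleep_state()).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_pstate { TOY_D0, TOY_D3_HOT, TOY_D3_COLD };

struct toy_odd {
	bool zp_capable;	/* supports the zero power state */
	bool zp_ready;		/* zero power ready state */
	bool no_power_off;	/* models the NO_POWEROFF qos flag */
	bool powered_off;
	bool events_blocked;
};

/* At bind time every ODD starts with power off forbidden. */
static void toy_bind(struct toy_odd *odd)
{
	assert(odd);
	odd->no_power_off = true;
}

/* Suspend: update the flag from zp_ready, then choose a state. */
static enum toy_pstate toy_suspend(struct toy_odd *odd)
{
	enum toy_pstate state;

	assert(odd);
	if (odd->zp_capable)
		odd->no_power_off = !odd->zp_ready;

	/* Assumption: D3 hot is what remains when D3 cold is forbidden. */
	state = odd->no_power_off ? TOY_D3_HOT : TOY_D3_COLD;

	if (odd->zp_capable && state == TOY_D3_COLD) {
		odd->events_blocked = true;	/* stop the periodic poll */
		odd->powered_off = true;
	}
	return state;
}

/* Resume: undo the power-off bookkeeping and unblock events. */
static void toy_resume(struct toy_odd *odd)
{
	assert(odd);
	if (!odd->powered_off)
		return;
	odd->powered_off = false;
	odd->zp_ready = false;		/* readiness must be re-established */
	odd->events_blocked = false;
}
```

A normal ODD never leaves the "power off forbidden" state in this
model, which matches the last paragraph above.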

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/libata-acpi.c  | 40 ++++++++++++++++++++++-------
 drivers/ata/libata-eh.c    |  2 ++
 drivers/ata/libata-zpodd.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata.h       |  8 ++++++
 include/linux/libata.h     |  2 ++
 5 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
index 13ee178..91f3405 100644
--- a/drivers/ata/libata-acpi.c
+++ b/drivers/ata/libata-acpi.c
@@ -838,6 +838,24 @@ void ata_acpi_on_resume(struct ata_port *ap)
 	}
 }
 
+static int ata_acpi_choose_state(struct ata_device *dev)
+{
+	/* Always choose D3 for PATA devices */
+	if (!(dev->link->ap->flags & ATA_FLAG_ACPI_SATA))
+		return ACPI_STATE_D3;
+
+	if (zpodd_dev_enabled(dev)) {
+		if (zpodd_poweroff_ready(dev))
+			dev_pm_qos_update_request(&dev->poweroff_req, 0);
+		else
+			dev_pm_qos_update_request(&dev->poweroff_req,
+						  PM_QOS_FLAG_NO_POWER_OFF);
+	}
+
+	return acpi_pm_device_sleep_state(&dev->sdev->sdev_gendev,
+					  NULL, ACPI_STATE_D3_COLD);
+}
+
 /**
  * ata_acpi_set_state - set the port power state
  * @ap: target ATA port
@@ -864,17 +882,16 @@ void ata_acpi_set_state(struct ata_port *ap, pm_message_t state)
 			continue;
 
 		if (state.event != PM_EVENT_ON) {
-			acpi_state = acpi_pm_device_sleep_state(
-				&dev->sdev->sdev_gendev, NULL, ACPI_STATE_D3);
-			if (acpi_state > 0)
+			acpi_state = ata_acpi_choose_state(dev);
+			if (acpi_state > 0) {
 				acpi_bus_set_power(handle, acpi_state);
-			/* TBD: need to check if it's runtime pm request */
-			acpi_pm_device_run_wake(
-				&dev->sdev->sdev_gendev, true);
+				if (zpodd_dev_enabled(dev) &&
+				    acpi_state == ACPI_STATE_D3_COLD)
+					zpodd_post_poweroff(dev);
+			}
 		} else {
-			/* Ditto */
-			acpi_pm_device_run_wake(
-				&dev->sdev->sdev_gendev, false);
+			if (zpodd_dev_enabled(dev))
+				zpodd_pre_poweron(dev);
 			acpi_bus_set_power(handle, ACPI_STATE_D0);
 		}
 	}
@@ -1008,7 +1025,12 @@ static void ata_acpi_unregister_power_resource(struct ata_device *dev)
 
 void ata_acpi_bind(struct ata_device *dev)
 {
+	/* ODD can't be put to D3 cold state, unless it is zero power capable */
+	s32 value = dev->class == ATA_DEV_ATAPI ? PM_QOS_FLAG_NO_POWER_OFF : 0;
+
 	ata_acpi_register_power_resource(dev);
+	dev_pm_qos_add_request(&dev->sdev->sdev_gendev, &dev->poweroff_req,
+				DEV_PM_QOS_FLAGS, value);
 }
 
 void ata_acpi_unbind(struct ata_device *dev)
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 6487b88..1348e7c 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -3771,6 +3771,8 @@ int ata_eh_recover(struct ata_port *ap, ata_prereset_fn_t prereset,
 				rc = atapi_eh_clear_ua(dev);
 				if (rc)
 					goto rest_fail;
+				if (zpodd_dev_enabled(dev))
+					zpodd_post_resume(dev);
 			}
 		}
 
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index 533a39e..777f9c7 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -5,6 +5,7 @@
 #include <scsi/scsi_cmnd.h>
 
 #include "libata.h"
+#include "../scsi/sr_zpodd.h"
 
 #define POWEROFF_DELAY  (30 * 1000)     /* 30 seconds for power off delay */
 
@@ -15,6 +16,7 @@ struct zpodd {
 	bool status_ready:1;	/* ready status derived from media event poll,
 				   it is not accurate, but serves as a hint */
 	bool zp_ready:1;	/* zero power ready state */
+	bool powered_off:1;	/* ODD is powered off */
 
 	unsigned long last_ready; /* last zero power ready timestamp */
 
@@ -40,6 +42,17 @@ static int run_atapi_cmd(struct ata_device *dev, const char *cdb,
 			buf ? DMA_FROM_DEVICE : DMA_NONE, buf, buf_len, 0);
 }
 
+static int eject_tray(struct ata_device *dev)
+{
+	const char cdb[] = {  GPCMD_START_STOP_UNIT,
+			      0, 0, 0,
+			      0x02,     /* LoEj */
+			      0, 0, 0, 0, 0, 0, 0,
+	};
+
+	return run_atapi_cmd(dev, cdb, sizeof(cdb), NULL, 0);
+}
+
 /*
  * Per the spec, only slot type and drawer type ODD can be supported
  *
@@ -209,6 +222,56 @@ void zpodd_check_zpready(struct ata_device *dev)
 		zpodd->last_ready = 0;
 }
 
+/*
+ * Test if ODD is ready to be powered off.
+ * Determined by zp_ready and if events is successfully blocked
+ */
+bool zpodd_poweroff_ready(struct ata_device *dev)
+{
+	struct zpodd *zpodd = dev->private_data;
+	return zpodd->zp_ready;
+}
+
+void zpodd_post_poweroff(struct ata_device *dev)
+{
+	struct zpodd *zpodd = dev->private_data;
+
+	sr_block_events(&dev->sdev->sdev_gendev);
+
+	zpodd->powered_off = true;
+	device_set_run_wake(&dev->sdev->sdev_gendev, true);
+	acpi_pm_device_run_wake(&dev->sdev->sdev_gendev, true);
+}
+
+void zpodd_pre_poweron(struct ata_device *dev)
+{
+	struct zpodd *zpodd = dev->private_data;
+	if (zpodd->powered_off) {
+		acpi_pm_device_run_wake(&dev->sdev->sdev_gendev, false);
+		device_set_run_wake(&dev->sdev->sdev_gendev, false);
+	}
+}
+
+void zpodd_post_resume(struct ata_device *dev)
+{
+	struct zpodd *zpodd = dev->private_data;
+
+	if (!zpodd->powered_off)
+		return;
+
+	zpodd->powered_off = false;
+
+	if (zpodd->from_notify) {
+		zpodd->from_notify = false;
+		if (zpodd->drawer)
+			eject_tray(dev);
+	}
+
+	zpodd->last_ready = 0;
+	zpodd->zp_ready = false;
+	sr_unblock_events(&dev->sdev->sdev_gendev);
+}
+
 static void zpodd_wake_dev(acpi_handle handle, u32 event, void *context)
 {
 	struct ata_device *ata_dev = context;
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 2b46703..5e4baf9 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -243,12 +243,20 @@ static inline bool zpodd_dev_enabled(struct ata_device *dev)
 }
 void zpodd_snoop_status(struct ata_device *dev, struct scsi_cmnd *scmd);
 void zpodd_check_zpready(struct ata_device *dev);
+bool zpodd_poweroff_ready(struct ata_device *dev);
+void zpodd_post_poweroff(struct ata_device *dev);
+void zpodd_pre_poweron(struct ata_device *dev);
+void zpodd_post_resume(struct ata_device *dev);
 #else /* CONFIG_SATA_ZPODD */
 static inline void zpodd_init(struct ata_device *dev) {}
 static inline void zpodd_deinit(struct ata_device *dev) {}
 static inline bool zpodd_dev_enabled(struct ata_device *dev) { return false; }
 static inline void zpodd_snoop_status(struct ata_device *dev, struct scsi_cmnd *scmd) {}
 static inline void zpodd_check_zpready(struct ata_device *dev) {}
+static inline bool zpodd_poweroff_ready(struct ata_device *dev) { return false; }
+static inline void zpodd_post_poweroff(struct ata_device *dev) {}
+static inline void zpodd_pre_poweron(struct ata_device *dev) {}
+static inline void zpodd_post_resume(struct ata_device *dev) {}
 #endif /* CONFIG_SATA_ZPODD */
 
 /* libata-atapi.c */
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 77eeeda..dc98912 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -38,6 +38,7 @@
 #include <linux/acpi.h>
 #include <linux/cdrom.h>
 #include <linux/sched.h>
+#include <linux/pm_qos.h>
 
 /*
  * Define if arch has non-standard setup.  This is a _PCI_ standard
@@ -618,6 +619,7 @@ struct ata_device {
 #ifdef CONFIG_ATA_ACPI
 	union acpi_object	*gtf_cache;
 	unsigned int		gtf_filter;
+	struct dev_pm_qos_request poweroff_req;
 #endif
 	struct device		tdev;
 	/* n_sector is CLEAR_BEGIN, read comment above CLEAR_BEGIN */
-- 
1.7.12.4



* [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

Per the Mount Fuji spec, the ODD is considered zero power ready when:
- For slot type ODD, no media inside;
- For tray type ODD, no media inside and tray closed.

This information can be retrieved either from the data returned by the
GET_EVENT_STATUS_NOTIFICATION command (which is used to poll for media
events) or from the sense code.

In this implementation, the zero power ready status is determined by
the following factors:
1 The polled media status byte, which is recorded in the status_ready
  field of the zpodd structure;
2 The sense code obtained by issuing a TEST_UNIT_READY command after
  status_ready is found to be true.

The information provided by the media status byte is not accurate: just
after a new disc is inserted, the status byte may still report media
not present. So this information cannot be used as the final deciding
factor. But since the SCSI ODD driver sr polls the ODD every 2 seconds,
this information is readily available at almost no cost. So it is used
as a hint: when the media status byte indicates zero power ready, we
confirm it with the sense code method. This way, we avoid sending too
many TEST_UNIT_READY commands to the ODD.
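
As a rough illustration of the sense code method, the sketch below
decodes a fixed-format sense buffer in plain userspace C. The helper
name sense_says_zp_ready() and its signature are made up for this
example and are not part of the patch; it assumes the caller has
already seen a NOT READY sense key, and it mirrors the ASC/ASCQ checks
done in zpready().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): decide zero power readiness
 * from fixed-format sense data. "slot" selects the slot type rule,
 * otherwise the tray (drawer) type rule is used.
 */
static bool sense_says_zp_ready(const uint8_t *sense, size_t len, bool slot)
{
	uint8_t asc, ascq;

	/* need at least bytes 0..13 to read ASC and ASCQ */
	if (len < 14)
		return false;

	/* only accept current-error, fixed-format sense data (0x70) */
	if ((sense[0] & 0x7f) != 0x70)
		return false;

	/* additional sense length must cover the ASC/ASCQ bytes */
	if (sense[7] < 6)
		return false;

	asc = sense[12];
	ascq = sense[13];

	if (slot)
		return asc == 0x3a;              /* medium not present */

	return asc == 0x3a && ascq == 0x01;      /* not present, tray closed */
}
```

With ASC 0x3a and ASCQ 0x02 (tray open), this sketch reports a slot
type ODD as ready but a tray type ODD as not ready, which matches the
two rules listed at the top of this message.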

When the ODD is first sensed in the zero power ready state, the
timestamp is recorded. After the ODD has stayed in this state for a
pre-defined period (POWEROFF_DELAY, 30 seconds), it is considered power
off ready and the zp_ready flag is set. The zp_ready flag is the
deciding factor other code uses to see if power off is OK for the ODD.
The Mount Fuji spec suggests using a delay here, so that a power
transition is avoided when the user ejects a disc and immediately
inserts a new one. Some ODDs may also be slow to move their head to the
home position after the disc is ejected, so a delay here is generally a
good idea.

The zero power ready status check is performed in the ata port's runtime
suspend code path, while the port is not yet frozen, as we need to issue
some IO to the ODD.
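
The delay logic can be modelled outside the kernel roughly as follows.
This is only a sketch: struct zp_state, check_zpready() and the explicit
now_ms/probe_ready parameters are assumptions made for the example
(the patch uses jiffies, time_before() and calls zpready() itself), and
a separate have_last_ready flag replaces the "last_ready == 0 means
unset" convention.

```c
#include <stdbool.h>

#define POWEROFF_DELAY_MS (30 * 1000)   /* same 30 second delay as the patch */

/* Illustrative userspace state, loosely following struct zpodd */
struct zp_state {
	bool status_ready;            /* hint from the media event poll */
	bool zp_ready;                /* power off is considered OK */
	bool have_last_ready;         /* last_ready_ms holds a valid timestamp */
	unsigned long last_ready_ms;  /* when readiness was first confirmed */
};

/*
 * One pass of the check. probe_ready stands for the result that a
 * TEST_UNIT_READY plus sense code check would give at this moment.
 */
static void check_zpready(struct zp_state *s, unsigned long now_ms,
			  bool probe_ready)
{
	/* the cheap hint says "not ready": forget any recorded timestamp */
	if (!s->status_ready) {
		s->have_last_ready = false;
		return;
	}

	/* first time the hint says "ready": confirm and start the timer */
	if (!s->have_last_ready) {
		if (probe_ready) {
			s->have_last_ready = true;
			s->last_ready_ms = now_ms;
		}
		return;
	}

	/* still inside the delay window (unsigned math tolerates wrap) */
	if (now_ms - s->last_ready_ms < POWEROFF_DELAY_MS)
		return;

	/* delay elapsed: confirm once more before declaring ready */
	s->zp_ready = probe_ready;
	if (!s->zp_ready)
		s->have_last_ready = false;
}
```

In this model the expensive probe result is only consulted twice per
ready period: once to start the timer and once after the delay has
expired.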

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/libata-acpi.c  |   8 +++-
 drivers/ata/libata-scsi.c  |   4 ++
 drivers/ata/libata-zpodd.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata.h       |   4 ++
 4 files changed, 131 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
index 6b6819c..13ee178 100644
--- a/drivers/ata/libata-acpi.c
+++ b/drivers/ata/libata-acpi.c
@@ -784,7 +784,13 @@ static int ata_acpi_push_id(struct ata_device *dev)
  */
 int ata_acpi_on_suspend(struct ata_port *ap)
 {
-	/* nada */
+	struct ata_device *dev;
+
+	ata_for_each_dev(dev, &ap->link, ENABLED) {
+		if (zpodd_dev_enabled(dev))
+			zpodd_check_zpready(dev);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e3bda07..6f235b9 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2665,6 +2665,10 @@ static void atapi_qc_complete(struct ata_queued_cmd *qc)
 			ata_scsi_rbuf_put(cmd, true, &flags);
 		}
 
+		if (zpodd_dev_enabled(qc->dev) &&
+				scsicmd[0] == GET_EVENT_STATUS_NOTIFICATION)
+			zpodd_snoop_status(qc->dev, cmd);
+
 		cmd->result = SAM_STAT_GOOD;
 	}
 
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index ba8c985..533a39e 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -2,13 +2,21 @@
 #include <linux/cdrom.h>
 #include <linux/pm_runtime.h>
 #include <scsi/scsi_device.h>
+#include <scsi/scsi_cmnd.h>
 
 #include "libata.h"
 
+#define POWEROFF_DELAY  (30 * 1000)     /* 30 seconds for power off delay */
+
 struct zpodd {
 	bool slot:1;
 	bool drawer:1;
 	bool from_notify:1;	/* resumed as a result of acpi notification */
+	bool status_ready:1;	/* ready status derived from media event poll,
+				   it is not accurate, but serves as a hint */
+	bool zp_ready:1;	/* zero power ready state */
+
+	unsigned long last_ready; /* last zero power ready timestamp */
 
 	struct ata_device *dev;
 };
@@ -93,6 +101,114 @@ static bool device_can_poweroff(struct ata_device *ata_dev)
 		return false;
 }
 
+/*
+ * Snoop the result of GET_EVENT_STATUS_NOTIFICATION; the media
+ * status byte has information on media present/door closed.
+ *
+ * This information serves only as a hint, as it is not accurate.
+ * The sense code method will be used when deciding if the ODD is
+ * really zero power ready.
+ */
+void zpodd_snoop_status(struct ata_device *dev, struct scsi_cmnd *scmd)
+{
+	bool ready;
+	char buf[8];
+	struct event_header *eh = (void *)buf;
+	struct media_event_desc *med = (void *)(buf + 4);
+	struct sg_table *table = &scmd->sdb.table;
+	struct zpodd *zpodd = dev->private_data;
+
+	if (sg_copy_to_buffer(table->sgl, table->nents, buf, 8) != 8)
+		return;
+
+	if (be16_to_cpu(eh->data_len) < sizeof(*med))
+		return;
+
+	if (eh->nea || eh->notification_class != 0x4)
+		return;
+
+	if (zpodd->slot)
+		ready = !med->media_present;
+	else
+		ready = !(med->media_present || med->door_open);
+
+	zpodd->status_ready = ready;
+}
+
+/* Test if ODD is zero power ready by sense code */
+static bool zpready(struct ata_device *dev)
+{
+	u8 sense_key, *sense_buf;
+	unsigned int ret, asc, ascq, add_len;
+	struct zpodd *zpodd = dev->private_data;
+
+	ret = atapi_eh_tur(dev, &sense_key);
+
+	if (!ret || sense_key != NOT_READY)
+		return false;
+
+	sense_buf = dev->link->ap->sector_buf;
+	ret = atapi_eh_request_sense(dev, sense_buf, sense_key);
+	if (ret)
+		return false;
+
+	/* sense valid */
+	if ((sense_buf[0] & 0x7f) != 0x70)
+		return false;
+
+	add_len = sense_buf[7];
+	/* has asc and ascq */
+	if (add_len < 6)
+		return false;
+
+	asc = sense_buf[12];
+	ascq = sense_buf[13];
+
+	if (zpodd->slot)
+		/* no media inside */
+		return asc == 0x3a;
+	else
+		/* no media inside and door closed */
+		return asc == 0x3a && ascq == 0x01;
+}
+
+/*
+ * Check ODD's zero power ready status.
+ *
+ * This function is called during ATA port's suspend path,
+ * when the port is not frozen yet, so that we can still make
+ * some IO to the ODD to decide if it is zero power ready.
+ *
+ * The ODD is regarded as zero power ready when it has been in the
+ * zero power ready state for some time (defined by POWEROFF_DELAY).
+ */
+void zpodd_check_zpready(struct ata_device *dev)
+{
+	bool zp_ready;
+	unsigned long expires;
+	struct zpodd *zpodd = dev->private_data;
+
+	if (!zpodd->status_ready) {
+		zpodd->last_ready = 0;
+		return;
+	}
+
+	if (!zpodd->last_ready) {
+		zp_ready = zpready(dev);
+		if (zp_ready)
+			zpodd->last_ready = jiffies;
+		return;
+	}
+
+	expires = zpodd->last_ready + msecs_to_jiffies(POWEROFF_DELAY);
+	if (time_before(jiffies, expires))
+		return;
+
+	zpodd->zp_ready = zpready(dev);
+	if (!zpodd->zp_ready)
+		zpodd->last_ready = 0;
+}
+
 static void zpodd_wake_dev(acpi_handle handle, u32 event, void *context)
 {
 	struct ata_device *ata_dev = context;
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 5d68210..2b46703 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -241,10 +241,14 @@ static inline bool zpodd_dev_enabled(struct ata_device *dev)
 	else
 		return false;
 }
+void zpodd_snoop_status(struct ata_device *dev, struct scsi_cmnd *scmd);
+void zpodd_check_zpready(struct ata_device *dev);
 #else /* CONFIG_SATA_ZPODD */
 static inline void zpodd_init(struct ata_device *dev) {}
 static inline void zpodd_deinit(struct ata_device *dev) {}
 static inline bool zpodd_dev_enabled(struct ata_device *dev) { return false; }
+static inline void zpodd_snoop_status(struct ata_device *dev, struct scsi_cmnd *scmd) {}
+static inline void zpodd_check_zpready(struct ata_device *dev) {}
 #endif /* CONFIG_SATA_ZPODD */
 
 /* libata-atapi.c */
-- 
1.7.12.4



* [PATCH v9 04/10] libata: acpi: move acpi notification code to zpodd
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

Since the ata acpi notification code introduced in commit
3bd46600a7a7e938c54df8cdbac9910668c7dfb0 is solely for ZPODD, and we
now have a dedicated place for it, move that code there.

The add/remove_pm_notifier code is also simplified a little: it no
longer checks whether the handle is NULL or whether a corresponding
acpi_device exists for the handle, as both are already guaranteed by
device_can_poweroff.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/libata-acpi.c  | 71 ----------------------------------------------
 drivers/ata/libata-zpodd.c | 30 ++++++++++++++++++++
 2 files changed, 30 insertions(+), 71 deletions(-)

diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
index 3c61100..6b6819c 100644
--- a/drivers/ata/libata-acpi.c
+++ b/drivers/ata/libata-acpi.c
@@ -970,57 +970,6 @@ void ata_acpi_on_disable(struct ata_device *dev)
 	ata_acpi_clear_gtf(dev);
 }
 
-static void ata_acpi_wake_dev(acpi_handle handle, u32 event, void *context)
-{
-	struct ata_device *ata_dev = context;
-
-	if (event == ACPI_NOTIFY_DEVICE_WAKE && ata_dev &&
-			pm_runtime_suspended(&ata_dev->sdev->sdev_gendev))
-		scsi_autopm_get_device(ata_dev->sdev);
-}
-
-static void ata_acpi_add_pm_notifier(struct ata_device *dev)
-{
-	struct acpi_device *acpi_dev;
-	acpi_handle handle;
-	acpi_status status;
-
-	handle = ata_dev_acpi_handle(dev);
-	if (!handle)
-		return;
-
-	status = acpi_bus_get_device(handle, &acpi_dev);
-	if (ACPI_FAILURE(status))
-		return;
-
-	if (dev->sdev->can_power_off) {
-		acpi_install_notify_handler(handle, ACPI_SYSTEM_NOTIFY,
-			ata_acpi_wake_dev, dev);
-		device_set_run_wake(&dev->sdev->sdev_gendev, true);
-	}
-}
-
-static void ata_acpi_remove_pm_notifier(struct ata_device *dev)
-{
-	struct acpi_device *acpi_dev;
-	acpi_handle handle;
-	acpi_status status;
-
-	handle = ata_dev_acpi_handle(dev);
-	if (!handle)
-		return;
-
-	status = acpi_bus_get_device(handle, &acpi_dev);
-	if (ACPI_FAILURE(status))
-		return;
-
-	if (dev->sdev->can_power_off) {
-		device_set_run_wake(&dev->sdev->sdev_gendev, false);
-		acpi_remove_notify_handler(handle, ACPI_SYSTEM_NOTIFY,
-			ata_acpi_wake_dev);
-	}
-}
-
 static void ata_acpi_register_power_resource(struct ata_device *dev)
 {
 	struct scsi_device *sdev = dev->sdev;
@@ -1053,7 +1002,6 @@ static void ata_acpi_unregister_power_resource(struct ata_device *dev)
 
 void ata_acpi_bind(struct ata_device *dev)
 {
-	ata_acpi_add_pm_notifier(dev);
 	ata_acpi_register_power_resource(dev);
 }
 
@@ -1061,7 +1009,6 @@ void ata_acpi_unbind(struct ata_device *dev)
 {
 	if (zpodd_dev_enabled(dev))
 		zpodd_deinit(dev);
-	ata_acpi_remove_pm_notifier(dev);
 	ata_acpi_unregister_power_resource(dev);
 }
 
@@ -1103,9 +1050,6 @@ static int ata_acpi_bind_device(struct ata_port *ap, struct scsi_device *sdev,
 				acpi_handle *handle)
 {
 	struct ata_device *ata_dev;
-	acpi_status status;
-	struct acpi_device *acpi_dev;
-	struct acpi_device_power_state *states;
 
 	if (ap->flags & ATA_FLAG_ACPI_SATA)
 		ata_dev = &ap->link.device[sdev->channel];
@@ -1117,21 +1061,6 @@ static int ata_acpi_bind_device(struct ata_port *ap, struct scsi_device *sdev,
 	if (!*handle)
 		return -ENODEV;
 
-	status = acpi_bus_get_device(*handle, &acpi_dev);
-	if (ACPI_FAILURE(status))
-		return 0;
-
-	/*
-	 * If firmware has _PS3 or _PR3 for this device,
-	 * and this ata ODD device support device attention,
-	 * it means this device can be powered off
-	 */
-	states = acpi_dev->power.states;
-	if ((states[ACPI_STATE_D3_HOT].flags.valid ||
-			states[ACPI_STATE_D3_COLD].flags.explicit_set) &&
-			ata_dev->flags & ATA_DFLAG_DA)
-		sdev->can_power_off = 1;
-
 	return 0;
 }
 
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index fce6ea6..ba8c985 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -1,11 +1,14 @@
 #include <linux/libata.h>
 #include <linux/cdrom.h>
+#include <linux/pm_runtime.h>
+#include <scsi/scsi_device.h>
 
 #include "libata.h"
 
 struct zpodd {
 	bool slot:1;
 	bool drawer:1;
+	bool from_notify:1;	/* resumed as a result of acpi notification */
 
 	struct ata_device *dev;
 };
@@ -90,6 +93,31 @@ static bool device_can_poweroff(struct ata_device *ata_dev)
 		return false;
 }
 
+static void zpodd_wake_dev(acpi_handle handle, u32 event, void *context)
+{
+	struct ata_device *ata_dev = context;
+	struct zpodd *zpodd = ata_dev->private_data;
+	struct device *dev = &ata_dev->sdev->sdev_gendev;
+
+	if (event == ACPI_NOTIFY_DEVICE_WAKE && pm_runtime_suspended(dev)) {
+		zpodd->from_notify = true;
+		pm_runtime_resume(dev);
+	}
+}
+
+static void acpi_add_pm_notifier(struct ata_device *dev)
+{
+	acpi_handle handle = ata_dev_acpi_handle(dev);
+	acpi_install_notify_handler(handle, ACPI_SYSTEM_NOTIFY,
+				    zpodd_wake_dev, dev);
+}
+
+static void acpi_remove_pm_notifier(struct ata_device *dev)
+{
+	acpi_handle handle = DEVICE_ACPI_HANDLE(&dev->sdev->sdev_gendev);
+	acpi_remove_notify_handler(handle, ACPI_SYSTEM_NOTIFY, zpodd_wake_dev);
+}
+
 void zpodd_init(struct ata_device *dev)
 {
 	int ret;
@@ -113,12 +141,14 @@ void zpodd_init(struct ata_device *dev)
 	else
 		zpodd->slot = true;
 
+	acpi_add_pm_notifier(dev);
 	zpodd->dev = dev;
 	dev->private_data = zpodd;
 }
 
 void zpodd_deinit(struct ata_device *dev)
 {
+	acpi_remove_pm_notifier(dev);
 	kfree(dev->private_data);
 	dev->private_data = NULL;
 }
-- 
1.7.12.4



* [PATCH v9 03/10] ata: zpodd: identify and init ZPODD devices
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

The ODD can be enabled for ZPODD if the following three conditions are
satisfied:
1 The ODD supports device attention;
2 The platform can runtime power off the ODD through ACPI;
3 The ODD is either slot type or drawer type.
For such ODDs, zpodd_init is called and a new structure is allocated to
store ZPODD related state.

The zpodd_dev_enabled function is used to test whether ZPODD is
currently enabled for this ODD.
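
To make condition 3 concrete, here is a small userspace sketch of how
the loading mechanism byte of the removable medium feature descriptor
could be classified. The enum and the function name
classify_loading_mechanism() are invented for this example; the bit
positions follow the big-endian view of struct rm_feature_desc added by
this patch, and the slot/drawer mapping follows
check_loading_mechanism().

```c
#include <stdint.h>

/* Illustrative result type; the patch itself returns 0, 1 or -ENODEV */
enum odd_mech {
	ODD_UNSUPPORTED = -1,
	ODD_SLOT = 0,
	ODD_DRAWER = 1,
};

/*
 * Classify the byte that carries mech_type (bits 7..5), load (bit 4)
 * and eject (bit 3) in the removable medium feature descriptor.
 */
static enum odd_mech classify_loading_mechanism(uint8_t b)
{
	unsigned int mech_type = (b >> 5) & 0x7;
	unsigned int load = (b >> 4) & 0x1;
	unsigned int eject = (b >> 3) & 0x1;

	/* both supported types require load == 0 and eject == 1 */
	if (load != 0 || eject != 1)
		return ODD_UNSUPPORTED;

	if (mech_type == 0)
		return ODD_SLOT;
	if (mech_type == 1)
		return ODD_DRAWER;

	return ODD_UNSUPPORTED;
}
```

Using explicit shifts instead of bitfields keeps the sketch independent
of the host's bitfield ordering, which is why the kernel structure
needs the two __BIG_ENDIAN_BITFIELD/__LITTLE_ENDIAN_BITFIELD variants.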

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/libata-acpi.c  |   2 +
 drivers/ata/libata-core.c  |   4 +-
 drivers/ata/libata-zpodd.c | 124 +++++++++++++++++++++++++++++++++++++++++++++
 drivers/ata/libata.h       |  17 +++++++
 include/uapi/linux/cdrom.h |  34 +++++++++++++
 5 files changed, 180 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-acpi.c b/drivers/ata/libata-acpi.c
index fd9ecf7..3c61100 100644
--- a/drivers/ata/libata-acpi.c
+++ b/drivers/ata/libata-acpi.c
@@ -1059,6 +1059,8 @@ void ata_acpi_bind(struct ata_device *dev)
 
 void ata_acpi_unbind(struct ata_device *dev)
 {
+	if (zpodd_dev_enabled(dev))
+		zpodd_deinit(dev);
 	ata_acpi_remove_pm_notifier(dev);
 	ata_acpi_unregister_power_resource(dev);
 }
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 3cc7096..a2293b6 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2395,8 +2395,10 @@ int ata_dev_configure(struct ata_device *dev)
 			dma_dir_string = ", DMADIR";
 		}
 
-		if (ata_id_has_da(dev->id))
+		if (ata_id_has_da(dev->id)) {
 			dev->flags |= ATA_DFLAG_DA;
+			zpodd_init(dev);
+		}
 
 		/* print device info to dmesg */
 		if (ata_msg_drv(ap) && print_info)
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
index e69de29..fce6ea6 100644
--- a/drivers/ata/libata-zpodd.c
+++ b/drivers/ata/libata-zpodd.c
@@ -0,0 +1,124 @@
+#include <linux/libata.h>
+#include <linux/cdrom.h>
+
+#include "libata.h"
+
+struct zpodd {
+	bool slot:1;
+	bool drawer:1;
+
+	struct ata_device *dev;
+};
+
+static int run_atapi_cmd(struct ata_device *dev, const char *cdb,
+		unsigned short cdb_len, char *buf, unsigned int buf_len)
+{
+	struct ata_taskfile tf = {0};
+
+	tf.flags |= ATA_TFLAG_ISADDR | ATA_TFLAG_DEVICE;
+	tf.command = ATA_CMD_PACKET;
+
+	if (buf) {
+		tf.protocol = ATAPI_PROT_PIO;
+		tf.lbam = buf_len;
+	} else {
+		tf.protocol = ATAPI_PROT_NODATA;
+	}
+
+	return ata_exec_internal(dev, &tf, cdb,
+			buf ? DMA_FROM_DEVICE : DMA_NONE, buf, buf_len, 0);
+}
+
+/*
+ * Per the spec, only slot type and drawer type ODD can be supported
+ *
+ * Return 0 for slot type, 1 for drawer, -ENODEV for other types or on error.
+ */
+static int check_loading_mechanism(struct ata_device *dev)
+{
+	char buf[16];
+	unsigned int ret;
+	struct rm_feature_desc *desc = (void *)(buf + 8);
+
+	char cdb[] = {  GPCMD_GET_CONFIGURATION,
+			2,      /* only 1 feature descriptor requested */
+			0, 3,   /* 3, removable medium feature */
+			0, 0, 0,/* reserved */
+			0, sizeof(buf),
+			0, 0, 0,
+	};
+
+	ret = run_atapi_cmd(dev, cdb, sizeof(cdb), buf, sizeof(buf));
+	if (ret)
+		return -ENODEV;
+
+	if (be16_to_cpu(desc->feature_code) != 3)
+		return -ENODEV;
+
+	if (desc->mech_type == 0 && desc->load == 0 && desc->eject == 1)
+		return 0; /* slot */
+	else if (desc->mech_type == 1 && desc->load == 0 && desc->eject == 1)
+		return 1; /* drawer */
+	else
+		return -ENODEV;
+}
+
+static bool device_can_poweroff(struct ata_device *ata_dev)
+{
+	acpi_handle handle;
+	acpi_status status;
+	struct acpi_device_power_state *states;
+	struct acpi_device *acpi_dev;
+
+	handle = ata_dev_acpi_handle(ata_dev);
+	if (!handle)
+		return false;
+
+	status = acpi_bus_get_device(handle, &acpi_dev);
+	if (ACPI_FAILURE(status))
+		return false;
+
+	/*
+	 * If firmware has _PS3 or _PR3 for this device,
+	 * it means this device can be runtime powered off
+	 */
+	states = acpi_dev->power.states;
+	if (states[ACPI_STATE_D3_HOT].flags.valid ||
+	    states[ACPI_STATE_D3_COLD].flags.explicit_set)
+		return true;
+	else
+		return false;
+}
+
+void zpodd_init(struct ata_device *dev)
+{
+	int ret;
+	struct zpodd *zpodd;
+
+	if (dev->private_data)
+		return;
+
+	if (!device_can_poweroff(dev))
+		return;
+
+	if ((ret = check_loading_mechanism(dev)) == -ENODEV)
+		return;
+
+	zpodd = kzalloc(sizeof(struct zpodd), GFP_KERNEL);
+	if (!zpodd)
+		return;
+
+	if (ret)
+		zpodd->drawer = true;
+	else
+		zpodd->slot = true;
+
+	zpodd->dev = dev;
+	dev->private_data = zpodd;
+}
+
+void zpodd_deinit(struct ata_device *dev)
+{
+	kfree(dev->private_data);
+	dev->private_data = NULL;
+}
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 7148a58..55ad37e 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -230,4 +230,21 @@ static inline void ata_sff_exit(void)
 { }
 #endif /* CONFIG_ATA_SFF */
 
+/* libata-zpodd.c */
+#ifdef CONFIG_SATA_ZPODD
+void zpodd_init(struct ata_device *dev);
+void zpodd_deinit(struct ata_device *dev);
+static inline bool zpodd_dev_enabled(struct ata_device *dev)
+{
+	if (dev->flags & ATA_DFLAG_DA && dev->private_data)
+		return true;
+	else
+		return false;
+}
+#else /* CONFIG_SATA_ZPODD */
+static inline void zpodd_init(struct ata_device *dev) {}
+static inline void zpodd_deinit(struct ata_device *dev) {}
+static inline bool zpodd_dev_enabled(struct ata_device *dev) { return false; }
+#endif /* CONFIG_SATA_ZPODD */
+
 #endif /* __LIBATA_H__ */
diff --git a/include/uapi/linux/cdrom.h b/include/uapi/linux/cdrom.h
index 898b866..bd17ad5 100644
--- a/include/uapi/linux/cdrom.h
+++ b/include/uapi/linux/cdrom.h
@@ -908,5 +908,39 @@ struct mode_page_header {
 	__be16 desc_length;
 };
 
+/* removable medium feature descriptor */
+struct rm_feature_desc {
+	__be16 feature_code;
+#if defined(__BIG_ENDIAN_BITFIELD)
+	__u8 reserved1:2;
+	__u8 feature_version:4;
+	__u8 persistent:1;
+	__u8 curr:1;
+#elif defined(__LITTLE_ENDIAN_BITFIELD)
+	__u8 curr:1;
+	__u8 persistent:1;
+	__u8 feature_version:4;
+	__u8 reserved1:2;
+#endif
+	__u8 add_len;
+#if defined(__BIG_ENDIAN_BITFIELD)
+	__u8 mech_type:3;
+	__u8 load:1;
+	__u8 eject:1;
+	__u8 pvnt_jmpr:1;
+	__u8 dbml:1;
+	__u8 lock:1;
+#elif defined(__LITTLE_ENDIAN_BITFIELD)
+	__u8 lock:1;
+	__u8 dbml:1;
+	__u8 pvnt_jmpr:1;
+	__u8 eject:1;
+	__u8 load:1;
+	__u8 mech_type:3;
+#endif
+	__u8 reserved2;
+	__u8 reserved3;
+	__u8 reserved4;
+};
 
 #endif /* _UAPI_LINUX_CDROM_H */
-- 
1.7.12.4



* [PATCH v9 02/10] ata: zpodd: Add CONFIG_SATA_ZPODD
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

Add a new config option, CONFIG_SATA_ZPODD, which is used to support
SATA based zero power ODD. It depends on ATA_ACPI and selects
BLK_DEV_SR, as the implementation of ZPODD depends on the SCSI sr
driver.

A new file, libata-zpodd.c, is added to host ZPODD related code. It is
empty in this commit.

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/ata/Kconfig  | 12 ++++++++++++
 drivers/ata/Makefile |  1 +
 2 files changed, 13 insertions(+)
 create mode 100644 drivers/ata/libata-zpodd.c

diff --git a/drivers/ata/Kconfig b/drivers/ata/Kconfig
index e08d322..9bcb8fb 100644
--- a/drivers/ata/Kconfig
+++ b/drivers/ata/Kconfig
@@ -58,6 +58,18 @@ config ATA_ACPI
 	  You can disable this at kernel boot time by using the
 	  option libata.noacpi=1
 
+config SATA_ZPODD
+	bool "SATA Zero Power ODD Support"
+	depends on ATA_ACPI
+	select BLK_DEV_SR
+	default n
+	help
+	  This option adds support for SATA ZPODD. It requires support from
+	  both the ODD and the platform and, if enabled, will automatically
+	  power the ODD on/off when certain conditions are satisfied. This
+	  does not impact the user's experience of the ODD; it only saves
+	  power when the ODD is not in use (i.e. no disc inside).
+
 config SATA_PMP
 	bool "SATA Port Multiplier support"
 	default y
diff --git a/drivers/ata/Makefile b/drivers/ata/Makefile
index 9329daf..85e3de4 100644
--- a/drivers/ata/Makefile
+++ b/drivers/ata/Makefile
@@ -107,3 +107,4 @@ libata-y	:= libata-core.o libata-scsi.o libata-eh.o libata-transport.o
 libata-$(CONFIG_ATA_SFF)	+= libata-sff.o
 libata-$(CONFIG_SATA_PMP)	+= libata-pmp.o
 libata-$(CONFIG_ATA_ACPI)	+= libata-acpi.o
+libata-$(CONFIG_SATA_ZPODD)	+= libata-zpodd.o
diff --git a/drivers/ata/libata-zpodd.c b/drivers/ata/libata-zpodd.c
new file mode 100644
index 0000000..e69de29
-- 
1.7.12.4



* [PATCH v9 01/10] scsi: sr: support runtime pm
From: Aaron Lu @ 2012-11-09  6:51 UTC (permalink / raw)
  To: Jeff Garzik, James Bottomley, Rafael J. Wysocki, Alan Stern,
	Tejun Heo
  Cc: Jeff Wu, Aaron Lu, linux-ide, linux-pm, linux-scsi, linux-acpi,
	Aaron Lu
In-Reply-To: <1352443922-13734-1-git-send-email-aaron.lu@intel.com>

This patch adds runtime pm support for sr.

It does this by incrementing the runtime usage_count of the device when:
- its block device is opened;
- the events checking is about to run.

And decrementing the runtime usage_count of the device when:
- its block device is closed;
- the events checking is done.
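
The counting rules above, together with the suspend policy at the end of
sr_check_events(), can be pictured with the toy userspace model below.
All names (struct sr_pm_model, model_open() and so on) are invented for
this sketch; it does not use the real runtime PM API and ignores
locking, errors and autosuspend delays.

```c
#include <stdbool.h>

/* Simplified stand-in for the device's runtime PM state */
struct sr_pm_model {
	int usage_count;     /* models the runtime PM usage counter */
	bool suspended;      /* models the runtime PM status */
	bool media_present;  /* result of the previous poll */
};

/* Taking a reference resumes the device */
static void model_get(struct sr_pm_model *d)
{
	d->usage_count++;
	d->suspended = false;
}

/* Suspend only succeeds when nobody holds a reference */
static void model_try_suspend(struct sr_pm_model *d)
{
	if (d->usage_count == 0)
		d->suspended = true;
}

/* Block device opened: hold a reference until close */
static void model_open(struct sr_pm_model *d)
{
	model_get(d);
}

/* Block device closed: drop the reference and let the device idle */
static void model_close(struct sr_pm_model *d)
{
	d->usage_count--;
	model_try_suspend(d);
}

/* One events poll; media_now stands in for the real polling result */
static void model_check_events(struct sr_pm_model *d, bool media_now)
{
	bool last_present = d->media_present;

	model_get(d);
	d->media_present = media_now;
	d->usage_count--;   /* drop the reference without idling */

	/*
	 * Suspend if there is no medium, or the medium was already there
	 * at the last poll; a freshly inserted medium keeps the device
	 * active for one more poll interval.
	 */
	if (!d->media_present || last_present)
		model_try_suspend(d);
}
```

In this model a newly inserted disc keeps the device active for exactly
one extra poll, and an open block device prevents suspend regardless of
what the poll finds.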

The idea is discussed in this mail thread:
http://thread.gmane.org/gmane.linux.acpi.devel/55243/focus=52703

Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 drivers/scsi/sr.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 5fc97d2..4d1a610 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -45,6 +45,7 @@
 #include <linux/blkdev.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 #include <asm/uaccess.h>
 
 #include <scsi/scsi.h>
@@ -146,7 +147,8 @@ static inline struct scsi_cd *scsi_cd_get(struct gendisk *disk)
 	kref_get(&cd->kref);
 	if (scsi_device_get(cd->device))
 		goto out_put;
-	goto out;
+	if (!scsi_autopm_get_device(cd->device))
+		goto out;
 
  out_put:
 	kref_put(&cd->kref, sr_kref_release);
@@ -162,6 +164,7 @@ static void scsi_cd_put(struct scsi_cd *cd)
 
 	mutex_lock(&sr_ref_mutex);
 	kref_put(&cd->kref, sr_kref_release);
+	scsi_autopm_put_device(sdev);
 	scsi_device_put(sdev);
 	mutex_unlock(&sr_ref_mutex);
 }
@@ -211,7 +214,7 @@ static unsigned int sr_check_events(struct cdrom_device_info *cdi,
 				    unsigned int clearing, int slot)
 {
 	struct scsi_cd *cd = cdi->handle;
-	bool last_present;
+	bool last_present = cd->media_present;
 	struct scsi_sense_hdr sshdr;
 	unsigned int events;
 	int ret;
@@ -220,6 +223,8 @@ static unsigned int sr_check_events(struct cdrom_device_info *cdi,
 	if (CDSL_CURRENT != slot)
 		return 0;
 
+	scsi_autopm_get_device(cd->device);
+
 	events = sr_get_events(cd->device);
 	cd->get_event_changed |= events & DISK_EVENT_MEDIA_CHANGE;
 
@@ -246,10 +251,9 @@ static unsigned int sr_check_events(struct cdrom_device_info *cdi,
 	}
 
 	if (!(clearing & DISK_EVENT_MEDIA_CHANGE))
-		return events;
+		goto out;
 do_tur:
 	/* let's see whether the media is there with TUR */
-	last_present = cd->media_present;
 	ret = scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, &sshdr);
 
 	/*
@@ -270,7 +274,7 @@ do_tur:
 	}
 
 	if (cd->ignore_get_event)
-		return events;
+		goto out;
 
 	/* check whether GET_EVENT is reporting spurious MEDIA_CHANGE */
 	if (!cd->tur_changed) {
@@ -287,6 +291,18 @@ do_tur:
 	cd->tur_changed = false;
 	cd->get_event_changed = false;
 
+out:
+	/*
+	 * If there is no medium detected or the medium has been there
+	 * since last poll, try to suspend the device. Otherwise, keep
+	 * it active for one more poll interval so that if user space
+	 * application opens the block device, we can avoid a runtime
+	 * status change.
+	 */
+	pm_runtime_put_noidle(&cd->device->sdev_gendev);
+	if (!cd->media_present || last_present)
+		pm_runtime_suspend(&cd->device->sdev_gendev);
+
 	return events;
 }
 
@@ -718,6 +734,8 @@ static int sr_probe(struct device *dev)
 
 	sdev_printk(KERN_DEBUG, sdev,
 		    "Attached scsi CD-ROM %s\n", cd->cdi.name);
+	scsi_autopm_put_device(cd->device);
+
 	return 0;
 
 fail_put:
@@ -965,6 +983,8 @@ static int sr_remove(struct device *dev)
 {
 	struct scsi_cd *cd = dev_get_drvdata(dev);
 
+	scsi_autopm_get_device(cd->device);
+
 	blk_queue_prep_rq(cd->device->request_queue, scsi_prep_fn);
 	del_gendisk(cd->disk);
 
-- 
1.7.12.4



* Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
From: Ankita Garg @ 2012-11-09  6:22 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Dave Hansen, akpm, mgorman, mjg59, paulmck, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm,
	linux-mm, linux-kernel
In-Reply-To: <509AC164.1050403@linux.vnet.ibm.com>


Hi Srivatsa,

I understand that you are maintaining the page blocks in region-sorted
order, so that when memory requests come in, you can hand out memory
from the regions in that order. However, do you take the following
scenario into account? In some bucket of the buddy allocator, there
might not be any pages belonging to, let's say, region 0, while the next
higher bucket has them. In that case, instead of handing out memory from
whichever region is present in that bucket, would it be better to go to
the next bucket, split the region 0 pageblock there and allocate from
it? (Here, region 0 is just an example.)
It has been a while since I looked at kernel code, so I might be missing
something!

Regards,
Ankita



On Wed, Nov 7, 2012 at 2:15 PM, Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> wrote:

> On 11/07/2012 03:19 AM, Dave Hansen wrote:
> > On 11/06/2012 11:53 AM, Srivatsa S. Bhat wrote:
> >> This is the main change - we keep the pageblocks in region-sorted order,
> >> where pageblocks belonging to region-0 come first, followed by those
> >> belonging to region-1 and so on. But the pageblocks within a given
> >> region need *not* be sorted, since we need them to be only
> >> region-sorted and not fully address-sorted.
> >> This sorting is performed when adding pages back to the freelists, thus
> >> avoiding any region-related overhead in the critical page allocation
> >> paths.
> >
> > It's probably _better_ to do it at free time than alloc, but it's still
> > pretty bad to be doing a linear walk over a potentially 256-entry array
> > holding the zone lock.  The overhead is going to show up somewhere.  How
> > does this do with a kernel compile?  Looks like exit() when a process
> > has a bunch of memory might get painful.
> >
>
> As I mentioned in the cover-letter, kernbench numbers haven't shown any
> observable performance degradation. On the contrary, (as unbelievable as it
> may sound), they actually indicate a slight performance *improvement* with
> my patchset! I'm trying to figure out what could be the reason behind that.
>
> Going forward, we could try to optimize the sorting logic in the free()
> part, but in any case, IMHO that's the right place to push the overhead to,
> since the performance of free() is not expected to be _that_ critical
> (unlike alloc()) for overall system performance.
>
> Regards,
> Srivatsa S. Bhat
>
>


-- 
Regards,
Ankita
Graduate Student
Department of Computer Science
University of Texas at Austin


* Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
From: Ankita Garg @ 2012-11-09  6:01 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	amit.kachhap, svaidy, thomas.abraham, santosh.shilimkar, linux-pm,
	linux-mm, linux-kernel
In-Reply-To: <20121106195342.6941.94892.stgit@srivatsabhat.in.ibm.com>

Hi Srivatsa,

I understand that you are maintaining the pageblocks in region-sorted
order, so that when memory requests come in, you can hand out memory
from the regions in that order. However, do you take this scenario
into account: in some bucket of the buddy allocator, there might not be
any pages belonging to, let's say, region 0, while the next higher bucket
has them. So, instead of handing out memory from whichever region is
present there, would it make sense to go to the next bucket, split the
region 0 pageblock there, and allocate from it? (Here, region 0 is just an
example.) It has been a while since I looked at kernel code, so I might be
missing something!
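
The difference between the two policies in the question above can be
sketched in a few lines of user-space C. This is only a toy model, not
kernel code: freemap_t, pick_order_first() and pick_region_first() are
hypothetical names, and each free list is reduced to a per-order,
per-region counter. pick_order_first() models "take the first item on
the smallest suitable freelist"; pick_region_first() models the
suggestion of preferring the lowest region even if a larger block has
to be split.

```c
#include <assert.h>

#define NR_ORDERS  4	/* illustrative; not the kernel's MAX_ORDER */
#define NR_REGIONS 3	/* illustrative region count */

/* nr_free[o][r]: free blocks of order o that belong to region r. */
typedef int freemap_t[NR_ORDERS][NR_REGIONS];

/* Lowest-numbered region with a free block at this order, or -1. */
static int lowest_region_at(freemap_t nr_free, int order)
{
	for (int r = 0; r < NR_REGIONS; r++)
		if (nr_free[order][r] > 0)
			return r;
	return -1;
}

/* Smallest suitable order first, then the lowest region within it. */
static int pick_order_first(freemap_t nr_free, int want, int *order_out)
{
	assert(want >= 0 && want < NR_ORDERS);
	for (int o = want; o < NR_ORDERS; o++) {
		int r = lowest_region_at(nr_free, o);
		if (r >= 0) {
			*order_out = o;
			return r;
		}
	}
	return -1;
}

/* Lowest region first, then the smallest order that can serve it. */
static int pick_region_first(freemap_t nr_free, int want, int *order_out)
{
	assert(want >= 0 && want < NR_ORDERS);
	for (int r = 0; r < NR_REGIONS; r++)
		for (int o = want; o < NR_ORDERS; o++)
			if (nr_free[o][r] > 0) {
				*order_out = o;
				return r;
			}
	return -1;
}
```

With one order-0 block in region 1 and one order-1 block in region 0,
the first policy returns region 1 without splitting, while the second
returns region 0 at the cost of splitting the order-1 block.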

Regards,
Ankita


On Tue, Nov 6, 2012 at 1:53 PM, Srivatsa S. Bhat <
srivatsa.bhat@linux.vnet.ibm.com> wrote:

> The zones' freelists need to be made region-aware, in order to influence
> page allocation and freeing algorithms. So in every free list in the zone, we
> would like to demarcate the pageblocks belonging to different memory regions
> (we can do this using a set of pointers, and thus avoid splitting up the
> freelists).
>
> Also, we would like to keep the pageblocks in the freelists sorted in
> region-order. That is, pageblocks belonging to region-0 would come first,
> followed by pageblocks belonging to region-1 and so on, within a given
> freelist. Of course, a set of pageblocks belonging to the same region need
> not be sorted; it is sufficient if we maintain the pageblocks in
> region-sorted-order, rather than a full address-sorted-order.
>
> For each freelist within the zone, we maintain a set of pointers to
> pageblocks belonging to the various memory regions in that zone.
>
> Eg:
>
>     |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
>      ____      ____      ____      ____      ____      ____      ____
> --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->
>
>                  ^                  ^                              ^
>                  |                  |                              |
>                 Reg0               Reg1                          Reg2
>
>
> Page allocation will proceed as usual - pick the first item on the free list.
> But we don't want to keep updating these region pointers every time we allocate
> a pageblock from the freelist. So, instead of pointing to the *first* pageblock
> of that region, we maintain the region pointers such that they point to the
> *last* pageblock in that region, as shown in the figure above. That way, as
> long as there are > 1 pageblocks in that region in that freelist, that region
> pointer doesn't need to be updated.
>
>
> Page allocation algorithm:
> -------------------------
>
> The heart of the page allocation algorithm remains as it is - pick the first
> item on the appropriate freelist and return it.
>
>
> Pageblock order in the zone freelists:
> -------------------------------------
>
> This is the main change - we keep the pageblocks in region-sorted order,
> where pageblocks belonging to region-0 come first, followed by those belonging
> to region-1 and so on. But the pageblocks within a given region need *not* be
> sorted, since we need them to be only region-sorted and not fully
> address-sorted.
>
> This sorting is performed when adding pages back to the freelists, thus
> avoiding any region-related overhead in the critical page allocation
> paths.
>
> Page reclaim [Todo]:
> --------------------
>
> Page allocation happens in the order of increasing region number. We would
> like to do page reclaim in the reverse order, to keep allocated pages within
> a minimal number of regions (approximately).
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation--->                         <---Direction of reclaim
>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
>
>  mm/page_alloc.c |  128 +++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 113 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 62d0a9a..52ff914 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -502,6 +502,79 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>         return 0;
>  }
>
> +static void add_to_freelist(struct page *page, struct list_head *lru,
> +                           struct free_list *free_list)
> +{
> +       struct mem_region_list *region;
> +       struct list_head *prev_region_list;
> +       int region_id, i;
> +
> +       region_id = page_zone_region_id(page);
> +
> +       region = &free_list->mr_list[region_id];
> +       region->nr_free++;
> +
> +       if (region->page_block) {
> +               list_add_tail(lru, region->page_block);
> +               return;
> +       }
> +
> +       if (!list_empty(&free_list->list)) {
> +               for (i = region_id - 1; i >= 0; i--) {
> +                       if (free_list->mr_list[i].page_block) {
> +                               prev_region_list =
> +                                       free_list->mr_list[i].page_block;
> +                               goto out;
> +                       }
> +               }
> +       }
> +
> +       /* This is the first region, so add to the head of the list */
> +       prev_region_list = &free_list->list;
> +
> +out:
> +       list_add(lru, prev_region_list);
> +
> +       /* Save pointer to page block of this region */
> +       region->page_block = lru;
> +}
> +
> +static void del_from_freelist(struct page *page, struct list_head *lru,
> +                             struct free_list *free_list)
> +{
> +       struct mem_region_list *region;
> +       struct list_head *prev_page_lru;
> +       int region_id;
> +
> +       region_id = page_zone_region_id(page);
> +       region = &free_list->mr_list[region_id];
> +       region->nr_free--;
> +
> +       if (lru != region->page_block) {
> +               list_del(lru);
> +               return;
> +       }
> +
> +       prev_page_lru = lru->prev;
> +       list_del(lru);
> +
> +       if (region->nr_free == 0)
> +               region->page_block = NULL;
> +       else
> +               region->page_block = prev_page_lru;
> +}
> +
> +/**
> + * Move pages of a given order from freelist of one migrate-type to another.
> + */
> +static void move_pages_freelist(struct page *page, struct list_head *lru,
> +                               struct free_list *old_list,
> +                               struct free_list *new_list)
> +{
> +       del_from_freelist(page, lru, old_list);
> +       add_to_freelist(page, lru, new_list);
> +}
> +
>  /*
>   * Freeing function for a buddy system allocator.
>   *
> @@ -534,6 +607,7 @@ static inline void __free_one_page(struct page *page,
>         unsigned long combined_idx;
>         unsigned long uninitialized_var(buddy_idx);
>         struct page *buddy;
> +       struct free_area *area;
>
>         if (unlikely(PageCompound(page)))
>                 if (unlikely(destroy_compound_page(page, order)))
> @@ -561,8 +635,10 @@ static inline void __free_one_page(struct page *page,
>                         __mod_zone_freepage_state(zone, 1 << order,
>                                                   migratetype);
>                 } else {
> -                       list_del(&buddy->lru);
> -                       zone->free_area[order].nr_free--;
> +                       area = &zone->free_area[order];
> +                       del_from_freelist(buddy, &buddy->lru,
> +                                         &area->free_list[migratetype]);
> +                       area->nr_free--;
>                         rmv_page_order(buddy);
>                 }
>                 combined_idx = buddy_idx & page_idx;
> @@ -587,14 +663,23 @@ static inline void __free_one_page(struct page *page,
>                 buddy_idx = __find_buddy_index(combined_idx, order + 1);
>                 higher_buddy = higher_page + (buddy_idx - combined_idx);
>                 if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
> -                       list_add_tail(&page->lru,
> -                               &zone->free_area[order].free_list[migratetype].list);
> +
> +                       /*
> +                        * Implementing an add_to_freelist_tail() won't be
> +                        * very useful because both of them (almost) add to
> +                        * the tail within the region. So we could potentially
> +                        * switch off this entire "is next-higher buddy free?"
> +                        * logic when memory regions are used.
> +                        */
> +                       area = &zone->free_area[order];
> +                       add_to_freelist(page, &page->lru,
> +                                       &area->free_list[migratetype]);
>                         goto out;
>                 }
>         }
>
> -       list_add(&page->lru,
> -               &zone->free_area[order].free_list[migratetype].list);
> +       add_to_freelist(page, &page->lru,
> +                       &zone->free_area[order].free_list[migratetype]);
>  out:
>         zone->free_area[order].nr_free++;
>  }
> @@ -812,7 +897,8 @@ static inline void expand(struct zone *zone, struct page *page,
>                         continue;
>                 }
>  #endif
> -               list_add(&page[size].lru, &area->free_list[migratetype].list);
> +               add_to_freelist(&page[size], &page[size].lru,
> +                                       &area->free_list[migratetype]);
>                 area->nr_free++;
>                 set_page_order(&page[size], high);
>         }
> @@ -879,7 +965,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>
>                 page = list_entry(area->free_list[migratetype].list.next,
>                                                         struct page, lru);
> -               list_del(&page->lru);
> +               del_from_freelist(page, &page->lru,
> +                                 &area->free_list[migratetype]);
>                 rmv_page_order(page);
>                 area->nr_free--;
>                 expand(zone, page, order, current_order, area, migratetype);
> @@ -918,7 +1005,8 @@ int move_freepages(struct zone *zone,
>  {
>         struct page *page;
>         unsigned long order;
> -       int pages_moved = 0;
> +       struct free_area *area;
> +       int pages_moved = 0, old_mt;
>
>  #ifndef CONFIG_HOLES_IN_ZONE
>         /*
> @@ -946,8 +1034,11 @@ int move_freepages(struct zone *zone,
>                 }
>
>                 order = page_order(page);
> -               list_move(&page->lru,
> -                         &zone->free_area[order].free_list[migratetype].list);
> +               old_mt = get_freepage_migratetype(page);
> +               area = &zone->free_area[order];
> +               move_pages_freelist(page, &page->lru,
> +                                   &area->free_list[old_mt],
> +                                   &area->free_list[migratetype]);
>                 set_freepage_migratetype(page, migratetype);
>                 page += 1 << order;
>                 pages_moved += 1 << order;
> @@ -1045,7 +1136,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>                         }
>
>                         /* Remove the page from the freelists */
> -                       list_del(&page->lru);
> +                       del_from_freelist(page, &page->lru,
> +                                         &area->free_list[migratetype]);
>                         rmv_page_order(page);
>
>                         /* Take ownership for orders >= pageblock_order */
> @@ -1399,12 +1491,14 @@ int capture_free_page(struct page *page, int alloc_order, int migratetype)
>         if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
>                 return 0;
>
> +       mt = get_pageblock_migratetype(page);
> +
>         /* Remove page from free list */
> -       list_del(&page->lru);
> +       del_from_freelist(page, &page->lru,
> +                         &zone->free_area[order].free_list[mt]);
>         zone->free_area[order].nr_free--;
>         rmv_page_order(page);
>
> -       mt = get_pageblock_migratetype(page);
>         if (unlikely(mt != MIGRATE_ISOLATE))
>                 __mod_zone_freepage_state(zone, -(1UL << order), mt);
>
> @@ -6040,6 +6134,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>         int order, i;
>         unsigned long pfn;
>         unsigned long flags;
> +       int mt;
> +
>         /* find the first valid pfn */
>         for (pfn = start_pfn; pfn < end_pfn; pfn++)
>                 if (pfn_valid(pfn))
> @@ -6062,7 +6158,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>                 printk(KERN_INFO "remove from free list %lx %d %lx\n",
>                        pfn, 1 << order, end_pfn);
>  #endif
> -               list_del(&page->lru);
> +               mt = get_freepage_migratetype(page);
> +               del_from_freelist(page, &page->lru,
> +                                 &zone->free_area[order].free_list[mt]);
>                 rmv_page_order(page);
>                 zone->free_area[order].nr_free--;
>                 __mod_zone_page_state(zone, NR_FREE_PAGES,
>
>
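
The list manipulation in the quoted add_to_freelist() and
del_from_freelist() can be tried out in user space with a small model.
The sketch below is an assumption-laden simplification, not the kernel
code: struct node, struct free_list, fl_add() and fl_del() are
hypothetical names, a plain circular list stands in for list_head, and
last[] plays the role of the per-region page_block pointer. It keeps
the same two rules as the patch: a block joining an already present
region is linked just before that region's last block, and the first
block of a region is linked after the last block of the nearest lower
non-empty region.

```c
#include <assert.h>
#include <stddef.h>

#define NR_REGIONS 4	/* illustrative */

struct node {
	struct node *prev, *next;
	int region;
};

struct free_list {
	struct node head;		/* sentinel of a circular list */
	struct node *last[NR_REGIONS];	/* last block per region, or NULL */
	int nr_free[NR_REGIONS];
};

static void fl_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
	for (int i = 0; i < NR_REGIONS; i++) {
		fl->last[i] = NULL;
		fl->nr_free[i] = 0;
	}
}

/* Link n immediately after pos. */
static void insert_after(struct node *n, struct node *pos)
{
	n->prev = pos;
	n->next = pos->next;
	pos->next->prev = n;
	pos->next = n;
}

static void fl_add(struct free_list *fl, struct node *n)
{
	int r = n->region;
	struct node *pos = &fl->head;

	assert(r >= 0 && r < NR_REGIONS);
	fl->nr_free[r]++;

	if (fl->last[r]) {
		/* Region present: go just before its last block, so the
		 * region pointer does not need to move. */
		insert_after(n, fl->last[r]->prev);
		return;
	}

	/* First block of this region: find the nearest lower non-empty
	 * region, or fall back to the head of the list. */
	for (int i = r - 1; i >= 0; i--)
		if (fl->last[i]) {
			pos = fl->last[i];
			break;
		}
	insert_after(n, pos);
	fl->last[r] = n;
}

static void fl_del(struct free_list *fl, struct node *n)
{
	int r = n->region;

	fl->nr_free[r]--;
	/* Only removing the region's last block touches the pointer. */
	if (n == fl->last[r])
		fl->last[r] = fl->nr_free[r] ? n->prev : NULL;
	n->prev->next = n->next;
	n->next->prev = n->prev;
}
```

Adding blocks in arbitrary region order and then walking from
head.next yields a non-decreasing sequence of region numbers, which is
what lets allocation simply take the first item on the list.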


-- 
Regards,
Ankita
Graduate Student
Department of Computer Science
University of Texas at Austin


* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: Vaidyanathan Srinivasan @ 2012-11-09  5:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srivatsa S. Bhat, akpm, mjg59, paulmck, dave, maxime.coquelin,
	loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
	gargankita, amit.kachhap, thomas.abraham, santosh.shilimkar,
	linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121108180257.GC8218@suse.de>

* Mel Gorman <mgorman@suse.de> [2012-11-08 18:02:57]:

> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > ------------------------------------------------------------

Hi Mel,

Thanks for the detailed review and comments.  The goal of this patch
series is to brainstorm on ideas that enable the Linux VM to record and
exploit memory region boundaries.

The first approach that we had last year (hierarchy) has more runtime
overhead.  This sorted-buddy approach was one of the alternatives
discussed earlier, and we are trying to find out whether the simple
requirement of biasing memory allocations can be achieved with it.

Smart reclaim based on this approach is a key piece we still need to
design.  Ideas from compaction will certainly help.

> > Today memory subsystems offer a wide range of capabilities for managing
> > memory power consumption. As a quick example, if a block of memory is not
> > referenced for a threshold amount of time, the memory controller can decide to
> > put that chunk into a low-power content-preserving state. And the next
> > reference to that memory chunk would bring it back to full power for read/write.
> > With this capability in place, it becomes important for the OS to understand
> > the boundaries of such power-manageable chunks of memory and to ensure that
> > references are consolidated to a minimum number of such memory power management
> > domains.
> > 
> 
> How much power is saved?

On embedded platforms the savings could be around 5%, as discussed in
the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

On larger servers with large amounts of memory the savings could be
higher.  We do not yet have all the pieces together to evaluate this.

> > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > the firmware can expose information regarding the boundaries of such memory
> > power management domains to the OS in a standard way.
> > 
> 
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events? For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
> 
> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.

Correct.  The platform interface part is not included in this patch
set mainly because there is not much design required there.  Each
platform can have code to collect the memory region boundaries from
BIOS/firmware and load them into the Linux VM.  The goal of this patch
set is to brainstorm on how the core VM should use the region
information.
 
> > How can Linux VM help memory power savings?
> > 
> > o Consolidate memory allocations and/or references such that they are
> > not spread across the entire memory address space.  Basically area of memory
> > that is not being referenced, can reside in low power state.
> > 
> 
> Which the series does not appear to do.

Correct.  We need to design the right reclaim strategy for this to
work.  However, having the buddy lists sorted by region could get us
one step closer to shaping the allocations.

> > o Support targeted memory reclaim, where certain areas of memory that can be
> > easily freed can be offlined, allowing those areas of memory to be put into
> > lower power states.
> > 
> 
> Which the series does not appear to do judging from this;
> 
>   include/linux/mm.h     |   38 +++++++
>   include/linux/mmzone.h |   52 +++++++++
>   mm/compaction.c        |    8 +
>   mm/page_alloc.c        |  263 ++++++++++++++++++++++++++++++++++++++++++++----
>   mm/vmstat.c            |   59 ++++++++++-
> 
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.

Correct.  Evaluating different ideas for reclaim will be the next step
before getting into the platform interface parts.

> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent is assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.

This patch set is roughly based on the idea that ACPI MPST will give us
memory region boundaries.  It is not designed to implement all options
defined in the spec.  We have taken the general case where regions do
not overlap, while the memory addresses themselves can be discontiguous.
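
To make the "no overlap, but gaps allowed" assumption concrete, here is
a minimal user-space sketch of a pfn-to-region lookup. It is purely
illustrative: struct mem_region and region_of_pfn() are hypothetical
names that do not come from the patch set, and the table is assumed to
be sorted by start pfn.

```c
#include <assert.h>
#include <stddef.h>

/* One power-manageable region covering the pfn range [start, end). */
struct mem_region {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Return the index of the region containing pfn, or -1 if pfn falls
 * into a gap. Assumes regions[] is sorted by start_pfn and that no two
 * entries overlap, so a binary search is sufficient.
 */
static int region_of_pfn(const struct mem_region *regions, size_t n,
			 unsigned long pfn)
{
	size_t lo = 0, hi = n;

	assert(regions != NULL || n == 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (pfn < regions[mid].start_pfn)
			hi = mid;
		else if (pfn >= regions[mid].end_pfn)
			lo = mid + 1;
		else
			return (int)mid;
	}
	return -1;
}
```

With 512 MB regions and 4 KB pages each region spans 131072 pfns; a
table with a hole between two entries simply reports -1 for pfns in
that hole.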

> It seems to me that superficially the VM implementation for the support
> would have
> 
> a) Involved a tree that managed the overlapping regions (even if it's
>    not hierarchal it feels more sensible) and picked the highest-power-state
>    common denominator in the tree. This would only be allocated if support
>    for MPST is available.
> b) Leave memory allocations and reclaim as they are in the active state.
> c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
>    power but still usable with a latency penalty. This might be a single
>    migrate type but could also be a parallel set of free_area called
>    free_area_lowpower that is only used when free_area is depleted and in
>    the very slow path of the allocator.
> d) Use memory hot-remove for power states where the refresh rates were
>    not constant
> 
> and only did anything expensive in response to an ACPI event -- none of
> the fast paths should be touched.
> 
> When transitioning to the low power state, memory should be migrated in
> a vaguely similar fashion to what CMA does. For low-power, migration
> failure is acceptable. If contents are not preserved, ACPI needs to know
> if the migration failed because it cannot enter that power state.
> 
> For any of this to be worthwhile, low power states would need to be achieved
> for long periods of time because that migration is not free.

In this patch series we are assuming the simple case where the hardware
manages the actual power states and the OS facilitates this by keeping
allocations in fewer memory regions.  As we keep allocations and
references to a region low, it becomes case (c) above.  We are
addressing only a small subset of the above list.

> > Memory Regions:
> > ---------------
> > 
> > "Memory Regions" is a way of capturing the boundaries of power-manageable
> > chunks of memory, within the MM subsystem.
> > 
> > Short description of the "Sorted-buddy" design:
> > -----------------------------------------------
> > 
> > In this design, the memory region boundaries are captured in a parallel
> > data-structure instead of fitting regions between nodes and zones in the
> > hierarchy. Further, the buddy allocator is altered, such that we maintain the
> > zones' freelists in region-sorted-order and thus do page allocation in the
> > order of increasing memory regions.
> 
> Implying that this sorting has to happen in the either the alloc or free
> fast path.

Yes, in the free path. This optimization can actually be delayed in
the free fast path and completely avoided if our memory is full and we
are doing direct reclaim during allocations.

> > (The freelists need not be fully
> > address-sorted, they just need to be region-sorted. Patch 6 explains this
> > in more detail).
> > 
> > The idea is to do page allocation in increasing order of memory regions
> > (within a zone) and perform page reclaim in the reverse order, as illustrated
> > below.
> > 
> > ---------------------------- Increasing region number---------------------->
> > 
> > Direction of allocation--->                         <---Direction of reclaim
> > 
> 
> Compaction will work against this because it uses a PFN walker to isolate
> free pages and will ignore memory regions. If pageblocks were used, it
> could take that into account at least.
> 
> > The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> > lies in the page-free path and not the page-allocation path and hence the
> > critical page allocation paths remain fast.
> 
> Page free can be a critical path for application performance as well.
> Think network buffer heavy alloc and freeing of buffers.
> 
> However, migratetype information is already looked up for THP so ideally
> power awareness would piggyback on it.
> 
> > Moreover, the heart of the page
> > allocation algorithm itself remains largely unchanged, and the region-related
> > data-structures are optimized to avoid unnecessary updates during the
> > page-allocator's runtime.
> > 
> > Advantages of this design:
> > --------------------------
> > 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
> >    hence we avoid its associated problems (like too many zones, extra page
> >    reclaim threads, question of choosing watermarks etc).
> >    [This is an advantage over the "Hierarchy" design]
> > 
> > 2. Performance overhead is expected to be low: Since we retain the simplicity
> >    of the algorithm in the page allocation path, page allocation can
> >    potentially remain as fast as it would be without memory regions. The
> >    overhead is pushed to the page-freeing paths which are not that critical.
> > 
> > 
> > Results:
> > =======
> > 
> > Test setup:
> > -----------
> > This patchset applies cleanly on top of 3.7-rc3.
> > 
> > x86 dual-socket quad core HT-enabled machine booted with mem=8G
> > Memory region size = 512 MB
> > 
> > Functional testing:
> > -------------------
> > 
> > Ran pagetest, a simple C program that allocates and touches a required number
> > of pages.
> > 
> > Below are the statistics from the regions within ZONE_NORMAL, at various sizes
> > of allocations from pagetest.
> > 
> > 	     Present pages   |	Free pages at various allocations        |
> > 			     |  start	|  512 MB  |  1024 MB | 2048 MB  |
> >   Region 0      16	     |   0      |    0     |     0    |    0     |
> >   Region 1      131072       |  87219   |  8066    |   7892   |  7387    |
> >   Region 2      131072       | 131072   |  79036   |     0    |    0     |
> >   Region 3      131072       | 131072   | 131072   |   79061  |    0     |
> >   Region 4      131072       | 131072   | 131072   |  131072  |    0     |
> >   Region 5      131072       | 131072   | 131072   |  131072  |  79051   |
> >   Region 6      131072       | 131072   | 131072   |  131072  |  131072  |
> >   Region 7      131072       | 131072   | 131072   |  131072  |  131072  |
> >   Region 8      131056       | 105475   | 105472   |  105472  |  105472  |
> > 
> > This shows that page allocation occurs in the order of increasing region
> > numbers, as intended in this design.
> > 
> > Performance impact:
> > -------------------
> > 
> > Kernbench results didn't show much of a difference between the performance
> > of vanilla 3.7-rc3 and this patchset.
> > 
> > 
> > Todos:
> > =====
> > 
> > 1. Memory-region aware page-reclamation:
> > ----------------------------------------
> > 
> > We would like to do page reclaim in the reverse order of page allocation
> > within a zone, ie., in the order of decreasing region numbers.
> > To achieve that, while scanning lru pages to reclaim, we could potentially
> > look for pages belonging to higher regions (considering region boundaries)
> > or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> > reclaim candidates.
> > 
> 
> This would disrupt LRU ordering and if those pages were recently
> allocated and you force a situation where swap has to be used then any
> saving in low memory will be lost by having to access the disk instead.
> 
> > 2. Compile-time exclusion of Memory Power Management, and extending the
> > support to also work with other features such as Mem cgroups, kexec etc.
> >  
> 
> Compile-time exclusion is pointless because it'll be always activated by
> distribution configs. Support for MPST should be detected at runtime and
> 
> 3. ACPI support to actually use this thing and validate the design is
>    compatible with the spec and actually works in hardware

This is required to actually evaluate the power saving benefit once we
have candidate implementations in the VM.

At this point we want to look at the overhead of having region
infrastructure in the VM and how that trades off against the
requirements that we can meet.

The first goal is to have memory allocations fill as few regions as
possible when the system's memory usage is well below the total memory.
Next we would like the VM to actively move pages around to cooperate
with platform memory power saving features like notifications or policy
changes.
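
One simple way to quantify that first goal is to count how many regions
contain no allocated pages at all. The helper below is a hypothetical
user-space sketch (struct region_stat and fully_free_regions() are
illustrative names, not part of the patch set) that can be fed the
per-region "present" and "free" page counts reported in the quoted
results table.

```c
#include <assert.h>
#include <stddef.h>

/* Per-region page counts, as in the quoted ZONE_NORMAL table. */
struct region_stat {
	unsigned long present;
	unsigned long nr_free;
};

/* Count regions in which every present page is free. */
static size_t fully_free_regions(const struct region_stat *rs, size_t n)
{
	size_t count = 0;

	assert(rs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (rs[i].present > 0 && rs[i].nr_free == rs[i].present)
			count++;
	return count;
}
```

Applied to the quoted table, regions 2 through 7 are untouched in the
"start" column, and only regions 6 and 7 remain untouched after the
2048 MB allocation.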

--Vaidy



* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-09  2:36 UTC (permalink / raw)
  To: Alan Stern; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211081125470.1280-100000@iolanthe.rowland.org>

On Thu, 2012-11-08 at 12:07 -0500, Alan Stern wrote:
> On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:
> 
> > > > > is it a good idea to allow to set device state to SUSPENDED if the device
> > > > > is disabled?
> > > > 
> > > > No, it is not.  The status should always be ACTIVE as long as usage_count > 0.
> 
> That isn't strictly true, because pm_runtime_get_noresume violates this
> rule.  What the PM core actually does is prevent a transition from the
> ACTIVE state to the SUSPENDING/SUSPENDED state if usage_count > 0,
> _provided_ runtime PM is enabled.  There's no such restriction when it
> is disabled.

Usage count may not be an issue for the end user.  But "on" in the
"control" sysfs file combined with SUSPENDED status can be confusing for
the end user.  Maybe we need to check dev->power.runtime_auto in
pm_runtime_set_suspended().

> BTW, do we need to think about what happens in the case where the
> device _does_ have a driver and for some reason the driver has disabled
> the device for runtime PM?  I would just as soon ignore the issue.
> 
> > > > However, in some cases we actually would like to change the status to
> > > > SUSPENDED when usage_count becomes equal to 0, because that means we can
> > > > suspend (I mean really suspend) the parents of the devices in question
> > > > (and we want to notify the parents in those cases).
> > > 
> > > So do you think Alan Stern's suggestion about forbidden and disabled is
> > > the right way to go?
> > 
> > I'm not really sure about that.
> > 
> > My original idea was that the runtime PM status and usage counter would
> > only matter when runtime PM of a device was enabled.  That leads to
> > problems, though, when we enable runtime PM of a device whose usage
> > counter is greater than zero and status is SUSPENDED.
> 
> That doesn't seem to be a problem.  It can arise without disabling
> runtime PM at all -- just call pm_runtime_get_noresume.

I think pm_runtime_get_noresume() cannot fix the issue.
pm_runtime_set_active() should be invoked before pm_runtime_enable() if
necessary.  That is, the invoker should be responsible for the
consistency between usage_count and the SUSPENDED/ACTIVE status.  And
the API may be a little low-level and error-prone for the invoker
(mainly bus code).

Best Regards,
Huang Ying

> >  Also when the
> > device's status is ACTIVE, but its parent's child count is 0.
> 
> __pm_runtime_set_status prevents this situation from arising.  When the 
> device's status is set to ACTIVE, the parent's child count is 
> incremented.  So this isn't a problem either.
> 
> > It's not very easy to fix this at the core level, though, because we
> > depend on the current behavior in some places.  I'm thinking that
> > perhaps pm_runtime_enable() should just WARN() if things are obviously
> > inconsistent (although there still may be problems, for example, if the
> > parent's child count is 2 when we enable runtime PM for its child, but that
> > child is the only one it actually has).
> 
> I think we should continue the original strategy of ignoring the status
> and usage counter when runtime PM is disabled.  This is definitely the
> easiest and most straightforward approach.  Fixing the problem at hand
> (VGA controllers) by changing the PCI subsystem seems like the simplest
> solution.
> 
> Your revised patch does do the job, except for a few problems.  
> Namely, while local_pci_probe() and pci_device_remove() are running,
> the device _does_ have a driver.  This means that local_pci_probe()
> should not call pm_runtime_get_sync(), for example.  Doing so would
> invoke the driver's runtime_resume routine before calling the driver's
> probe routine!
> 
> The USB subsystem solves this problem by carefully keeping track of the 
> state of the device-driver binding:
> 
> 	Originally the device is UNBOUND.
> 
> 	At the start of the subsystem's probe routine, the state
> 	changes to BINDING.
> 
> 	If the probe succeeds then it changes to BOUND; otherwise
> 	it goes back to UNBOUND.
> 
> 	At the start of the subsystem's remove routine, the state
> 	changes to UNBINDING.  At the end it goes to UNBOUND.
> 
> When the state is anything other than BOUND, the subsystem's runtime PM 
> routines act as though there is no driver.  This works because the 
> subsystem makes sure that the device is ACTIVE with a nonzero usage 
> count before calling the driver's probe or remove routine, so no 
> runtime PM callbacks can occur at these awkward times.
> 
> If PCI adopted this strategy then your new patch would work okay.  I 
> think -- I haven't checked it thoroughly.
> 
> Alan Stern
> 
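
For reference, the binding-state scheme described in the quoted text can be
sketched roughly as below.  This is only an illustration under stated
assumptions, not the USB subsystem's actual code: struct sketch_dev, the enum
names and the subsys_*() helpers are all hypothetical, and the usage_count
field merely stands in for the real runtime PM usage counter.

```c
#include <stdbool.h>

/* Hypothetical binding states, named after the description above. */
enum bind_state { UNBOUND, BINDING, BOUND, UNBINDING };

/* Hypothetical device; not a real kernel structure. */
struct sketch_dev {
	enum bind_state state;
	int usage_count;	/* stand-in for the runtime PM usage counter */
	int driver_resumes;	/* times the driver's resume callback ran */
};

/*
 * Subsystem-level runtime resume.  Unless the device is fully BOUND,
 * behave as though there is no driver and skip the driver callback.
 */
static void subsys_runtime_resume(struct sketch_dev *d)
{
	if (d->state == BOUND)
		d->driver_resumes++;
}

/* Subsystem probe: hold the device active while the driver probes. */
static int subsys_probe(struct sketch_dev *d, bool driver_probe_ok)
{
	d->state = BINDING;
	d->usage_count++;		/* nonzero count keeps the device ACTIVE */
	subsys_runtime_resume(d);	/* no driver callback: state is BINDING */
	d->state = driver_probe_ok ? BOUND : UNBOUND;
	d->usage_count--;
	return driver_probe_ok ? 0 : -1;
}

/* Subsystem remove: the same protection while the driver unbinds. */
static void subsys_remove(struct sketch_dev *d)
{
	d->state = UNBINDING;
	d->usage_count++;
	subsys_runtime_resume(d);	/* no driver callback: state is UNBINDING */
	d->usage_count--;
	d->state = UNBOUND;
}
```

The point of the sketch is that a resume request arriving during probe or
remove never reaches the driver, because the state is not BOUND at those
times.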


* Re: [patch] acpi, pm: fix build breakage
From: Rafael J. Wysocki @ 2012-11-08 21:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Aaron Lu, Huang Ying, Len Brown, Lv Zheng, Adrian Hunter,
	linux-kernel, linux-pm, linux-acpi
In-Reply-To: <alpine.DEB.2.00.1211081119350.8749@chino.kir.corp.google.com>

On Thursday, November 08, 2012 11:20:01 AM David Rientjes wrote:
> Commit b87b49cd0efd ("ACPI / PM: Move device PM functions related to sleep 
> states") declared acpi_target_system_state() for CONFIG_PM_SLEEP whereas 
> it is only defined for CONFIG_ACPI_SLEEP, resulting in the following link 
> error:
> 
> drivers/built-in.o: In function `acpi_pm_device_sleep_wake':
> drivers/acpi/device_pm.c:342: undefined reference to `acpi_target_system_state'
> drivers/built-in.o: In function `acpi_dev_suspend_late':
> drivers/acpi/device_pm.c:501: undefined reference to `acpi_target_system_state'
> drivers/built-in.o: In function `acpi_pm_device_sleep_state':
> drivers/acpi/device_pm.c:221: undefined reference to `acpi_target_system_state'
> 
> Define it only for CONFIG_ACPI_SLEEP and fall back to a dummy definition 
> for other configs.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

I've received the patch and will apply it when I get back home from Spain.

Thanks,
Rafael


> ---
>  include/acpi/acpi_bus.h |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/acpi/acpi_bus.h b/include/acpi/acpi_bus.h
> --- a/include/acpi/acpi_bus.h
> +++ b/include/acpi/acpi_bus.h
> @@ -469,11 +469,9 @@ static inline int acpi_pm_device_run_wake(struct device *dev, bool enable)
>  #endif
>  
>  #ifdef CONFIG_PM_SLEEP
> -u32 acpi_target_system_state(void);
>  int __acpi_device_sleep_wake(struct acpi_device *, u32, bool);
>  int acpi_pm_device_sleep_wake(struct device *, bool);
>  #else
> -static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
>  static inline int __acpi_device_sleep_wake(struct acpi_device *adev,
>  					   u32 target_state, bool enable)
>  {
> @@ -485,6 +483,12 @@ static inline int acpi_pm_device_sleep_wake(struct device *dev, bool enable)
>  }
>  #endif
>  
> +#ifdef CONFIG_ACPI_SLEEP
> +u32 acpi_target_system_state(void);
> +#else
> +static inline u32 acpi_target_system_state(void) { return ACPI_STATE_S0; }
> +#endif
> +
>  static inline bool acpi_device_power_manageable(struct acpi_device *adev)
>  {
>  	return adev->flags.power_manageable;
> 
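
The declaration-or-stub pattern used by the patch above can be shown in
isolation as follows.  This is a minimal sketch, not the header itself:
SKETCH_CONFIG_SLEEP, STATE_S0 and the sketch_*() names are hypothetical
stand-ins for the Kconfig symbol, ACPI_STATE_S0 and the real functions.

```c
#define STATE_S0 0u	/* hypothetical stand-in for ACPI_STATE_S0 */

#ifdef SKETCH_CONFIG_SLEEP
/* Real implementation is built elsewhere, only when the option is set. */
unsigned int sketch_target_system_state(void);
#else
/* Dummy definition: without sleep support the target is always S0. */
static inline unsigned int sketch_target_system_state(void)
{
	return STATE_S0;
}
#endif

/* A caller compiles and links the same way in both configurations. */
static int sketch_is_sleep_transition(void)
{
	return sketch_target_system_state() != STATE_S0;
}
```

Guarding the declaration with the same symbol that guards the definition is
what avoids the undefined references: callers never see a prototype for a
function that was not built.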
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
